Friday, May 27, 2011

CA Legislation New site

CA legislation transformed (w. new website--check it out!) If you want to skip the more technical post below altogether, just go to calaw.tabulaw.com. It has no styling or search function yet, but compare the navigational flow to California's official legislative site: http://www.leginfo.ca.gov/calaw.html

How to Convert All Files in a Directory: CA Legislation

Starting with the unstructured data in California's legislation, it takes many steps to add structure to a single Section. Or rather, to add back in the metadata that the Section's original drafters intended, to help a reader understand and navigate the law. The next step is to apply the transformations to all of the Sections in the law.
California helpfully makes all of its codes available for FTP download in a set of nested folders. It would be great if more government agencies made their data available in bulk. But we still have a problem: how to recursively iterate through all the files and folders in the directory (29 folders, 50,000 sections in total) and apply the parsing transformations to each file. Each file consists of a (variable) number of sections, e.g. here.

For this task, I went back to another old Linux utility: find. If you type "find /" from a command prompt in Linux (or Mac OS), you get a list of all of the files and folders on your computer. Don't do this. It will take a long time, and is not really useful for anything. But you can use this powerful command within a single directory, and send the list of file names to a program that will operate on each one. In this case, I wrapped this all in a Python program, using the Popen() function from the subprocess module to run any Linux commands that I wanted. Gory details below the fold.
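As a rough illustration of the "visit every file and transform it" step, here is a minimal Python sketch. It is not the script used for the site: it swaps in Python's own os.walk for the find-plus-Popen() combination described above, it assumes the files can be read as UTF-8, and convert_tree and its transform argument are hypothetical names standing in for the real parsing steps.

```python
import os


def convert_tree(root, transform):
    """Apply transform(text) -> text to every file under root, in place.

    Returns the number of files that were actually rewritten.
    """
    rewritten = 0
    # os.walk descends into every nested folder, much like "find root -type f".
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:  # assumption: UTF-8 text
                original = f.read()
            converted = transform(original)
            if converted != original:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(converted)
                rewritten += 1
    return rewritten
```

Calling convert_tree("/path/to/codes", some_function) would run some_function over the text of each file. Working on a copy of the downloaded folders is the safer choice, since the files are overwritten.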
CA Codes After

If you want to skip the details and go straight to the results, I've put the newly transformed California code sections on a website (calaw.tabulaw.com). Currently, the design is very simple and has no styling whatsoever. But I welcome you to do a before and after comparison and let me know what you think in the comments.

In my view, converting CA Legislation to structured data makes navigating the code much easier. It also reveals some problems with the version on California's website--repeated sections, stray text markings--that should probably be cleaned up. More about these anomalies, and the brave new world that structured data can bring to law, in future posts.

Tuesday, May 24, 2011

How to Convert Citations to Hyperlinks: CA Laws

Steps 3 and 4 in converting the California legal Codes to structured HTML involve identifying references within the text (e.g. "pursuant to Section 480" or "under Section 15000 of the Vehicle Code"). This presents two challenges: (1) identifying the correct Code (the high level subject matter of the law), and (2) identifying the section in that Code.
This becomes more complex than it would seem, because California's legislature uses a variety of different forms to refer to other Sections and Codes. The most straightforward is of the form, "Section X of the Y Code". But there are many, many variants. An example:
"pursuant to the provisions of Part 2.5 (commencing with Section 18901) of Division 13 of the Health and Safety Code"
To deal with these variations, I started by identifying all Code references. I used the Linux sed utility to do this and to enclose each Code reference with html tags. This is a simplified version of the RegEx for one Code:
s_Health and Safety Code_<a href="/Code-hsc">Health and Safety Code</a>_
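For readers more comfortable with Python than sed, the same kind of substitution can be sketched as follows. This is only an illustration, not the script used here: link_codes and CODE_PATHS are hypothetical names, and the table holds just the two paths that appear in this post (a full version would need an entry for each of the 29 Codes).

```python
import re

# Only these two paths appear in the post; the rest would have to be filled in.
CODE_PATHS = {
    "Health and Safety Code": "/Code-hsc",
    "Government Code": "/Code-gov",
}

# Try longer names first so that a short name never matches inside a longer one.
CODE_RE = re.compile(
    "|".join(re.escape(name) for name in sorted(CODE_PATHS, key=len, reverse=True))
)


def link_codes(text):
    """Wrap each known Code name in an <a> tag pointing at that Code."""
    def wrap(match):
        name = match.group(0)
        return '<a href="{}">{}</a>'.format(CODE_PATHS[name], name)
    return CODE_RE.sub(wrap, text)
```

Like the sed rule, this should be run only once per file: a second pass would wrap the already-linked names again.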
To identify the Section number(s), I compiled a list of the most common forms of reference, and created a regular expression for each. There is an additional problem, though: many of the references contain several subreferences and span more than one line of the text:
pursuant to Chapter 3.5 (commencing with Section
11340), Chapter 4 (commencing with Section 11370), or Chapter 5
(commencing with Section 11500), of Part 1 of Division 3 of Title 2
of the Government Code
Hmm. A worthy challenge.
The Chapter, Part, Division and Title references do not seem to add any independent information for our purposes. So I look for, and skip over, anything of the form [Part OR Division OR Title] [number] of [Part OR Division OR Title]...
Now we have:
Section
11340),...Section 11370),...
...Section 11500),...
of the <a href="/Code-gov">Government Code</a>
With the Code reference previously identified, we can now focus on finding the various Section references, and associating them with the right Code. I go into a bit more technical detail on this after the fold and in the next post on how I put it all together to run through all of the Code sections (18k files; 50k sections) in one sitting.
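To make the idea concrete, here is a minimal Python sketch of the association step, under some loud assumptions. It assumes the Code names are already wrapped in links, as in the Government Code example above, and it simply ties each "Section N" to the first Code link that follows it. The link_sections name and the "#sec-N." anchor format are hypothetical, and the "first Code that follows" heuristic will be wrong for references to a Section of the current Code.

```python
import re

SECTION_RE = re.compile(r"Section\s+(\d+(?:\.\d+)?)")
CODE_LINK_RE = re.compile(r'<a href="(/Code-[a-z]+)">')


def link_sections(text):
    """Link every 'Section N' to the first Code link that appears after it."""
    # Collapse line breaks so a reference split across lines reads as one run.
    flat = " ".join(text.split())

    def wrap(match):
        code = CODE_LINK_RE.search(flat, match.end())
        if code is None:
            return match.group(0)  # no Code named later: leave it unlinked
        # Hypothetical URL layout: Code path plus a "#sec-N." anchor.
        return '<a href="{}#sec-{}.">{}</a>'.format(
            code.group(1), match.group(1), match.group(0))

    return SECTION_RE.sub(wrap, flat)
```

Run on the four-line Government Code example above, this links Sections 11340, 11370 and 11500 to "/Code-gov", while a bare "pursuant to Section 480" with no Code named after it is left alone.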

Monday, May 23, 2011

How to: Convert Sections Into Hyperlink Targets

How to find section headings in a text document and convert them to targets for hyperlinks?

If you have ever had this burning question, you'll want to read on. Or you can take my word for it that it would have been better for this information to be included in the documents when they were originally published.

This post describes Step 2 of 5 to convert California statutes to structured html: Identify section, subsection and subdivision headings. To do this, I am using an old (1970s) Unix program called "sed" (stream editor), which is also available on Linux.

There are lots of ways to do this using more modern programming languages, but sed has the advantages that it is VERY fast, and it has built in the operations of opening, editing and closing a file. It's basically a "find and replace" function on steroids, without the need for Congressional hearings.

I must admit that once I got the hang of sed, and its improved cousin, "Super Sed", it was pretty addictive: with one command, you can change all capital letters in a document to lower case, or replace all vowels with a *, or mark all numbers and letters at the beginning of a paragraph as section and subsection headings. Sed goes through a file one line at a time and makes these substitutions, and there are a number of other things it can do along the way. If this sounds like fun to you, look here for a good tutorial.

I was working with California state statutes, which I had earlier converted to html. Fortunately, the statute text has a very regular structure: sections, subdivisions and other levels of the document were marked at the beginning of lines, with consistent spacing setting them apart.

So to find the section headings, I just needed to create a set of rules (using RegEx), that describe each kind of section heading. California statutes use headings with the following levels:

100
100.1
100.1 (a)
100.1 (a) (1)

So I needed to describe each of these section headings in a way that they could be identified and separated from any other numbers and letters that are found within the statutes. Here's an example of a rule that does this:

s_^<p>([1-9]\d*)\._<p><span class="section level1" id="sec-\1\.">\1\.<\/span>_

It looks gory, but is actually pretty tame. In essence, it says to substitute (s_) any number at the beginning of a line (^) and beginning of a paragraph (<p>) with a label (<span>) that will identify this number as a section heading. Each kind of heading requires another rule to describe it, and then all of these rules are applied to the file using the ssed (Super Sed) command. The result converts a section heading like this:

<p>15210. Notwithstanding any other provision of this code, as used in

to something like this:

<p><span class="section level1" id="sec-15210.">15210.</span> Notwithstanding any other provision of this code, as used in
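For comparison, the same rule can be written in Python. This is a sketch of the one rule shown above, not a replacement for the full set of ssed rules; mark_sections is a hypothetical name.

```python
import re

# Same pattern as the sed rule: a number and a period right after <p>
# at the start of a line.
SECTION_HEADING = re.compile(r"^<p>([1-9]\d*)\.", re.MULTILINE)


def mark_sections(html):
    """Wrap each top-level section number in a span that can serve as a link target."""
    return SECTION_HEADING.sub(
        r'<p><span class="section level1" id="sec-\1.">\1.</span>', html)
```

As with the sed version, this handles only whole-number headings. A decimal section, and each subdivision level such as (a) or (1), would need a rule of its own.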

Not rocket science, but one step closer to structured data. The <span> will allow us to separate out this section from the rest of the text in order, for example, to link to this section from another section that references it.

The next step is to find all of the references to other sections that are found inside the statute text and to place links from those references to the sections they refer to. Unfortunately, because those references may cross over more than one line, it is harder to use a line-by-line editor such as sed to do the job. For this, I put together a short search and replace program in the Python programming language, which is more flexible and has a lot of tools for working with text. That will be step 3 in the 5 step process, for a future post.

As I mentioned earlier, I will be publishing the final scripts on Github, and will be publishing the hyperlinked version of California legislative information. And hopefully this can inspire California's legislature to publish the statutes in a structured data format to begin with, which can be combined with the OpenStates data to make it easier to see the changes that would be made by any proposed legislation.


Wednesday, May 18, 2011

How to convert Text to HTML: Using txt2html Perl Module

To convert CA statutes from text to html, I used the txt2html module, written in Perl by Kathryn Andersen. This was step 1 in the 5-step process I used to reintroduce metadata to the statutes.
While there are a number of free online text to html converters, and some commercial packages you can download, txt2html is a fast, flexible and accurate open source option. It also works from the command-line, so I could combine it with the next steps I needed, to identify section headings, etc.
After installing the module on my Mac (described below), I used the following command, all in one line, from the Terminal:
txt2html --explicitheadings --indentparbreak --maketables --make_anchors --xhtml --outfile /path/to/file.html /path/to/file
What does this do? The "/path/to/file" needs to be substituted with the actual path to the text file you want to convert and "/path/to/file.html" is the name and path of the new document you want to create.
One of the advantages of this module for converting to html, is that it has a good assortment of options. I found that the ones above were needed in order to keep the structure of the text file in place. For example, without these options, some of the subdivisions of a statute got collapsed with earlier subdivisions into the same paragraph.
Here is a before and after of a section of the text:

Before:
15210. Notwithstanding any other provision of this code, as used in
this chapter, the following terms have the following meanings:
(a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations,
which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
(b) (1) "Commercial motor vehicle" means any vehicle or
After:
<p>15210. Notwithstanding any other provision of this code, as used in this chapter, the following terms have the following meanings:
<br/>&nbsp;&nbsp;&nbsp;(a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations, which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
<br/>&nbsp;&nbsp;&nbsp;(b) (1) "Commercial motor vehicle" means any vehicle or
This did everything I wanted, and nothing I didn't. Installation instructions after the fold.

Monday, May 16, 2011

California Laws: Converting Plain Text to HTML

How to convert legislation from plain text to structured html? For those more interested in results than process, you're in luck. This post focuses primarily on the results of the transformation.

I am working here with California's statutes, published in plain text on the legislature's website. There are many layers of meaning that could be added to the raw text, and as a start, I'm focusing on elements that make reading and navigating the statute easier. In particular:
  • identifying where sections, subdivisions and other elements start and end and
  • adding hyperlinks from a reference to the section referenced (adding a hyperlink from references like this: "as defined in Section 203 of the Government Code").
For example, here are sections of the CA Vehicle Code that set out definitions for the sections that follow. Here are the same sections in html, after the transformations described below.
Before (no links)
After (now with links)


Nothing earth-shattering, but for even this level of metadata, it took a number of steps to add the structural information back in to the statutes (see an outline of the process below the fold). After a bit more polishing, I will upload my scripts to Github, in the hopes that my hacks can be improved upon.

For those who want to skip straight to the conclusion, here it is: automated transformations can add back in much of the metadata that is needed to navigate statutes. But the automated methods will not catch all of the relevant information--even all of the relevant references to other primary legal sources. To add the rest of this information into a public domain electronic format will require (a) that governments publish the data in a structured format to begin with, (b) a Wikipedia-like platform for expert crowdsourcing of legal sources, (c) a fundamental change in the current pay model for publishing of legal information or (d) all of the above.

Now to see, in more detail, what was gained from this first layer of transformations of the text.
What works:
  • Sections (e.g. 15210.), subdivisions (e.g. 15210(b)) and sub-subdivisions (e.g. 15210(b)(1)) identified.
  • References to each of the 29 California Codes are linked.
  • Most references to other Sections are hyperlinked.
What doesn't (yet) work:
  • I haven't yet posted the linked documents online.
  • Further subdivisions (e.g. 15210(b)(2)(A)) have not yet been identified in the text.
  • The parser does not yet recognize some forms of reference to other Sections. E.g. where the reference is set out as a list of three or more: "in the manner described under Section 2800.1, 2800.2 or 2800.3..."
  • References to separate legislative Acts are not linked (e.g. "the Commercial Motor Vehicle Safety Act")
  • References outside the CA Codes are not yet linked, e.g. references to U.S. Federal statute or regulations.
To address the points above will require a few more layers of filtering.
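One such layer, for the list-style references mentioned above ("Section 2800.1, 2800.2 or 2800.3"), might look like the following Python sketch. It is not part of the current parser; section_numbers is a hypothetical name, and the pattern is deliberately simple, so it could pick up a stray number that happens to follow a Section reference and a comma.

```python
import re

NUM = r"\d+(?:\.\d+)?"  # a section number such as 2800 or 2800.1

# "Section" or "Sections", one number, then any further numbers joined by
# commas and/or the words "or" / "and".
SECTION_LIST = re.compile(
    r"Sections?\s+("
    + NUM
    + r"(?:\s*,\s*(?:(?:or|and)\s+)?" + NUM
    + r"|\s+(?:or|and)\s+" + NUM
    + r")*)"
)


def section_numbers(text):
    """Return every section number named in 'Section A, B or C' style lists."""
    found = []
    for match in SECTION_LIST.finditer(text):
        found.extend(re.findall(NUM, match.group(1)))
    return found
```

Each number returned could then be linked in the same way as a single Section reference.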

Friday, May 13, 2011

California Law: Recovering Meaning and Metadata with RegEx


In a previous post, I mentioned some of the challenges in recovering meaningful structural information (titles, paragraphs) from pdfs, and why government entities should retain this information when they publish electronic documents.
I'll have more to say in future posts about what information is important to retain, which, at a minimum, should include document structure (titles, sections, paragraphs and other meaningful divisions) as well as references to other documents (statutes, Constitutional provisions, court decisions, etc.). This does not even touch on the meaning of the documents, but at least makes it possible to more easily navigate electronic documents.
As a motivation for my next (technical) post on how to recover some of this information from existing plain text legislation using a variety of open source Linux utilities, Perl packages and Python functions, I'll take a look at a section from California's legislation.
California's legislation is divided into 29 codes, which can be searched from this quaint official web site from California's Legislature. The statutes themselves are posted in plain text, so a visitor to the site has to do repeated searches in order to assemble all of the references necessary to make sense of any given section of the Code.
As an example, here's an apparently simple exercise for using the CA code site:
What does a visitor to California need in order to legally drive a vehicle in the state without a California driver's license? (No Googling allowed!) One of the provisions of the relevant CA statute has 3 external references, which I've identified with italics:
(b) Any person entitled to the exemption contained in subdivision (a), while operating, within this state, a commercial vehicle, as
defined in subdivision (b) of Section 15210, shall have in his or her
possession a current medical certificate of a type described in
subdivision (c) of Section 12804.9, which has been issued within two
years of the date of operation of that vehicle.
How to make sense of this?
Wouldn't it be nice to have links to these references, at least, to know what definitions are being cited? My next post will discuss the multi-step process to identify these references and add hyperlinks through a sequence of "search and replace" functions using RegEx. Once this link information is added, navigating and analyzing the law is still not a walk in the park, but it becomes more manageable:
(b) Any person entitled to the exemption contained in subdivision (a), while operating, within this state, a commercial vehicle, as
defined in subdivision (b) of Section 15210, shall have in his or her
possession a current medical certificate of a type described in
subdivision (c) of Section 12804.9, which has been issued within two
years of the date of operation of that vehicle.
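As a preview of what one of those "search and replace" steps could look like, here is a minimal Python sketch for references of the form "subdivision (b) of Section 15210". The link_subdivisions name and the "#sec-15210.b" anchor format are hypothetical, chosen only for illustration, and the pattern assumes the whole reference sits on one line.

```python
import re

SUBDIV_REF = re.compile(r"subdivision \(([a-z])\) of Section (\d+(?:\.\d+)?)")


def link_subdivisions(text):
    """Wrap 'subdivision (x) of Section N' in a link to a per-subdivision anchor."""
    # \2 is the section number, \1 the subdivision letter, \g<0> the whole match.
    return SUBDIV_REF.sub(r'<a href="#sec-\2.\1">\g<0></a>', text)
```

A reference to a subdivision of the current section, such as "subdivision (a)" on its own, is left untouched by this pattern and would need a separate rule.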
As a final note: The major legal publishers (WestLaw, LexisNexis) provide this kind of link information in their commercial databases for lawyers. And Cornell's LII has added links to navigate between references on the LII version of the Code, a major step forward for public access. But even better will be when Congress (and eventually the states) includes these reference links when they first publish legislation. Another reason why this announcement is significant.

Friday, May 6, 2011

Better Access to State Legislatures: Sunlight Foundation's Open States Project

James Turk at the Sunlight Foundation announced important progress today in their Open States Project, to make state legislative information available in e-friendly formats.  They have now brought on line data from New York and Hawaii, bringing their total to 25 states and well on the way to their goal of making all state legislative information available through Open States by 2012.


Open States is an open source collaboration, primarily of programmers, and I've been watching their work for a while now.  Our efforts at Tabulaw in the legislative arena are focused on converting existing statutes and laws into e-friendly formats and providing an accessible user interface for this information.  It will be powerful to cross-link the data on bills and legislative proposals that Open States collects with already enacted state legislation, to show how the proposals would change existing law, and what impact the proposals will have.

Congratulations to Sunlight and the Open States team on this milestone!

Wednesday, May 4, 2011

Better Access to Court Opinions: GPO Announces Pilot

Today, the Government Printing Office (GPO), through its digital printing arm, FDsys, announced a pilot project in twelve Federal Courts to publish electronic court opinions.  Read the announcement here (pdf) [update: here's the announcement from the U.S. Courts website, with links].

Court opinions are already available from the courts' websites, and -- for those with an access account -- from PACER, the Federal courts' electronic filing system.  The difference now, presumably, is that GPO will introduce some uniformity to the electronic format for published court opinions.

That is a good thing.  Even better will be for these opinions to include metadata about the document structure.  As I discussed yesterday, most court opinions today are published online in pdf format, scrambling much of the information about document structure and losing much of the value from publishing in an electronic form.

Tuesday, May 3, 2011

Losing Data in PDF: All the King's Sources

A quick exercise:
  1. Find the Supreme Court opinion in AT&T Mobility v. Concepcion, (hint: look here), the recent case on contracts that block class action suits,
  2. Find all of the (nearly 30) briefs that were submitted to the Court (hint: look here), and
  3. Determine which arguments from the briefs were discussed in the Court opinion.
This kind of information -- which arguments attracted the Court's attention and how those arguments were treated -- is valuable to litigators going before the Court and to anyone interested in seeing how the Court arrives at its decisions. It's also the kind of project that should lend itself well to computerized analysis.
While this exercise can certainly be done by brute force, by reading each brief and comparing it to the final Court decision, it should also be possible to use software to compare the text and sections of each brief with the final opinion. You can imagine a number of ways of doing this: compare documents at the individual word level, compare section titles, compare case references, etc. However, as soon as you set about to write a program to make these comparisons, you are confronted with a problem: the documents that are available from the Court website, or the American Bar Association (where the briefs are found), are all in Adobe pdf format. Not so bad, you think. These were originally electronic documents, not scanned images, so they were encoded in pdf with their text. Just extract the text and work from there. It turns out not to be that easy. When you scratch beneath the surface of a pdf, you see that it is mainly a graphic representation of the document. A great deal of the structure of the document is simply not encoded: sections, citations, paragraphs -- even the difference between footnotes and the main text -- are all gone.
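Setting the extraction problem aside for a moment, the simplest of the comparisons mentioned above, at the individual word level, could be sketched in Python as follows. This is an illustration only: it uses Jaccard similarity of the two word sets, which is just one of many possible measures, and words and overlap are hypothetical names.

```python
import re


def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap(brief, opinion):
    """Jaccard similarity of the two documents' word sets, from 0.0 to 1.0."""
    a, b = words(brief), words(opinion)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Scoring each brief against the opinion this way gives only a crude ranking. Comparing section titles or case references would follow the same pattern with a different extraction function in place of words.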
You can see that by trying to convert a pdf to text or to web format (html). Google Documents has a nice feature that does this, and here is Google's web-converted version of the opinion above (AT&T Mobility v. Concepcion). The way that Google presents the converted document shows the original pdf image of each page, followed by the converted version. A few items jump out from the first page of converted text. Words that were divided at the end of a line in the original, e.g. 'uncon- scionable', are still broken even in the middle of the paragraph. Text formatting, such as italics for case citations, is gone, and formatting of some paragraphs has been significantly disrupted: the top paragraph on page 3 is right-justified in the Google Docs version. Even more problematic is the title section, where the names of the Supreme Court justices are broken up:

Key information about the case--who joined which opinions--has been lost. This information can be recovered in a variety of ways, including by manually coding the vote of each justice in the case, but how wasteful, considering that all of that information was available in the original (electronic) version of the document. In fact, the original sets out the Justices' names in all caps to set them apart visually:

Ironically, the Court's extra effort to provide a distinctive visual layout that highlights the Justices' names actually breaks Google's algorithm for parsing the pdf text. With a little bit of forethought, the Court could preserve both the layout and the key structural information, to make their opinions more accessible to the general public, as well as to meet Federal government accessibility standards. (Though these standards are not directly binding on the courts--another sad irony.)
So, for now, we have the technical challenge of converting pdfs to structured text, which is tough enough. Google Documents misses many of the most important text features, and in another post I will discuss other (imperfect) options to do pdf to text conversions, including the pdftotext and pdftohtml programs, and the open source Apache pdfBox and Tika projects.
But for a lawyer, or anyone who cares about the "official" or binding version of the court opinion, the problem goes beyond the encoding of the pdf opinion that the court publishes on its website. As the Court website explains, there are six different versions of opinions published by the Court.
Prior to the issuance of (1) bound volumes of the U.S. Reports, the Court's official decisions appear in three temporary printed forms: (2) bench opinions (which are transmitted electronically to subscribers over the Court's Project Hermes service); (3) slip opinions (which are posted on this website); and (4) preliminary prints.
In addition to these four forms, Court opinions are published (5) in pdf on the website and (6) in bound, printed volumes by a commercial publisher under contract with the Government Printing Office. So which one is the "official" version? The print versions: "Only the bound volumes of the United States Reports contain the final, official text of the opinions of the Supreme Court of the United States. In case of discrepancies between the bound volume and any other version of a case--whether print or electronic, official or unofficial--the bound volume controls."
A variation on this policy can be found on the U.S. House website, describing the publication of Federal legislation, whose "official" version is the one printed by the Government Printing Office once each year. Considering that nearly all legal research is now done electronically, there is a serious disconnect here between actual practice and the policies of these two branches of Government.
So, not only do the electronic versions of legal sources from the Court and Congress lose or scramble much of the original structural information, their official policies undermine the value of publishing in electronic form to begin with. In practice, what this means is that attorneys, and even the Court, end up relying on one of the two major commercial databases, Westlaw and Lexis/Nexis, for the electronic versions that they publish, after those companies input their own version of the documents' structure. As a result, the Court ends up subscribing to these commercial databases to get access back to its own original sources. What an odd state of affairs.
The solution, technically simple, will take some political will, or some technical enlightenment from the Court: publish Court opinions in an official electronic format that includes important structural information. This could be as simple as publishing the document in a "tagged" pdf format or, even better, moving toward a more "native" electronic format such as XML. The Executive Branch has done this with the Federal Register (a report of all official government updates) and the Code of Federal Regulations, which are now both published in XML. The recent letter from House Speaker Boehner and Majority Leader Cantor urging the House to publish legislation in XML moves things closer in that branch, as well.
Any bets on how long it will take until all three branches are publishing in a native electronic format?