Friday, May 13, 2011

California Law: Recovering Meaning and Metadata with RegEx


In a previous post, I mentioned some of the challenges in recovering meaningful structural information (titles, paragraphs) from PDFs, and why government entities should retain this information when they publish electronic documents.
I'll have more to say in future posts about what information is important to retain, which, at a minimum, should include document structure (titles, sections, paragraphs and other meaningful divisions) as well as references to other documents (statutes, Constitutional provisions, court decisions, etc.). This does not even touch on the meaning of the documents, but at least makes it possible to more easily navigate electronic documents.
As a motivation for my next (technical) post on how to recover some of this information from existing plain text legislation using a variety of open source Linux utilities, Perl packages and Python functions, I'll take a look at a section from California's legislation.
California's legislation is divided into 29 codes, which can be searched from this quaint official web site from California's Legislature. The statutes themselves are posted in plain text, so a visitor to the site has to do repeated searches in order to assemble all of the references necessary to make sense of any given section of the Code.
As an example, here's an apparently simple exercise for using the CA code site:
What does a visitor to California need in order to legally drive a vehicle in the state without a California driver's license? (No Googling allowed!) One of the provisions of the relevant CA statute has 3 external references, which I've identified with italics:
(b) Any person entitled to the exemption contained in subdivision (a), while operating, within this state, a commercial vehicle, as
defined in subdivision (b) of Section 15210, shall have in his or her
possession a current medical certificate of a type described in
subdivision (c) of Section 12804.9, which has been issued within two
years of the date of operation of that vehicle.
How to make sense of this?
Wouldn't it be nice to have links to these references, at least, to know what definitions are being cited? My next post will discuss the multi-step process to identify these references and add hyperlinks through a sequence of "search and replace" functions using RegEx. Once this link information is added, navigating and analyzing the law is still not a walk in the park, but it becomes more manageable:
(b) Any person entitled to the exemption contained in subdivision (a), while operating, within this state, a commercial vehicle, as
defined in subdivision (b) of Section 15210, shall have in his or her
possession a current medical certificate of a type described in
subdivision (c) of Section 12804.9, which has been issued within two
years of the date of operation of that vehicle.
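To give a flavor of the "search and replace" idea, here is a minimal Python sketch, not the full process. It assumes references are written exactly as "subdivision (x) of Section N", and the URL template is a placeholder I made up for illustration; the function name link_references is likewise hypothetical.

```python
import re

# Hypothetical URL template (an assumption for illustration only);
# it is not the address scheme of any real site.
URL_TEMPLATE = "https://example.org/ca/code/{section}#{subdivision}"

# Matches references such as "subdivision (b) of Section 15210" or
# "subdivision (c) of Section 12804.9". \s+ lets a reference span a line break.
REFERENCE = re.compile(
    r"subdivision\s+\((?P<subdivision>[a-z])\)\s+of\s+"
    r"Section\s+(?P<section>\d+(?:\.\d+)?)"
)


def link_references(text):
    """Wrap each 'subdivision (x) of Section N' reference in an HTML link."""

    def to_anchor(match):
        url = URL_TEMPLATE.format(
            section=match.group("section"),
            subdivision=match.group("subdivision"),
        )
        # Collapse internal whitespace so a wrapped reference reads cleanly.
        label = " ".join(match.group(0).split())
        return '<a href="{}">{}</a>'.format(url, label)

    return REFERENCE.sub(to_anchor, text)


if __name__ == "__main__":
    sample = (
        "defined in subdivision (b) of Section 15210, shall have in his or her\n"
        "possession a current medical certificate of a type described in\n"
        "subdivision (c) of Section 12804.9, which has been issued within two"
    )
    print(link_references(sample))
```

A bare "subdivision (a)" with no section number is left untouched by this pattern, which is one reason the real job takes several passes rather than a single expression.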
As a final note: The major legal publishers (Westlaw, LexisNexis) provide this kind of link information in their commercial databases for lawyers. And Cornell's LII has added links to navigate between references on the LII version of the Code, a major step forward for public access. But even better will be when Congress (and eventually the states) include these reference links when they first publish legislation. Another reason why this announcement is significant.