Friday, May 27, 2011

How to Convert All Files in a Directory: CA Legislation

Starting with the unstructured data in California's legislation, it takes many steps to add structure to a single Section. Or rather, to add back in the metadata that the Section's original drafters intended, to help a reader understand and navigate the law. The next step is to apply the transformations to all of the Sections in the law.
California helpfully makes all of its codes available for FTP download in a set of nested folders. It would be great if more government agencies made their data available in bulk. But we still have a problem: how to recursively iterate through all the files and folders in the directory (29 folders, 50,000 sections in total) and apply the parsing transformations to each file. Each file consists of a variable number of sections.

For this task, I went back to another old Linux utility: find. If you type "find /" from a command prompt in Linux (or Mac OS), you get a list of all of the files and folders on your computer. Don't do this. It will take a long time, and is not really useful for anything. But you can use this powerful command within a single directory, and send the list of file names to a program that will operate on each one. In this case, I wrapped this all in a Python program, using the Popen() function from the subprocess module to run any Linux commands that I wanted. Gory details below the fold.
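As a minimal sketch of that idea (not the script used for this post), the snippet below runs find against a single directory and collects only the regular files. The helper name list_files and the choice of the -type f option are my own assumptions for illustration.

```python
import subprocess

def list_files(directory):
    """Return the paths of all regular files under `directory`, using find."""
    # -type f limits the listing to regular files, so folders are skipped.
    # Passing the arguments as a list avoids shell quoting problems.
    output = subprocess.check_output(
        ["find", directory, "-type", "f", "-print"],
        universal_newlines=True,
    )
    # find prints one path per line; splitlines() also drops the newlines.
    return output.splitlines()
```

Calling list_files on the top-level folder of the download should give back one path per file, ready to be handed to whatever does the parsing. This sketch assumes no file name contains a newline character.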
CA Codes After

If you want to skip the details and go straight to the results, I've put the newly transformed California code sections on a website (calaw.tabulaw.com). Currently, the design is very simple and has no styling whatsoever. But I welcome you to do a before and after comparison and let me know what you think in the comments.

In my view, converting CA Legislation to structured data makes navigating the code much easier. It also reveals some problems with the version on California's website (repeated sections, stray text markings) that should probably be cleaned up. More about these anomalies, and the brave new world that structured data can bring to law, in future posts.
This is a simplified snippet of the script I used. The cmd line defines a find command that runs through everything in a directory and prints each path for use in the next set of commands. Then I iterate over all of the files in this list, first converting each file to HTML, then running a set of functions to find all of the section and subdivision titles, and then adding links to references within the text (not shown below).
import os
from subprocess import Popen, PIPE

# List everything under the directory, one path per line
cmd = "find path/to/file -print"
fileslist = Popen(cmd, shell=True, stdout=PIPE, universal_newlines=True)
for line in fileslist.stdout:
    path = line.rstrip("\n")  # strip the trailing newline so isfile() can match
    if os.path.isfile(path):
        print("THE FILE IS:" + path)
        # Runs Linux commands and channels output to the PIPE output
        parsedfile = Popen("txt2html --explicit_headings --indent_par_break --make_tables --make_anchors --xhtml " + path + " | ssed -R -n -f File_to_Parse_Sections", bufsize=-1, stdout=PIPE, shell=True)
...
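The linking step is not shown above. Purely as an illustration of the idea, here is a minimal Python sketch that wraps references such as "Section 1234.5" in anchor links. The regular expression, the function name link_section_references, and the "#section-..." anchor format are all assumptions of mine, not the patterns actually used for the California codes.

```python
import re

# Assumed reference format: the word "Section" followed by a number,
# optionally with a decimal part, e.g. "Section 1234" or "Section 1234.5".
SECTION_REF = re.compile(r"\bSection (\d+(?:\.\d+)?)")

def link_section_references(html):
    """Wrap each section reference in a link to a same-page anchor."""
    def to_link(match):
        number = match.group(1)
        # Hypothetical anchor naming scheme: #section-<number>
        return '<a href="#section-{0}">{1}</a>'.format(number, match.group(0))
    return SECTION_REF.sub(to_link, html)
```

A real version would need to handle the other ways the codes cite each other (ranges, subdivisions, references to other codes), which is where most of the work lies.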