Wednesday, November 30, 2011

Data and Law: Two interesting new posts

What is the role of data in law?  I was trained as a scientist, so I have a tendency to look for evidence and falsifiable hypotheses in law and in my daily life.  On this blog, I have discussed one aspect of data in law: the value of adding structure and metadata to legal texts.  In a similar vein, Grant Vergottini throws down the gauntlet in his latest post, calling for the creation of a uniform semantic web for legal documents.  He writes that current publishing practices provide "no uniformity between jurisdictions, minimal analysis capability (typically word search), and links connecting references and citations between documents are most often missing."

Grant asks a number of thought-provoking questions about creating a uniform semantic web for legal documents, which I think need to be addressed by the legal technology community and the broader legal community:
What standards would be required? What services would be required?...Should the legal entities that are sources of law assume responsibility for publishing legal documents or should this be left to third party providers?
Edward Bryant takes a different approach in his post about data in law, focusing on the value of data in making policy decisions, which are then implemented in laws or regulations.  He discusses recommendations by the Ohio tax board to streamline the processing of challenges to the state authority's valuations of residential property.  The tax board apparently recommends streamlining challenges for residential property, but not commercial property, on the assumption that residential claims will be less complex.  Bryant points out that the board's recommendation would be more credible if it used even a little data to correlate case complexity with the type or amount of the claim.

I would expand Bryant's point to suggest that many of our leading decision-makers are not equipped to make data-driven decisions.  Often, policy decisions like this are made with little or no relevant data--or even in the face of contrary data.  Requiring some data-intensive technical training for lawyers would be a good start.  How about one semester of Evidence that focused not on the FRE, but on how to gather and evaluate objective evidence in support of policy or legal decisions?  I suspect that if lawyers, in general, were more data-literate, we'd have an easier time answering the questions that Grant poses above, on the way to creating a uniform semantic web for law.

Monday, November 28, 2011

How to Convert HTML to Text, With Formatting

My current best answer: the html2text package from Germany.  It can be installed easily on Mac OS X with MacPorts ($ sudo port install html2text), and on other Unix-like systems through their package managers.  It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html

So now that you know my current answer, here's the problem: not all html is created equal.  Legislative data is published in a variety of formats, some uglier than others.  Extracting information requires cleaning these formats up.

When I started to work with California legislation, I had the opposite problem: converting the state's plain text into simple html for use on web pages.  To do that, I used a Perl text2html module.  While it takes many steps to produce web-friendly html from California's laws, at least the plain text was not cluttered with formatting symbols and tags that could interfere with the core text.
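
The text-to-html direction is simple enough that a short Python sketch can stand in for the Perl module (this is an illustration of the idea, not the module I actually used):

    import html

    def text_to_html(text):
        # Wrap each blank-line-separated paragraph in <p> tags,
        # escaping characters like < and & along the way.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
        return "<html><body>\n%s\n</body></html>" % body

The real pipeline handles section numbering, indentation, and cross-references, but the core move is just that: paragraphs in, tags out.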

The problem in other states is far worse.  Some versions of Iowa's bills, for example, appear to be published directly from Microsoft Word to the web, which means they're littered with a maze of formatting information--sometimes positioning each individual word on the page--that has nothing to do with the text of the bill.  Other states use hundreds of cells of an html table (or multiple tables) to format a bill.  Looking at the file in a browser on the state's website, you would never know that the underlying data is so messy.

Simply stripping all of the html tags won't work, because that eliminates all of the formatting information, including information that can change the bill's meaning (spaces, paragraphs).  That's unfortunate, because there are many html libraries that make stripping tags easy (e.g. Beautiful Soup for Python, or similar libraries in other languages).  What I want is to preserve the formatting, but express it with spaces and paragraphs rather than tables or graphical positioning of words.
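
To see why, here is a minimal sketch of the naive tag-stripping approach in Python, using the Beautiful Soup 4 API (filename.html follows the placeholder used above):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    with open("filename.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    # get_text() discards every tag, including the table cells and
    # positioning markup that carry the bill's layout, so columns
    # collapse and meaningful whitespace is lost.
    text = soup.get_text(separator=" ")

One line of code, and one line of code's worth of fidelity: everything the tags expressed about spacing and structure is gone.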

Ironically, the most effective way to clean this messy data is also the easiest: copy and paste the bill as displayed in your web browser.  After all, the formatting was made for the browser to interpret, and the copy-paste function (at least on a Mac) is quite faithful to the formatting.  However, automating this copy-and-paste process is far from simple and, with one exception, I have not seen any programs that make use of this native browser capability to convert files in bulk.  The exception is the text-based browser Lynx, which has a dump function (lynx -dump).  However, that converter apparently has a number of faults, including an inability to process tables.  Anyone know how to use Chrome or Firefox to automate conversion of html to text for large numbers of files?  This is still the solution I'd prefer.

But barring that, I found a close second in the html2text program.  Although it's relatively old (2004), it's fast and deals reasonably with tables and other formatting, such as underlining and strikeouts.
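
Batch conversion is then a small scripting job.  A minimal Python sketch, assuming html2text is on your PATH and using the same options as the command above:

    import glob
    import subprocess

    # Convert every .html file in the current directory to a
    # matching .txt file via html2text.
    for path in glob.glob("*.html"):
        out = path[:-len(".html")] + ".txt"
        with open(path, "rb") as src:
            subprocess.run(
                ["html2text", "-nobs", "-ascii", "-width", "200",
                 "-style", "pretty", "-o", out, "-"],
                stdin=src, check=True)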

Edit: At the suggestion of Frank Bennett, below, I installed the w3m text browser and used it to produce formatted text from html with the following command-line syntax:
w3m filename.html -dump > file.txt
Like html2text, it is fast and produces clean output--actually, somewhat too clean.  The saved file strips some important formatting information, like <u> (underline) tags, so some caution is in order when using this method.
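
One workaround is to make the underlining survive the dump by marking it in the html before w3m sees it.  A sketch, assuming w3m accepts html on standard input via -T text/html (the [U]...[/U] markers are my own convention, purely for illustration):

    import re
    import subprocess

    def w3m_dump_keeping_underlines(path):
        # Replace <u>...</u> spans with visible markers, since
        # "w3m -dump" drops underlining from its plain-text output.
        with open(path) as f:
            marked = re.sub(r"(?i)<u\b[^>]*>", "[U]", f.read())
        marked = re.sub(r"(?i)</u>", "[/U]", marked)
        result = subprocess.run(["w3m", "-T", "text/html", "-dump"],
                                input=marked, capture_output=True,
                                text=True, check=True)
        return result.stdout

If your w3m doesn't read stdin this way, writing the marked-up html to a temporary file and dumping that works just as well.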

Friday, November 18, 2011

Legislative Model: How Much to Open Source?

Should legislative data schemas be open source?  That is the question Grant Vergottini raises in his blog post today, To Go Open Source or Not.  It's a thought-provoking topic, and I encourage you to join the discussion on Grant's blog.  Some background and a bit of my own thinking follow:

Some states (e.g. Oregon) have claimed copyright in the organization of their statutes in order to protect contracts with legal publishers, or other monopoly arrangements for publishing the state's laws.  That is not the case in most jurisdictions, where the bills and statutes themselves are indisputably part of the public domain.

However, even when the legislative text and its organization are in the public domain, access is limited by inconsistent publishing formats and a lack of common standards.  Anyone who has tried to use the public internet to search for information on state or even federal laws knows how difficult this can be.  I have discussed the situation with the U.S. tax code, which my company, Tabulaw, is working to make more accessible at tax26.com.

I have also discussed the difficulty of accessing California's laws, which gave rise to a hackathon to improve the situation.  Thanks to Grant's work, California does have an underlying XML-based data structure, SLIM, that allows the state's legislature to easily research and modify the laws, and makes the technical process of writing bills more efficient.  However, this benefit has not--until recently--translated into improved access for the public.  Grant and his company have recently open-sourced SLIM, which, in theory, could make California's laws more accessible to the public and also make the model available for legislation in other jurisdictions.  This could move us toward a standardization of legislative data.

On the one hand, that would be a big step forward for public access, but it does raise some concerns, as Grant points out: it would mean that one company (in this case Grant's) would own the basic data structure for public laws.  Something similar already happens, de facto, with large swaths of government documents stored in pdf, a format long proprietary to Adobe that has since been published as an open standard.  I am also disturbed by the claim, by the private publishers of the Bluebook, to copyright in the principal standard for citing legal sources, and by other copyright claims that encumber the basic ways legal citations are written.  So there are clear potential problems with a privately owned standard, even one that is open-sourced.

Wouldn't it be nice if governments at all levels would collaborate to create a single nationwide public domain data standard for legislation? That would, for example, make it easier to identify all state laws related to abortion or to compare education laws across jurisdictions.  It might be nice, but it's also less likely than the Congressional SuperCommittee reaching a compromise.  I won't be holding my breath.

I do think that a privately created, widely adopted, and open-sourced standard is the next best option.  The value of having a standard set of metadata in legislation outweighs the risks of private ownership of the standard.  And I believe it is in the interests of everyone involved, including the owners of the standard, to make the open source licensing of the standard clear and permanent, in order to encourage the widest possible use.

Monday, November 14, 2011

Legal Informatics' New Blog, from Grant Vergottini

I'm excited to see the start of a new blog on legal informatics (and more) from Grant Vergottini.  Grant, a key participant and organizer of the California Law Hackathon, developed, with his business partner Bradlee Chang, the authoring system that California's legislature uses to write our state laws.  So he knows a thing or two about legislative data.

Grant's vision, which I share, is that at some point, legislation from around the world will be published in a standard format so that "you or your business can easily research the laws to which you are subject" due to the growth of an industry that "caters to the needs of the legal profession based on open worldwide standards."

There are a number of questions about how that vision will come to be.  I touched on some of them in my answer on Quora about the non-technical barriers to using version control for legislation, which stimulated a lively discussion.  With Grant's new blog, I'm hopeful that we can have more of those discussions and work out both the non-technical (mostly political) and technical challenges standing in the way of open legislative data standards.