Tuesday, May 3, 2011

Losing Data in PDF: All the King's Sources

A quick exercise:
  1. Find the Supreme Court opinion in AT&T Mobility v. Concepcion, (hint: look here), the recent case on contracts that block class action suits,
  2. Find all of the (nearly 30) briefs that were submitted to the Court (hint: look here), and
  3. Determine which arguments from the briefs were discussed in the Court opinion.
This kind of information -- which arguments attracted the Court's attention and how those arguments were treated -- is valuable to litigators going before the Court and to anyone interested in seeing how the Court arrives at its decisions. It's also the kind of project that should lend itself well to computerized analysis.
While this exercise can certainly be done by brute force, by reading each brief and comparing it to the final Court decision, it should also be possible to use software to compare the text and sections of each brief with the final opinion. You can imagine a number of ways of doing this: compare documents at the individual word level, compare section titles, compare case references, etc. However, as soon as you set about to write a program to make these comparisons, you are confronted with a problem: the documents that are available from the Court website, or the American Bar Association (where the briefs are found), are all in Adobe pdf format. Not so bad, you think. These were originally electronic documents, not scanned images, so they were encoded in pdf with their text. Just extract the text and work from there. It turns out not to be that easy. When you scratch beneath the surface of a pdf, you see that it is mainly a graphic representation of the document. A great deal of the structure of the document is simply not encoded: sections, citations, paragraphs -- even the difference between footnotes and the main text -- are all gone.
You can see that by trying to convert a pdf to text or to web format (html). Google Documents has a nice feature that does this, and here is Google's web-converted version of the opinion above (AT&T Mobility v. Concepcion). The way that Google presents the converted document shows the original pdf image of each page, followed by the converted version. A few items jump out from the first page of converted text. Words that were divided at the end of a line in the original, e.g. 'uncon- scionable', are still broken even in the middle of the paragraph. Text formatting, such as italics for case citations, is gone, and formatting of some paragraphs has been significantly disrupted: the top paragraph on page 3 is right-justified in the Google Docs version. Even more problematic is the title section, where the names of the Supreme Court justices are broken up:

Key information about the case--who joined which opinions--has been lost. This information can be recovered in a variety of ways, including by manually coding the vote of each justice in the case, but how wasteful, considering that all of that information was available in the original (electronic) version of the document. In fact, the original sets out the Justices names in all caps to set them apart visually:

Ironically, the Court's extra effort to provide a distinctive visual layout that highlights the Justices' names actually breaks Google's algorithm for parsing the pdf text. With a little bit of forethought, the Court could preserve both the layout and the key structural information, to make their opinions more accessible to the general public, as well as to meet Federal government accessibility standards. (Though these standards are not directly binding on the courts--another sad irony.)
So, for now, we have the technical challenge of converting pdfs to structured text, which is tough enough. Google Documents misses many of the most important text features and in another post, I will discuss other (imperfect) options to do pdf to text conversions, including the pdftotext and pdftohtml programs, and the open source Apache pdfBox and Tika projects.
But for a lawyer, or anyone who cares about the "official" or binding version of the court opinion, the problem goes beyond the encoding of the pdf opinion that the court publishes on its website. As the Court website explains, there are six different versions of opinions published by the Court.
Prior to the issuance of (1) bound volumes of the U.S. Reports, the Court's official decisions appear in three temporary printed forms: (2) bench opinions (which are transmitted electronically to subscribers over the Court's Project Hermes service); (3) slip opinions (which are posted on this website); and (4) preliminary prints.
In addition to these four forms, Court opinions are published (5) in pdf on the website and (6) in bound, printed volumes by a commercial publisher under contract with the Government printing office. So which one is the "official" version? The print versions: "Only the bound volumes of the United States Reports contain the final, official text of the opinions of the Supreme Court of the United States. In case of discrepancies between the bound volume and any other version of a case--whether print or electronic, official or unofficial--the bound volume controls."
A variation on this policy can be found on the U.S. House website, describing the publication of Federal legislation, whose "official" version is the one printed by the Government Printing Office once each year. Considering that nearly all legal research is now done electronically, there is a serious disconnect here between actual practice and the policies of these two branches of Government.
So, not only do the electronic versions of legal sources from the Court and Congress lose or scramble much of the original structural information, their official policies undermine the value of publishing in electronic form to begin with. In practice, what this means is that attorneys, and even the Court, ends up relying on one of the two major commercial databases, Westlaw and Lexis/Nexis for the electronic versions that they publish, after those companies input their own version of the documents' structure. As a result, the Court ends up subscribing to these commercial databases to get access back to its own original sources. What an odd state of affairs.
The solution, technically simple, will take some political will, or some technical enlightenment from the Court: publish Court opinions in an official electronic format that includes important structural information. This could be as simple as publishing the document in a "tagged" pdf format, or even better, to move toward a more "native" electronic format such as XML. The Executive Branch has done with the Federal Register (a report of all official government updates) and the Code of Federal Regulations, which are now both published in XML. The recent letter from House Speaker Boehner and Majority Leader Cantor urging the House to publish legislation in XML moves things closer in that branch, as well.
Any bets on how long it will take until all three branches are publishing in a native electronic format?