Tuesday, March 31, 2015

A Cite to Behold: Standardizing Electronic Legal Citations

Citations are the hyperlinks of law, so it is natural to expect textual citations to be hyperlinked in electronic laws. Numerous parallel efforts are taking on this task for federal government documents online, making those documents easier to read and navigate. Derek Willis (@derekwillis) of the New York Times noticed these links in House Bills published on Congress.gov:

Linked citations are also now found in web interfaces for federal regulations. The terrific site, FederalRegister.gov, parses and adds citations to the user interface. For example, a recent proposed FCC rule on federalregister.gov has links to other regulations, Public Laws and to the U.S. Code:

However, these links are not found in the source XML for the document:

That means that when a regulation is finalized and goes into the Code of Federal Regulations (CFR), it will not have the citation links. To remedy that, the eRegulations project from the Consumer Financial Protection Bureau (CFPB) has its own code that parses citations and adds links for CFPB's user-friendly regulations interface. This code seems to work independently from and in parallel to federalregister.gov, doing essentially the same thing at a later stage of the publication process for regulations.

While hyperlinks are being added to citations in various contexts, each implementation uses different methods to parse the citations and different formats for the links. This leads to inconsistent coverage for textual citations and inconsistent formats for identical textual citations. Before long, the lack of consistency will also lead to broken links. Now is therefore a good time to bring some uniformity to these efforts toward a standardized form for government electronic citations.

Here, I'd like to explore what that standardized form should look like. To serve its purpose, a citation (electronic or text) should be uniqueclear and permanent. A good citation leads to only one document, unambiguously identifies the portion of the document that is referenced, and will still be valid in the future.

Not all written citations meet these thresholds. In the United States Code, for example, there are two sections designated 5 U.S.C. 3598 (see the footnote for the section). That's right, Congress sometimes passes laws with duplicate section numbers. (True also for 10 USC 127c18 USC 3928 USC 193246 USC 70119, which have been redesignated during codification). There are also written citations that are ambiguous because they are missing context (e.g. 'Section 101 of such Act') or refer to law that itself has since been changed. Nonetheless, to the degree that the written citation is unique and unambiguous, the underlying hyperlink should be as well.

Two years ago, Grant Vergottini discussed progress on standardized electronic citations. Many of the ideas in that post were included in the electronic citation solution underlying the U.S. Code XML data model, called USLM. (Xcential helped develop USLM as part of the House Modernization Project). That two-part solution has proven to be efficient and extensible.

The first part involves an identifier: a standard, permanent path for the citation (e.g. 26 U.S.C. 501(c)) -> '/uslm/us/usc/t26/s501/c'). This identifier is clear and unique. It is condensed and human-readable.

In USLM each subunit of text is marked with a identifier of this form, making it easy to create a link to that subunit. So the subsection corresponding to 26 U.S.C. 501(c) has @identifier = '/uslm/us/usc/t26/s501'

The XML of 26 U.S.C. 501(c) with @identifier = /uslm/us/usc/t26/501/c
Conversely, any text that cites that subsection has @ref = '/uslm/us/usc/t26/501/c'.

Text citing 26 U.S.C. 501(c) with @href = /uslm/us/usc/t26/501/c

These identifiers allow fine-grained resolution for links to the U.S. Code from within the Code or from any outside source. They can also be extended to disambiguate between two provisions with identical section numbers, by concatenating additional metadata, such as the law that originally introduced the section (e.g. 'uslm/us/usc/t5/s3598/[/uslm/us/pl/100/32/s5] '). The U.S. Code also includes outbound references to other documents with identifiers of this type:

The second part of the solution is an online service called a 'resolver', which converts an identifier into a URL. The resolver keeps track of the most current and most official sources for documents and ensures that the embedded identifier can remain constant while the sources for these documents--various websites and APIs--may be in flux. Recently, for example, the best source for Congressional bills shifted from thomas.gov to congress.gov, the Library of Congress's new site. As long as the references are based on a standardized identifier, and not the current URL for the source, the resolver can redirect the link in the future to the appropriate future source.  This takes care of the permanence of electronic citations, allowing government websites to update and improve, without affecting the embedded electronic references in legal documents.

This approach has been informed by, and is intended to be compatible with the proposals being developed in the OASIS legal citations working group.

While the core of this solution is now built into the U.S. Code, more needs to be done in order to make it workable more generally. First, identifiers should start to be adopted in other document sources. As more government documents are published in machine-readable form, prompted by good governance or by legislation such as the DATA Act, these documents should have built-in identifiers at the paragraph level, or other appropriate subunits.

Second, a government-wide resolver (which could be hosted on a domain like 'citations.gov') could be used to translate identifiers into current URL's, and add new document sources (e.g. agency reports) as these become part of the electronic government ecosystem. That way, projects such as CFPB could simply add the resolver domain to the correct identifier (e.g. https://citations.gov/uslm/us/usc/501/c) convert a permanent identifier into a link to the source text. An extension of the resolver can take a textual citation as input, parse the text and return the correct identifier (e.g. 'section 501(c) of the United States Code' -> '/uslm/us/usc/t26/501/c').

With the great energy and infusion of talent into the US Digital Service, combined with the great work that is being done at the Library of Congress, GPO and as part of the House Modernization Project, we can now start to build electronic citations into federal documents that will stand the test of time.

Monday, March 16, 2015

Almost Digital Congress: ToDo List

When fitness hero Jack Lalanne, then in his nineties, was asked about his sex life, he said that he and his wife had sex almost every night. "Almost Monday, almost Tuesday, almost Wednesday...". Similarly, if you ask me about progress in modernizing the legislative process, I would say that Congress is almost digital. Not bad, considering the age of the institution, but still leaves something to be desired.

The confluence of last week's panel at SXSW, organized by the Congressional Data Coalition, and this week's #Hack4Congress show we've come a long way in the last couple of years [1]. In this, and the next few posts, I'll discuss what needs to be done to eliminate the "almost" and reap the full benefits of a digital Congress.

My focus is on the key documents that Congress produces: bills, amendments and ultimately, laws. Whether your guide is the Library of Congress or Schoolhouse Rock, the process should be familiar. An introduced bill goes through a number of changes before it becomes law. This is a change-tracking problem and at each stage we need to know (1) the original document, (2) the proposed change and (3) how that proposed change will affect the original.

I was recently asked, for example, if it is possible to show an automated comparison of the impacts on the law of two patent bills now before Congress: Coons's STRONG Patent Act (S. 632) and Goodlatte's Innovation Act, H.R. 9 (114th). The short answer is no. As a pre-requisite, each bill needs to be available in machine-readable form. H.R. 9 is available in XML, but Coons's Act is currently available only in pdf, not text or XML.

As a second pre-requisite, we would need a tool to apply the language of the bill directly to the law which it amends, in this case Title 35 of the U.S. Code. Take the example of section 4 of Goodlatte's bill. It currently reads:

(a) Amendments.—Section 290 of title 35, United States Code, is amended—
(1) in the heading, by striking “suits” and inserting “suits; disclosure of interests”;
(2) by striking “The clerks” and inserting “(a) Notice Of Patent Suits.—The clerks”...

These changes would be displayed in the context of the target law (section 290 of Title 35) in redlining, e.g.:
§ 290 Notice of patent suits suits; disclosure of interests
The clerks (a) Notice Of Patent Suits.—The clerks of the courts of the United States, within one month after the filing of an action under this title shall give notice thereof...
I've mocked up these changes by hand above, but if this process were automated, it would be possible to show the impacts of this patent bill side-by-side with other proposed patent bills. The 1929 "Ramseyer Rule" requires this kind of redlining when a bill is reported out of committee, but that is often late in the process. Ideally, the textual impact of a bill could be seen as soon as it is introduced.

In California, my company, Xcential, has worked with the legislature to automate this kind of tracking in its "As Amends the Law" feature. For any bill before the Assembly, the public can see the additions and deletions that the bill would effect in the context of current law. You can see how this works with the bill AB 26 Medical cannabis. You can see the bill's impacts on the current law, as well as the changes between different versions of the bill as it is amended.

The other major change-tracking challenge is the application of amendments to bills. This is the aim of the Amendment Impact Program, which Xcential is working on as part of the U.S. House's Modernization Project. Progress on this project was reported by the Committee on House Administration and in House Legislative Counsel Sandra Strokoff's presentation at the 2014 Legislative Data Transparency Conference. [While I work on these projects for Xcential, the information in this blog all comes from public sources and my opinions here are my own.]

In total, to follow a bill from introduction to law, Congress needs to:
  1. Ensure bill transparency: 
  2.  Bill text in consistent machine-readable formats. Many bills, particularly in the Senate, are still only available in pdf until later in the process.
  3. Automate bill-to-law change tracking:
  4. Apply bills directly to the U.S. Code and to non-positive law statutes.
  5. Pass codification bills:
  6. Make all titles of the U.S. Code into "positive law". This is primarily a political challenge, which I have discussed previously.
  7. Ensure amendment transparency:
  8. Amendment text in consistent machine-readable formats.
  9. Automate amendment-to-bill change tracking:
  10. Now being tackled in the Amendment Impact Program, discussed above.
In future posts, I will discuss these goals in more detail: the benefits that each of them brings, the progress that has been made toward achieving them, and some of the challenges that remain.

[1] Almost two years ago, I declared that legislative data has a posse and since then, that posse has been getting results. Congress now publishes machine-readable forms of most bills and amendments. The U.S. Code is also digitized, and updates are made available on the Law Revision Counsel's website shortly after new bills are passed. These and other advances were highlighted in testimony submitted (pdf) by the Congressional Data Coalition to House appropriators.

The House Majority is making digital transparency as a priority, adding this section to the House Rules for the 114th Congress:
The Committee on House Administration, the Clerk, and other officers and officials of the House shall continue efforts to broaden the availability of legislative documents in machine readable formats in the One Hundred Fourteenth Congress in furtherance of the institutional priority of providing public availability and use of legislative information produced by the House and its committees.
Fortunately, more and more citizens are interested and educated about what it will take to fulfill this commitment.