Tuesday, June 11, 2013

UUID for Legal Text

The post about the book I'm writing with Grant on legislative data has drawn a lot of interest, and I have gotten great feedback on it.

Data standards are always a hot topic (relatively hot, that is: we normalize against interest in this field in general, not against interest in the Kardashians).

Among the questions on data standards that have sparked interest is the question of how to assign unique identifiers to legal text. These are needed for many reasons, in a variety of contexts. The most straightforward is to be able to hyperlink to a specific subsection of a bill or law.

Some options for creating the unique identifier include:

  • A unique randomish code (e.g. based on the current datetime)
  • A hash of the text of the section
  • A URN or URL identifier based on a standard, human-readable path to the section (e.g. us/uscode/title26/section100)
  • Some combination of the above
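To make the options concrete, here is a minimal Python sketch of each one. The function names, the whitespace normalization, and the "#" separator in the combined form are my own assumptions for illustration, not part of any existing standard. The time-based code uses `uuid.uuid1()`, which mixes the current timestamp with a node id.

```python
import hashlib
import uuid


def random_id() -> str:
    """Option 1: a randomish code derived from the current datetime (UUID version 1)."""
    return str(uuid.uuid1())


def content_hash_id(text: str) -> str:
    """Option 2: a SHA-256 hash of the section text.

    Whitespace is collapsed first (an assumption) so that reflowing
    the text does not change the identifier.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def path_id(jurisdiction: str, collection: str, title: int, section: str) -> str:
    """Option 3: a human-readable path, e.g. us/uscode/title26/section100."""
    return f"{jurisdiction}/{collection}/title{title}/section{section}"


def combined_id(path: str, text: str, digest_chars: int = 12) -> str:
    """Option 4: the readable path plus a short prefix of the content hash.

    The '#' separator and the 12-character prefix are hypothetical choices.
    """
    return f"{path}#{content_hash_id(text)[:digest_chars]}"
```

The path-based form is the only one a person can read and guess; the hash is the only one that changes automatically when the text changes. The combined form tries to get a little of both.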
Version control is a very important consideration: Section 100 of title 26 may be amended, and the identifier should tell us which version we're citing. Some very technically savvy minds at the Law Revision Counsel of the U.S. House of Representatives have suggested a combined approach, with one identifier for the Code section and one that specifies the version (e.g. the version as amended by P.L. 114-XYZ).
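One way to picture the two-part approach is a small record that keeps the section identifier and the version identifier separate and joins them only when a citation is written out. This is a sketch under my own assumptions: the `VersionedId` class, the "@" separator, and the "pl-114-XYZ" label (which just reuses the placeholder above) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedId:
    section: str  # stable identifier for the Code section
    version: str  # identifies the version, e.g. the amending public law

    def __str__(self) -> str:
        # Hypothetical serialization: section and version joined by '@'.
        return f"{self.section}@{self.version}"

    @classmethod
    def parse(cls, value: str) -> "VersionedId":
        # Split on the last '@' so the section path itself may contain any other characters.
        section, sep, version = value.rpartition("@")
        if not sep or not section or not version:
            raise ValueError(f"not a versioned identifier: {value!r}")
        return cls(section, version)


cite = VersionedId("us/uscode/title26/section100", "pl-114-XYZ")
print(cite)  # us/uscode/title26/section100@pl-114-XYZ
```

Keeping the two parts separate means a link to "the section, whatever its current text" and a link to "the section as it stood after a given amendment" can share the same first half.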

Another question is whether the identifier should itself carry information about the text. In the case of a hash, we could use a similarity-preserving hash, e.g. simhash, so that related texts would produce hashes that are close to each other (for simhash, hashes that differ in only a few bits). This might have advantages, for example, in citing court documents. Text in one court opinion that is similar to text in another may provide useful precedent; a search algorithm could collect similar text sections based on these simhashes.

Rather than get ahead of myself and draft out the entire chapter on unique identifiers, I'll stop here and invite your comments:

  • What is important to preserve in a unique identifier for legal texts?
  • What id schemes have proven successful in other document-based structures?
  • What would Google (or Linus Torvalds) do?

If you have Insights or connections to People With Insights, please comment here or let me or Grant know.