Tuesday, May 2, 2017

XML Editing is Hard

I was reminded this week that our best customers at Xcential are those who try to build a legislative data system themselves first. And today's xkcd perfectly sums up why:


Thursday, October 22, 2015

Git for Law Revisited

Laws change. Each time a new U.S. law is enacted, it enters a backdrop of approximately 22 million words of existing law. The new law may strike some text, add some text, and make other adjustments that trickle through the legal corpus. Seeing these changes in context would help lawmakers and the public better understand their impact.

To software engineers, this problem sounds like a perfect application for automated change management. Input an amendment, output tracked changes (see sample below). In the ideal system such changes could be seen as soon as the law is enacted -- or even while a bill is being debated. We are now much closer to this ideal.

Changes to 16 U.S.C. 3835 by law 113-79

On Quora, on this blog, and elsewhere, I've discussed some of the challenges of using git, an automated change management system, to track laws. The biggest technical challenge has been that most laws, and most amendments to those laws, have not been structured in a computer-friendly way. But that is changing.

The Law Revision Counsel (LRC) compiles the U.S. Code, through careful analysis of new laws, identifying the parts of existing law that will be changed (in a process called Classification), and making those changes by hand. The drafting and revision process takes great skill and legal expertise.

So, for example, the LRC makes changes to the current U.S. Code, following the language of a law such as this one:
Sample provision, 113-79 section 2006(a)
LRC attorneys identify the affected provisions of the U.S. Code and then carry out each of the law's instructions (strike "The Secretary", insert "During fiscal year", and so on). Since 2011, the LRC has been using and publishing the final result of this analysis in XML format. One consequence of this format change is that it becomes feasible to automatically match the "before" text to the "after" text and produce a redlined version, as seen above, showing the changes in context.

To produce this redlined version, I ran xml_diff, an open-source program written by Joshua Tauberer of govtrack.us, who also works with my company, Xcential, on modernization projects for the U.S. House. The results can be remarkably accurate. As a prerequisite, you need "before" and "after" versions in XML format and a small enough stretch of text to make the comparison manageable.
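At its core, the comparison is a diff over the two versions. xml_diff works on XML trees, but the idea can be sketched at the word level with Python's standard difflib (the sample sentences and the bracket markers below are my own illustration, not xml_diff's actual output format):

```python
import difflib

def redline(before, after):
    """Word-level redline: [-struck text-] and {+inserted text+}."""
    a, b = before.split(), after.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        else:
            if i1 < i2:  # words removed from the "before" text
                out.append("[-" + " ".join(a[i1:i2]) + "-]")
            if j1 < j2:  # words added in the "after" text
                out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

# Invented example in the spirit of the amendment above
before = "The Secretary shall provide annual payments."
after = "During fiscal year 2014, the Secretary shall provide annual payments."
print(redline(before, after))
```

The same opcode walk works over any token sequence, which is why the approach scales from a sentence to a section, provided the "before" and "after" spans are small enough to align.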

Automating this analysis is in its infancy, and won't (yet) work for every law. However, the progress that has been made points the way toward a future when such redlining can be shown in real-time for laws around the world.

Wednesday, September 16, 2015

More Elasticsearch: Flexibility without duplicates

People want everything. When they're searching, they want flexibility and they want precision, too. Legal researchers, especially, show this cognitive dissonance: in their personal lives they are used to Google's flexibility ("show me that hairy dog that looks like a mop"), and at work they use 'Advanced' search interfaces that can find the right legal document, if only they write a search query that is sufficiently complex ("show me the rule between September 1981-1983 that has the words 'excessive' and 'sanctions' within 4 words of each other, and does not have the word 'contraband'").

To search through legal documents, precision is important: 42 U.S.C. 2000e-5 (a section of the United States Code) is not the same as 42 U.S.C. 2000e. At the same time, a text search for 'discriminate' should probably also return results that have the word 'discrimination'. Handling this in Elasticsearch (ES) seemed simple at first: create two indexes, or two 'types' within a single index. In essence, we'd index the documents once with a permissive analyzer that doesn't discriminate between 'discriminate' and 'discrimination' (an English-language analyzer) and once with a strict analyzer that breaks words on whitespace and will only match exact terms (read more on ES analyzers here). Search the first index when you want a flexible match and the second one when you want an exact match. So far so good.
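To see why the two analyzers behave differently, here is a rough Python simulation. The suffix-stripping 'stemmer' is a crude stand-in for ES's real English analyzer (which uses a proper stemming algorithm), but it shows how 'discriminate' and 'discrimination' collapse to the same token, while the whitespace analyzer keeps a citation like '2000e-5' intact:

```python
import re

def whitespace_analyze(text):
    """Strict: split on whitespace only, keep each token exactly as written."""
    return text.split()

def english_analyze(text):
    """Loose stand-in for an English analyzer: lowercase, tokenize,
    and strip a few common suffixes so related word forms collide."""
    tokens = re.findall(r"[a-z0-9.\-]+", text.lower())
    def stem(t):
        for suffix in ("ation", "ing", "ate", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                return t[: -len(suffix)]
        return t
    return [stem(t) for t in tokens]

# 'discriminate' and 'discrimination' stem to the same token...
print(english_analyze("discriminate"), english_analyze("discrimination"))
# ...but a whitespace analysis preserves the exact citation
print(whitespace_analyze("42 U.S.C. 2000e-5"))
```

Index a field through the first analyzer and 'discrimination' matches a query for 'discriminate'; index it through the second and only the verbatim term matches.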

None or too many

But what about combining a flexible match with a strict one ("section 2000e-5" AND discriminate)? You either get no results or duplicates. No results are returned if you're looking for the overlap of the two terms: by design, the two indexes were created separately. On the other hand, if you're looking for matches of either term, you get duplicates, one from each index. Back to the drawing board.

To remove duplicates, the internet suggests field collapsing: index each document using the same ID value in both indexes, group by ID, and set the 'top_hits' size to 1 to get just one of the two duplicates. Unfortunately, grouping also breaks the nice results pagination that comes with ES. So you can de-duplicate results, but you can't easily paginate them. This is a problem for searches that return hundreds or thousands of results. For a nice afternoon detour, you can read why pagination and aggregation don't play well together.
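For reference, the field-collapsing query looks something like this (expressed here as a Python dict; 'docId' and the field name are hypothetical): a terms aggregation on the shared ID, with a top_hits sub-aggregation of size 1 so each group yields a single hit:

```python
# Field-collapsing sketch: group duplicate hits by a shared ID field
# and keep only the top hit from each group.
collapse_query = {
    "size": 0,  # suppress the normal hit list; results come from the aggregation
    "query": {"match": {"flex_docText": "discriminate"}},
    "aggs": {
        "by_doc": {
            "terms": {"field": "docId", "size": 10},  # one bucket per document ID
            "aggs": {
                "first_hit": {"top_hits": {"size": 1}}  # one hit per bucket
            }
        }
    },
}
```

The catch described above is visible in the structure: the deduplicated results live inside aggregation buckets, not in the ordinary hit list, so from/size pagination no longer applies to them.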

Two fields in one index

O.K., then: how about indexing each field twice within the same document in the index? The two copies should have different names and should be analyzed differently. For example, one could be called 'flex_docText' and the other 'exact_docText'. Combined flexible and exact searches will point to the same document. And while each field is indexed and searched differently, the original text that ES stores will be the same, so we only need to return one of these fields (it doesn't matter which) to the user.


The first step is to create the new index with a 'mapping' for the two fields that defines the different analyzer to use for each:

PUT /myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "flex_docText":  { "type": "string", "analyzer": "english" },
        "exact_docText": { "type": "string", "analyzer": "whitespace" }
      }
    }
  }
}

Next, index the documents, making sure to index the 'docText' field twice, once under each name. This can be as easy as including the content twice when creating the document:

PUT /myindex/mytype/1
{
  "flex_docText": "This is the text to be indexed.",
  "exact_docText": "This is the text to be indexed."
}

Indexing from SQL

An additional complication arises when importing data from a SQL database. As described in my earlier post, a nice open-source JDBC import tool was built for this purpose. So nice, in fact, that it takes the output of a SQL query and sends it directly to Elasticsearch to be indexed. The downside is that the data is indexed under just the name it has in the SQL query. So, if your database column is named 'docText' in a table named 'myTable', you might use this query:

SELECT docText FROM myTable

The JDBC import tool would then index one field, called docText. If you want to create two parallel fields in the index, extract the column twice, aliasing it to a different name each time:

SELECT docText as flex_docText, docText as exact_docText FROM myTable

In fact, you can extract the same data as many times as you want, under different names, and apply different analysis to each copy in the index mapping. Does that really work? Yes, it really works. Now, if you also want to highlight search results and avoid duplicates, that's a story for another day.
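With the two parallel fields in place, the combined search that stumped us earlier becomes a single bool query against one index, so there are no duplicates and ordinary from/size pagination still works. A sketch, written as a Python dict and using the field names from the mapping above:

```python
# Combined strict + flexible search against a single index.
# Both clauses must match the same document, so no duplicates appear.
combined_query = {
    "from": 0, "size": 10,  # normal pagination: one index, one hit per document
    "query": {
        "bool": {
            "must": [
                # strict clause: whitespace analyzer, citation must appear verbatim
                {"match": {"exact_docText": "2000e-5"}},
                # flexible clause: english analyzer, also matches 'discrimination'
                {"match": {"flex_docText": "discriminate"}},
            ]
        }
    },
}
```

Because a match query is analyzed with each field's own analyzer, the first clause stays exact while the second stays flexible, even though both run in the same request.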

Tuesday, September 15, 2015

Elasticsearch on Windows: Security

A government client wants to be able to search through their data. In a typical case, an office or agency has a long-standing rulebook that employees still carry around in a paper binder, because they lack a browsable, searchable electronic version. For example, the GSA's consulting group, 18F, recently described how they helped the Department of Labor envision and design an electronic version of the Field Operations Handbook. Their prototype included a search layer built on Elasticsearch and served with Apache (presumably on a Linux server).

In almost any nook and cranny of government, you will find similar handbooks, guidebooks, rulebooks and other books that are crying out for a user-friendly search layer. Elasticsearch is an excellent choice for such a project, but it often needs to be integrated with the government's (usually Windows-based) servers. Kudos to 18F for introducing Linux in many of their projects, but for now, those projects are the exception. For the rest of us, using Elasticsearch on Windows presents a number of challenges. All of these can be overcome, though documentation for these solutions is scarce.

In my last post, I described how data from an MS SQL Server database can be imported to and indexed in Elasticsearch. In future posts, I will discuss some indexing and analysis tricks that can make the search experience smoother (like how to avoid retrieving duplicate results). This post is dedicated to securing an Elasticsearch cluster on Windows.

Elasticsearch lacks built-in security

Out of the box, Elasticsearch:
  1.  Uses a separate port (default = :9200) for requests and responses. Searching directly from the browser would require opening the port to web traffic on the server.
  2. Allows data to be changed and deleted, in addition to being retrieved. This is great for a developer who can quickly create, delete and update a search index. You don't usually want your users to have these superpowers, though.
  3. Provides data about the entire search cluster and all indexes on the cluster. That's like having a SQL server open to the world. While this may please Bobby Tables, it makes the rest of us uncomfortable.
Elasticsearch is not built with integrated security features. You can use a separate plugin, called 'Shield', for security, though that introduces another dependency and may be overkill for many use cases. It is also possible to route all traffic to Elasticsearch through your server-side web application, but this requires a lot of duplicate logic to accept and return requests.

Another option is to use features of the IIS web server itself. For this, you will have to build a search application interface, but for any practical search project you will want to do that anyway. I have stripped down one such GUI to focus on text-based document searches. This starter application (on GitHub) takes user input, formulates searches, and returns a paginated list of results. Do let me know if you use it: I have discovered many useful improvements to satisfy specific client needs.

Using a Reverse Proxy on IIS

With IIS serving your main application, you can create a reverse proxy on IIS for Elasticsearch requests. The reverse proxy translates and reroutes incoming web requests into internal requests to Elasticsearch. How does this work? Your application's requests to http://www.mydomain.com/search are routed by IIS internally to http://localhost:9200/_search. The internal address is not accessible to outside web traffic. This approach has several advantages:

  1. No need to expose another port. All traffic can be routed through a url on your default web port (:80).
  2. Block delete and change requests: simply don't set up any URLs that route these requests to Elasticsearch.
  3. Use IIS security features as needed for your main application (e.g. Anonymous login or Windows login).
On IIS, setting up a reverse proxy is not as tough as it sounds. It requires two modules that can be installed using the Web Platform Installer: URL Rewrite and Application Request Routing (ARR). Thorough instructions for setting up the reverse proxy can be found here. The essentials (and ES-specific details) follow:

Install Web Platform Installer

Pretty straightforward, from a Microsoft downloads page.

Install ARR and URLRewrite modules

Open the Web Platform Installer interface, search for "ARR" and "URL Rewrite", and follow the instructions to install each module.

Create the URLRewrite rules for your website

  1. Open the IIS Server Manager and navigate to your website in the Connections column on the left side. Double-click the Application Request Routing Cache icon (ARR, under IIS, in the Features View).
  2. Open Server Proxy Settings on the right side of ARR and make sure the Enable Proxy checkbox is selected. Close the ARR menu. Double-click the URL Rewrite icon. Click Add Rule -> Blank Rule.
  3. Write the rewrite rules for Elasticsearch. The main rewrite rule uses a regular expression to match a pattern like search/_search(.*) and rewrite it as http://localhost:9200/_search{R:1}
  4. You may want to expose other Elasticsearch APIs as well, and it is best to create a rewrite URL for each of them. For example, to check the health of the cluster, match search/_cluster/health to http://localhost:9200/_cluster/health. If you have trouble entering these rules through the IIS Manager UI, consult the blog I referred to earlier, or add the rules directly to your web.config file.
  5. When you are done entering the rewrite rules, you will have an XML file in your website folder called web.config that includes a <rewrite> section. It should look something like this file: https://gist.github.com/aih/8f2b8d76b44d8836bd77
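Sketched out, the <rewrite> section follows this shape (the rule names are arbitrary, and the patterns mirror the examples in steps 3 and 4; adapt the paths to your own site):

```xml
<rewrite>
  <rules>
    <!-- main search endpoint: pass the query string through with {R:1} -->
    <rule name="ES search proxy" stopProcessing="true">
      <match url="^search/_search(.*)" />
      <action type="Rewrite" url="http://localhost:9200/_search{R:1}" />
    </rule>
    <!-- optional: expose cluster health read-only -->
    <rule name="ES cluster health" stopProcessing="true">
      <match url="^search/_cluster/health" />
      <action type="Rewrite" url="http://localhost:9200/_cluster/health" />
    </rule>
  </rules>
</rewrite>
```

Note what is absent: no rules route DELETE-style administrative paths, which is exactly how this setup blocks destructive requests.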

Test it out

From your application, you should now be able to submit a query (either as query parameters in the URL or as a JSON query payload) to http://[www.yoursite.com]/search, and get the response from Elasticsearch. Note that the /search path is now reserved for Elasticsearch traffic, so your application cannot use that path (e.g. for a webpage). If this is a problem, you can use any path you prefer in your reverse proxy settings.
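A minimal client call through the proxy might look like this in Python (the host is a placeholder, and 'flex_docText' follows the field names from my earlier post; the request is built separately from being sent so you can inspect it first):

```python
import json
from urllib import request

def build_search_request(host, term):
    """Build the URL and JSON payload for a search through the reverse proxy.

    '/search' is the reverse-proxied path configured above.
    """
    url = "http://%s/search/_search" % host
    payload = json.dumps(
        {"query": {"match": {"flex_docText": term}}}
    ).encode("utf-8")
    return url, payload

def run_search(host, term):
    """Send the search; requires the proxy and ES cluster to be live."""
    url, payload = build_search_request(host, term)
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

From the browser's point of view this is plain traffic to your site on port 80; only IIS knows it is being handed to Elasticsearch internally.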

Your thoughts?

The Elastic team has described setting up a reverse proxy with Nginx for many of the same reasons, and this does seem like a clean way to expose only the search API to external web traffic. Are there other architectures you have used for Elasticsearch on Windows? Do you see security vulnerabilities in the approach I've described? I'd like to hear your thoughts in the comments.