Wednesday, September 16, 2015

More Elasticsearch: Flexibility without duplicates

People want everything. When they're searching, they want flexibility and they want precision, too. Legal researchers, especially, show this cognitive dissonance: in their personal lives they are used to Google's flexibility ("show me that hairy dog that looks like a mop"), and at work they use 'Advanced' search interfaces that can find the right legal document, if only they write a search query that is sufficiently complex ("show me the rule between September 1981-1983 that has the words 'excessive' and 'sanctions' within 4 words of each other, and does not have the word 'contraband'").

To search through legal documents, precision is important: 42 U.S.C. 2000e-5 (a section of the United States Code) is not the same as 42 U.S.C. 2000e. At the same time, a text search for 'discriminate' should probably also return results that have the word 'discrimination'. Handling this in Elasticsearch (ES) seemed simple at first: create two indexes, or two 'types' within a single index. In essence, we'd index the documents once with a permissive analyzer that doesn't discriminate between 'discriminate' and 'discrimination' (an English-language analyzer) and once with a strict analyzer that breaks words on whitespace and will only match exact terms (read more on ES analyzers here). Search the first index when you want a flexible match and the second one when you want an exact match. So far so good.
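
To make the contrast concrete, here is a toy Python sketch of the two kinds of analysis. It is not how Elasticsearch implements its analyzers: the suffix list and the function names are invented for illustration, and the real English analyzer uses a proper stemmer and stop words.

```python
import re

# Toy suffix list for illustration only; NOT the stemmer that
# Elasticsearch's "english" analyzer actually uses.
TOY_SUFFIXES = ("ation", "ate", "ing", "s")


def exact_tokens(text):
    """Strict analysis: split on whitespace, keep case and punctuation."""
    return text.split()


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def flexible_tokens(text):
    """Permissive analysis: lowercase, split on non-alphanumerics, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(word) for word in words]


print(exact_tokens("42 U.S.C. 2000e-5"))     # the citation stays in one piece
print(flexible_tokens("discriminate"))       # same stem as the line below
print(flexible_tokens("discrimination"))
```

In this sketch the strict tokens keep '2000e-5' intact, so it can never be confused with '2000e', while both forms of 'discriminate' collapse to the same token under the permissive analysis.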

None or too many


But what about combining a flexible match with a strict one ("section 2000e-5" AND discriminate)? You either get no results or duplicates. If you're looking for documents that match both terms, no results are returned: by design, each term is matched in its own index, so no single indexed copy of a document satisfies both. OTOH, if you're looking for matches of either term, you get duplicates, one from each index. Back to the drawing board.
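
The problem can be modeled in a few lines of Python. This is a deliberately simplified model, not Elasticsearch itself: the token sets are written by hand, and a hit is identified by the pair (index name, document ID), which is the assumption that makes the two copies of a document count as different hits.

```python
# Each index holds its own copy of document 1, analyzed differently.
# The token sets are hand-written for illustration, not produced by ES.
flexible_index = {1: {"section", "2000e", "5", "discrimin"}}
exact_index = {1: {"section", "2000e-5", "prohibits", "discrimination"}}


def search(index_name, docs, term):
    """Return the (index name, doc id) pairs whose tokens contain the term."""
    return {(index_name, doc_id) for doc_id, tokens in docs.items() if term in tokens}


flex_hits = search("flexible", flexible_index, "discrimin")
exact_hits = search("exact", exact_index, "2000e-5")

and_hits = flex_hits & exact_hits  # no single copy matches both clauses
or_hits = flex_hits | exact_hits   # the same document shows up twice

print(and_hits)
print(sorted(or_hits))
```

The intersection is empty and the union contains document 1 twice, once per index, which is the "none or too many" behavior described above.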

To remove duplicates, the internet suggests field collapsing: index each document using the same ID value in both indexes, group the results by ID with an aggregation, and set the size of its 'top_hits' to 1 to get just one of the two duplicates. Unfortunately, grouping also breaks the nice results pagination that comes with ES. So you can de-duplicate results, but can't easily paginate them. This is a problem for searches that return hundreds or thousands of results. For a nice afternoon detour, you can read why pagination and aggregation don't play well together.
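
For reference, a field-collapsing request of this kind might be built as follows. This is a sketch under assumptions: 'docId' is a hypothetical field that holds the shared ID in both indexes, and the aggregation names 'by_doc' and 'one_copy' are arbitrary labels.

```python
import json


def collapse_by_id_body(query, id_field="docId", max_groups=10):
    """Build a search body that groups hits by a shared ID field.

    'docId' is a hypothetical field name; use whatever field carries
    the same ID value for both copies of a document.
    """
    return {
        "size": 0,  # suppress ordinary hits; results come from the buckets
        "query": query,
        "aggs": {
            "by_doc": {
                # One bucket per distinct ID, limited to max_groups buckets.
                "terms": {"field": id_field, "size": max_groups},
                "aggs": {
                    # Keep only the best-scoring copy inside each bucket.
                    "one_copy": {"top_hits": {"size": 1}}
                },
            }
        },
    }


body = collapse_by_id_body({"match": {"docText": "discriminate"}})
print(json.dumps(body, indent=2))
```

Note that the terms aggregation takes a bucket count but no offset, so there is no direct equivalent of the 'from' parameter for jumping to a later page of groups.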

Two fields in one index

O.K., then, how about indexing each field twice within the same document in the index? The two copies should have different names and should be analyzed differently. For example, one could be called 'flex_docText' and the other 'exact_docText'. Combined flexible and exact searches will then point to the same document. And while each field is indexed and searched differently, the original text that ES stores will be the same, so we only need to return one of these fields (it doesn't matter which) to the user.

How-to

The first step is to create the new index with a 'mapping' for the two fields that defines the different analyzers to use for each: 
PUT /myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "flex_docText":  { "type": "string", "analyzer": "english" },
        "exact_docText": { "type": "string", "analyzer": "whitespace" }
      }
    }
  }
}
https://gist.github.com/aih/79155bd4835d3781b380

Next, index the documents, making sure to index the 'docText' field twice, once under each name. This can be as easy as including the content twice when creating the document:

POST /myindex/mytype
{
  "flex_docText": "This is the text to be indexed.",
  "exact_docText": "This is the text to be indexed."
}
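
With both fields in place, a combined flexible-and-exact search becomes a single query against a single document. The sketch below builds one possible request body in Python; the helper name and the choice of a 'bool' query with 'match' and 'match_phrase' clauses are my assumptions about how such a search could be written, not the only way to do it.

```python
import json


def combined_query(flexible_text, exact_phrase, page=0, page_size=10):
    """Build a search body that requires a flexible AND an exact match."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Analyzed with the English analyzer: stemmed, lowercased.
                    {"match": {"flex_docText": flexible_text}},
                    # Analyzed with the whitespace analyzer: exact tokens only.
                    {"match_phrase": {"exact_docText": exact_phrase}},
                ]
            }
        },
        # Both fields store the same text, so return only one of them.
        "_source": ["flex_docText"],
        # Ordinary pagination works again: one hit per document.
        "from": page * page_size,
        "size": page_size,
    }


body = combined_query("discriminate", "section 2000e-5")
print(json.dumps(body, indent=2))
```

One thing to keep in mind is that the whitespace analyzer does not lowercase, so the exact clause is case-sensitive: 'section 2000e-5' will not match text that reads 'Section 2000e-5'.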

Indexing from SQL

An additional complication arises when importing data from a SQL database. As described in my earlier post, a nice open-source JDBC import tool was built for this purpose. So nice, in fact, that it directly takes the output of a SQL query and sends it to Elasticsearch to be indexed. The downside is that each value is indexed under exactly the column name it has in the SQL query result. So, if your database column is named 'docText', in a table named 'myTable', you might use this query:

SELECT docText FROM myTable

The JDBC import tool would then index one field, called docText. If you want to create two parallel fields in the index, you need to select the column twice and give each copy its own alias in the query (the column itself does not need to be renamed in the database), using the following syntax:

SELECT docText as flex_docText, docText as exact_docText FROM myTable
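
A quick way to check what the aliased query returns is to run it against a throwaway database. The sketch below uses Python's built-in sqlite3 module as a stand-in for the real database (an assumption; the importer itself talks JDBC), and shows that each row comes back with two differently named columns carrying the same text.

```python
import sqlite3

# In-memory SQLite database standing in for the real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (docText TEXT)")
conn.execute(
    "INSERT INTO myTable (docText) VALUES (?)",
    ("This is the text to be indexed.",),
)

cursor = conn.execute(
    "SELECT docText AS flex_docText, docText AS exact_docText FROM myTable"
)

# The result set is labeled with the aliases, not the original column name.
column_names = [column[0] for column in cursor.description]
rows = [dict(zip(column_names, row)) for row in cursor.fetchall()]
conn.close()

print(column_names)
print(rows)
```

Each dictionary in 'rows' has the same shape as the document indexed by hand earlier: one value under 'flex_docText' and the same value under 'exact_docText'.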

In fact, you can extract the same data as many times as you want, under different names, and apply different analysis to the data in the index mapping.  Does that really work? Yes, that really works.  Now if you want to highlight search results and avoid duplicates, that's a story for another day.