Monday, August 3, 2015

Elasticsearch and MS SQL Server: Would you like search with that?

In our data-rich world, everything goes better with search. We've gotten used to having search as a baked-in feature of applications and expect the search to be full-featured and flexible. So when building a data-rich enterprise web application, the question is often not whether to include search, but how.

If the data is stored on MS SQL Server, which is usually the case with the government customers we work with at Xcential, one option is the full-text search for MS SQL Server. However, full-text search is a relatively recent addition to SQL Server. Its feature set is limited and in my opinion its query language, an extension of SQL, gets over-complicated quite quickly.

There are also a number of open source search engines (Elasticsearch  and Solr, based on Apache Lucene, Sphinx) which are fast, robust and used in many of the search functions we are used to on websites such as Twitter, Facebook and Craigslist. The widespread use of these search engines also means that people are familiar with their features and they meet users' general expectations of how search should work. As with any open source project primarily developed in Linux environments, it is important to ask how well do these projects play with Windows?

In particular, how can the data from SQL Server be indexed and queried? My recent experience with Elasticsearch shows that, with some caveats, this can now be done rather painlessly.

Import SQL Server Data to Elasticsearch

If you search Google for 'SQL Server and Elasticsearch', there are many outdated references for how to create an index. Elasticsearch has deprecated the previous tools for importing SQL data, from what they used to call 'rivers'. But do not despair. Jörg Prante has built an excellent open-source JDBC importer to support a variety of SQL databases, including SQL server. The instructions for using the tool are clear and straightforward, but as Prante writes, Windows user have had lots of problems using it. There are some tricks.

The general instructions for using the jdbc importer are here. But it doesn't quite work. Steps 1 and 2 are to download and install elasticsearch. That actually goes quite smoothly. You will need Java (>1.8 update 20 or >1.7 update 55) installed on your system, but if I recall, the elasticsearch.bat script will walk you through installing that, if you don't have it already. Once elasticsearch is installed, I would recommend also installing the Marvel plugin (free developer trial) and starting the Sense interface. This will make it much easier to monitor the indexes you create and to test queries in elasticsearch.

Step 3 is to install the jdbc importer, using the command line wget application. While you could install wget by itself, you'll need the bash terminal from Cygwin for a later step, so might as well install Cygwin now. From the new Cygwin bash terminal, you can use wget to install the jdbc importer.

wget http://xbib.org/repository/org/xbib/elasticsearch/importer/elasticsearch-jdbc/<version>/elasticsearch-jdbc-<version>-dist.zip

Also in cygwin, move the zip file into your \Program Files (x86) directory and unzip using the unzip utility from Cygwin. (This may require administrative privileges.)

In steps 4 and 5, you download the jdbc driver from Microsoft and put it into the /lib folder of the JDBC driver package.  

At step 6 (set up database, make sure to allow TCP/IP connections), there are a number of nuances:
1. Enable TCP/IP in the SQL SERVER CONFIGURATION MANAGER ( (Start Menu > Microsoft SQL Server > Configuration Tools > SQL Server Configuration Manager > SQL Server Network Configuration)
2. Start the SQL SERVER BROWSER Service. To do this, open the CONFIGURATION MANAGER as an administrator (right click to open). Once in the Manager, right click the BROWSER SERVICE, click Properties and change the Start Mode to Enabled. (http://stackoverflow.com/a/21378235)
3. Explicitly set the TCP to 1433, as explained in these [two](http://stackoverflow.com/a/18850073) [answers](http://stackoverflow.com/a/24299346).
Also, change the database server authentication to 'Mixed Mode'. Give permissions on the database to a user and set the password for that user. 

Step 7 is to start the elasticsearch engine. Easy enough.

Step 8 has a sample script to import data into elasticsearch.

For windows, the variables in path names need to be in quotes and do not have braces. So I changed these lines in the script:
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
to
-cp "$lib/*" \
-Dlog4j.configurationFile="$bin"/log4j2.xml \ 
In order to run the script, I found it most convenient to copy and paste the text directly into the Cygwin terminal. Note, that to paste text into a Cygwin terminal, you can right click and chose 'paste', or type 'shift-Insert'.

I copied and pasted (shift-Insert) the script into the bash terminal.

I have updated the instructions for using the jdbc importer with MS SQL Server in this gist. Next, I'll discuss how to build a general purpose user interface for the Elasticsearch engine.