Wednesday, February 15, 2012

Convert PDF to Text, HTML, Word...

I've put together a small demonstration site that converts PDFs to clean HTML: you can try it out here.  There are many caveats (e.g. the current server is not very stable, it only works in JavaScript-enabled browsers, it handles only 5 documents at a time, each document has a size limit, there is no OCR, etc.).  But I thought I'd get it out there so that legal data fans can experiment with it and start a conversation about data encoding. Do you have a favorite way of getting text out of PDFs?

PDF documents are the only available starting point for a lot of government legal information.  I've discussed some of the problems with this before; suffice it to say that it is a recurring problem in legal informatics.  To extract useful metadata and make the documents web-accessible, it is usually necessary to convert the PDF to a more portable format. The devil is in the details.

Many programs will convert PDF to text, HTML or MS Word, but each involves trade-offs, the biggest being whether to preserve the layout or to make metadata easier to extract.  Most of the PDF-to-HTML converters I have found, for example, add a huge number of extra tags that clutter the text, break up sentences and paragraphs, and generally make it very hard to extract meaningful metadata from the document.
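
To make the tag-clutter problem concrete, here is a minimal Python sketch that pulls the readable text back out of layout-heavy converter output. The sample markup and the names `_TextCollector` and `strip_tags` are made up for illustration, not taken from any particular converter; the sketch only uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are dropped.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Illustrative input: one sentence split across positioned spans.
cluttered = (
    '<span style="left:72px">The court </span>'
    '<span style="left:131px"><b>held</b> that</span>'
)
print(strip_tags(cluttered))
```

This recovers the sentence, but notice that it also throws away anything the tags did say about structure (headings, paragraph breaks), which is exactly the trade-off described above.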

I've combined a couple of open source programs (pdf2text -> txt2html) with an open source document upload tool to make this small site.  If you find it useful, or need to convert large volumes of PDF documents to clean HTML, get in touch.
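
For anyone who wants to try the same two-step idea locally, here is a rough Python sketch. It is not the code behind the demonstration site: it assumes the Poppler `pdftotext` command-line tool is installed for the extraction step, and it uses a small hand-written function as a simplified stand-in for txt2html. The function names `pdf_to_text` and `text_to_clean_html` are my own inventions for this example.

```python
import html
import re
import subprocess


def pdf_to_text(pdf_path):
    """Extract plain text by calling the external `pdftotext` tool (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_to_clean_html(text):
    """Wrap blank-line-separated blocks in <p> tags (simplified stand-in for txt2html)."""
    text = text.replace("\f", "\n\n")  # treat page breaks as paragraph breaks
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Rejoin hard-wrapped lines and collapse runs of whitespace.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Section 1.\nShort title\n\nA & B <c>"
    print(text_to_clean_html(sample))
```

With a real file, the two steps chain as `text_to_clean_html(pdf_to_text("some.pdf"))`, where the file name is a placeholder. The output has one tag per paragraph and nothing else, which keeps it easy to search for metadata, at the cost of losing the original layout.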