Monday, November 28, 2011

How to Convert HTML to Text, With Formatting

My current best answer: this html2text package from Germany.  It can be installed easily on macOS with MacPorts ($ sudo port install html2text), and on other Unix-like systems through their package managers.  It has a number of useful options, and I use it like this:

html2text -nobs -ascii -width 200 -style pretty -o filename.txt - < filename.html
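
To run the same command over a whole folder of bills, a small wrapper along these lines might do. It is only a sketch: it assumes html2text is installed and on the PATH, and the function names (build_command, convert_all) and the folder layout are my own illustrations, not part of html2text.

```python
import subprocess
from pathlib import Path

# Same options as the command above.
HTML2TEXT_OPTIONS = ["-nobs", "-ascii", "-width", "200", "-style", "pretty"]


def build_command(dst):
    """Return the html2text argument list that writes to dst.

    The trailing "-" tells html2text to read the html from standard input,
    as in the one-off command above.
    """
    return ["html2text", *HTML2TEXT_OPTIONS, "-o", str(dst), "-"]


def convert_all(src_dir, dst_dir):
    """Convert every *.html file in src_dir to a .txt file in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.html")):
        dst = dst_dir / (src.stem + ".txt")
        with src.open("rb") as html:
            # check=True raises CalledProcessError if html2text fails on a file.
            subprocess.run(build_command(dst), stdin=html, check=True)
        written.append(dst)
    return written
```

Calling convert_all("bills_html", "bills_txt") (hypothetical folder names) would then leave one text file per html file, and stop at the first file html2text cannot process.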

So now that you know my current answer, here's the problem: not all html is created equal.  Legislative data is published in a variety of formats, some uglier than others.  Extracting information requires cleaning these formats up.

When I started to work with California legislation, I had the problem of converting the state's plain text into a simple html for use on web pages.  To do that, I used a Perl text2html module.  While it takes many steps to produce web-friendly html from California's laws, at least the plain text was not cluttered with formatting symbols and tags that could interfere with the core text.

The problems in other states are far worse.  Some versions of Iowa's bills, for example, appear to be published directly from Microsoft Word to the web, which means that they're littered with a maze of formatting information--sometimes positioning each word on the page--that is not related to the text of the bill.  Other states use hundreds of cells of an html table (or multiple tables) to format the bill.  Looking at the file on the state's website, you wouldn't know that the underlying data is so messy.

Simply stripping all of the html tags won't work, because that eliminates all the formatting information, including information that can change the bill's meaning (spaces, paragraphs).  That's unfortunate, because there are many html libraries that would make stripping out the tags easy (e.g. Beautiful Soup for Python, or similar libraries in other languages).  What I want to do is preserve the formatting, but do it with spaces and paragraphs, not tables or graphically positioned words.
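
To make the difference concrete, here is a rough sketch using only Python's standard html.parser module. The first function strips tags blindly; the second also strips them, but turns block-level tags into line breaks and table cells into spaces. The tag lists and function names are my own assumptions for illustration, and real bills would need far more cases than this handles.

```python
from html.parser import HTMLParser

# Assumed, incomplete lists of tags that imply a line break or a cell boundary.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
CELL_TAGS = {"td", "th"}


class NaiveStripper(HTMLParser):
    """Drops every tag and keeps only the character data."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class LayoutKeeper(NaiveStripper):
    """Also drops tags, but records block and cell boundaries as whitespace."""

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in CELL_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")


def strip_tags(html):
    parser = NaiveStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def keep_layout(html):
    parser = LayoutKeeper()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalize spaces within each line and drop the empty lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Given "<p>Section 1.</p><p>Section 2.</p>", strip_tags returns "Section 1.Section 2." with the paragraph boundary gone, while keep_layout returns the two sections on separate lines.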

Ironically, the most effective way to clean this messy data is also the easiest: copy and paste the bill as displayed in your web browser.  After all, the formatting was made for the browser to interpret, and the copy-paste function (at least on a Mac) is quite faithful to the formatting.  However, automating this copy and paste process is far from simple and, with one exception, I have not seen any programs that make use of this native browser capability to convert files in bulk.  The exception is the text-based web browser Lynx, which has a "lynx -dump" option.  However, this converter apparently has a number of faults, including an inability to process tables.  Anyone know how to use Chrome or Firefox to automate conversion of html to text for large numbers of files?  This is still the solution I'd prefer.

But barring that, I found a close second, in the form of the html2text program.  Although it's relatively old (2004), it's fast and deals reasonably with tables and other formatting such as underlining and strikeouts.

Edit: At the suggestion of Frank Bennett, below, I installed the w3m text browser and used it to produce formatted text from html using the following command-line syntax:

w3m filename.html -dump > file.txt

Like html2text, it is fast and produces clean output, though actually somewhat too clean.  The saved file loses some important formatting information, like <u> (underline) tags, so some caution is in order when using this method.
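
One way to exercise that caution is to check each html file for underline or strikeout markup before trusting the w3m output. The sketch below uses Python's standard html.parser; the list of tags worth flagging is my own guess at what matters in a bill, and count_markup is a made-up name.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of tags that mark added or deleted language in a bill.
MARKUP_TAGS = {"u", "ins", "s", "strike", "del"}


class MarkupCounter(HTMLParser):
    """Counts opening tags whose meaning a plain-text dump may drop."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <U> and <u> count together.
        if tag in MARKUP_TAGS:
            self.counts[tag] += 1


def count_markup(html):
    """Return a dict mapping each flagged tag to how often it appears."""
    parser = MarkupCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)
```

A file for which count_markup returns an empty dict is probably safe to convert with w3m; anything else deserves a second look, or a run through html2text instead.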