Wednesday, May 18, 2011

How to convert Text to HTML: Using txt2html Perl Module

To convert CA statutes from text to HTML, I used the txt2html module, written in Perl by Kathryn Andersen. This was step 1 of the 5-step process I used to reintroduce metadata to the statutes.
While there are a number of free online text-to-HTML converters, and some commercial packages you can download, txt2html is a fast, flexible and accurate open source option. It also works from the command line, so I could combine it with the next steps I needed, such as identifying section headings.
After installing the module on my Mac (described below), I used the following command, all on one line, from the Terminal:
txt2html --explicit_headings --indent_par_break --make_tables --make_anchors --xhtml --outfile /path/to/file.html /path/to/file
What does this do? Replace "/path/to/file" with the actual path to the text file you want to convert, and "/path/to/file.html" with the path and name of the new document you want to create.
One of the advantages of this module is that it has a good assortment of options. I found that the ones above were needed to keep the structure of the text file in place. For example, without these options, some of the subdivisions of a statute were collapsed into the same paragraph as earlier subdivisions.
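If you have a whole directory of text files, the same options can be applied in a loop. The following is only a sketch under a few assumptions: convert_dir and the TXT2HTML variable are names I made up for illustration, the input files are assumed to end in .txt, and the option names are spelled with underscores, as in the txt2html documentation.

```shell
# Sketch: convert every .txt file in a directory with the same options.
# convert_dir and TXT2HTML are illustrative names, not part of txt2html.
convert_dir() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.txt; do
    # Skip the unexpanded pattern when the directory has no .txt files.
    [ -e "$f" ] || continue
    base=$(basename "$f" .txt)
    # TXT2HTML defaults to txt2html; set it to echo for a dry run.
    "${TXT2HTML:-txt2html}" --explicit_headings --indent_par_break \
      --make_tables --make_anchors --xhtml \
      --outfile "$out_dir/$base.html" "$f"
  done
}
```

You would call it as "convert_dir /path/to/textfiles /path/to/htmlfiles", substituting your own directories. Setting TXT2HTML=echo first prints each command instead of running it, which is a cheap way to check the paths before converting anything.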
Here is a before and after of a section of the text:

Before:
15210. Notwithstanding any other provision of this code, as used in
this chapter, the following terms have the following meanings:
(a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations,
which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
(b) (1) "Commercial motor vehicle" means any vehicle or
After:
<p>15210. Notwithstanding any other provision of this code, as used in this chapter, the following terms have the following meanings:
<br/>&nbsp;&nbsp;&nbsp;(a) "Commercial driver's license" means a driver's license issued
by a state or other jurisdiction, in accordance with the standards
contained in Part 383 of Title 49 of the Code of Federal Regulations, which authorizes the licenseholder to operate a class or type of
commercial motor vehicle.
<br/>&nbsp;&nbsp;&nbsp;(b) (1) "Commercial motor vehicle" means any vehicle or
This did everything I wanted, and nothing I didn't. Installation instructions after the fold.

Installing txt2html can be quite a challenge: it requires about a zillion other packages, and each of them also has many prerequisites of its own. (Google "txt2html install", and you'll see that I'm not the first to face this challenge.) The instructions here tell you where to get the source code, and then say "Look out for the dependencies!"
To deal with all of this, install Perl's package manager, cpanm. On Mac OS X, cpan was already installed, so to get the more up-to-date cpanm, I followed the instructions here and ran the following (with root privileges):
sudo cpan App::cpanminus
This results in about a dozen prompts asking whether you want to install various packages. Stay with it and just hit return for each (or follow the instructions here to silence these prompts).
Next, run:
sudo cpanm HTML::TextToHTML
Now you should be able to convert any text file to HTML using the command above, or minimally:
$ txt2html filename.txt > filename.html
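If the command is not found, or Perl complains about a missing module, a quick check along these lines may help narrow things down. This is a sketch, not part of the official install instructions: check_txt2html is a name I made up, and it only reports whether Perl can load HTML::TextToHTML and whether txt2html is on your PATH.

```shell
# Sketch: report whether the module and the command-line script are visible.
# check_txt2html is an illustrative name, not something the module provides.
check_txt2html() {
  # "perl -MModule -e 1" exits non-zero if the module cannot be loaded.
  if perl -MHTML::TextToHTML -e 1 2>/dev/null; then
    echo "HTML::TextToHTML is installed"
  else
    echo "HTML::TextToHTML is missing"
  fi
  # command -v prints the path of the script if the shell can find it.
  if command -v txt2html >/dev/null 2>&1; then
    echo "txt2html found at $(command -v txt2html)"
  else
    echo "txt2html not on PATH"
  fi
}

check_txt2html
```

If the module is reported as installed but the script is not on your PATH, the script may have been placed in a directory your shell does not search.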