2nd February 2005, 1 min read

Gutenberg books as marked-up XML

Project Gutenberg is a fantastic project where a large collection of books has been scanned and made available for free. The catch is that the books are distributed as plain text, which can make automated processing awkward: even extracting the title of a book takes some work (though that is an easy case). However, the nice people at the HTML Writers Guild have marked up a large collection of Gutenberg books in XML, using a publicly available DTD.
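To make the difference concrete, here is a minimal Python sketch of title extraction in both formats. Both sample documents are made up for illustration: the plain-text header assumes a "Title:" line, and the XML element names (book, frontmatter, title) are assumptions, not the element names of the actual HTML Writers Guild DTD, which should be checked before adapting this.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical plain-text header (assumes the file carries a "Title:" line).
PLAIN_TEXT = """The Project Gutenberg EBook of Moby Dick, by Herman Melville

Title: Moby Dick

Author: Herman Melville
"""

# Hypothetical markup: these element names are assumptions for illustration,
# not the element names defined by the real DTD.
MARKED_UP = """<book>
  <frontmatter>
    <title>Moby Dick</title>
    <author>Herman Melville</author>
  </frontmatter>
  <body><chapter><para>Call me Ishmael.</para></chapter></body>
</book>"""


def title_from_text(text):
    """Guess the title by pattern matching; breaks if the header layout differs."""
    match = re.search(r"^Title:[ \t]*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def title_from_xml(xml_source):
    """Look up the first <title> element anywhere in the parsed document."""
    root = ET.fromstring(xml_source)
    node = root.find(".//title")
    if node is None or not node.text:
        return None
    return node.text.strip()


if __name__ == "__main__":
    print(title_from_text(PLAIN_TEXT))
    print(title_from_xml(MARKED_UP))
```

The plain-text version depends on a guess about the header layout, while the XML version only needs to know which element holds the title, and the DTD documents exactly that.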

Possible application: automatically integrating a given book into a content management system (or a learning management system).

You might also want to consider GutenMark as a tool to process Gutenberg books (output to LaTeX and HTML).