
XML Processing with Python

As the Letters of 1916 is a crowdsourced project, the transcriptions of the letters contain irregular XML markup and, in some cases, XML that is not well-formed. At first I thought that I might be able to use the XML markup for my analysis, but the inconsistent quality of the encoding makes this impractical (see also my previous post).

Python has a number of libraries for XML processing (a list that also includes libraries outside the standard library is available here). The most popular ones are xml.etree, xml.dom, and xml.sax, which are all part of the Python standard library. I decided to use the lxml library, which has an API similar to that of xml.etree and was therefore easy enough to use. The library is pretty quick thanks to the underlying C libraries libxml2 and libxslt, and it supports XPath 1.0 and XSLT 1.0.

Thanks to the XPath support, getting all the text out of an XML-encoded document is as easy as:

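from lxml import etree

# `letters` is assumed to be an iterable of XML strings, one per transcription
for letter in letters:
    root = etree.fromstring(letter)
    # The result is a list of text nodes that can be combined with " ".join()
    text_lst = root.xpath(".//text()")
    text = " ".join(text_lst)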

The problem, however, was that a good deal of what was supposed to be XML or plain text was in reality not well-formed XML (I discussed this in another post). To find out how many of the letters would not parse, I made the following changes:

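from lxml import etree

# Initialise the counter once, before the loop, so that it accumulates
syntaxErr = 0
for letter in letters:
    try:
        root = etree.fromstring(letter)
        text_lst = root.xpath(".//text()")
    except etree.XMLSyntaxError:
        # The transcription is not well-formed XML
        syntaxErr += 1

print(syntaxErr, "letters could not be parsed")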
Because of the numerous syntax errors in the transcriptions, it was not possible to remove the XML markup by parsing the letters. In the end I found regular expressions to be the best solution for this task.
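A minimal sketch of what such a regular-expression approach could look like is given here. It is not the code used in the project: the pattern, the function name strip_markup and the sample string are all assumptions made for illustration. The idea is to treat anything between "<" and ">" as markup, replace it with a space, decode entities such as &amp;amp; and collapse the remaining whitespace.

```python
import re
from html import unescape

# Assumption: anything between "<" and ">" is markup (tags, comments, etc.)
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def strip_markup(transcription):
    """Remove tag-like spans, decode entities and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", transcription)
    decoded = unescape(without_tags)
    return WS_RE.sub(" ", decoded).strip()


# Hypothetical sample that is not well-formed: <hi> is never closed
sample = "<p>Dear <hi rend='underline'>Mother,</p> I am well &amp; safe.<lb/>"
print(strip_markup(sample))
```

Because no parser is involved, unclosed or mismatched tags do not matter. The trade-off is that a literal "<" in the letter text would be mistaken for the start of a tag, and a tag in the middle of a word (for example a line break) would split that word in two, since every tag is replaced by a space.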


Kildare Launch

The Letters of 1916 project was officially launched on Research Night, 27th September 2013. Recently the project moved from Trinity College Dublin to its next phase at An Foras Feasa, the Digital Humanities center at the National University of Ireland Maynooth (NUIM). Following this move, the ‘Kildare Launch’ of the project took place on 8 May 2014 at NUIM. The evening started with an encoding and digital imaging lab. This was a great chance for the audience to get an introduction to how everyone can contribute to the Letters project by transcribing or uploading letters. The labs were followed by talks by Professor Susan Schreibman, Robert Doyle, Dr Brian Hughes, and Lar Joye. Videos of the presentations should be available soon on the Letters 1916 homepage.

Starting The Project

During the internship at the Center of High Performance Computing and the Letters of 1916 project I will build a text analysis tool for the online letter collection. The structure of this analysis tool can be roughly divided into three phases/steps: import of data – text analysis – visual output.

Each of these steps is a challenging task, and a number of issues are apparent from the very beginning:

Data Import: The letters are all encoded in some form of TEI/XML. But because this is a crowdsourced project, the data is certainly messy, and it is not clear what is encoded or how consistently. The same goes for the metadata. It will therefore be interesting to see how helpful the TEI markup will be in the final text analysis.
Text Analysis and Visual Output: As a first step, the text analysis tool will just produce a histogram-like word count and frequency distribution. For the text processing part, it will be important to clean the text of punctuation and markup to allow proper tokenization into words.
…and there will be more challenges ahead as the internship progresses.
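To make the planned first step more concrete, here is a rough sketch of a word count with a histogram-like text output. It is only an illustration under stated assumptions, not the tool itself: the function names word_frequencies and text_histogram, the tokenization rule (runs of letters, optionally joined by an apostrophe, with everything lowercased) and the sample sentence are all made up for this example, and the input is assumed to be already free of markup.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes.
# [^\W\d_] matches any Unicode letter, so names with accents stay intact.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def word_frequencies(text):
    """Lowercase the text, drop punctuation and count the words."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(tokens)


def text_histogram(freqs, top=5):
    """Return one line per word: the word, a bar of '#' and the count."""
    lines = []
    for word, count in freqs.most_common(top):
        lines.append(f"{word:<10} {'#' * count} {count}")
    return "\n".join(lines)


# Hypothetical sample text standing in for a cleaned letter
sample = "Dear Mother, I am well. I hope you are well, and Father is well too."
freqs = word_frequencies(sample)
print(text_histogram(freqs, top=3))
```

Counter keeps the whole frequency distribution, so the same object could later feed a proper plot instead of the text bars used here.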