Category Archives: Crowdsourcing Project

Twitter chat on #AskLetters1916

About once a month the Letters of 1916 project organises a Twitter chat. Different topics related to the letters project have been discussed in the past – Women in the Great War, Crowdsourcing, etc. Tonight the discussion was about Text Analysis and Topic Modelling of the 1916 letters.

Here is a link to the Twitter page: Link

XML Encoding of Letters

The Letters of 1916 project is a crowdsourced Digital Scholarly Editing project, and the transcribers are encouraged to mark up the letter transcriptions with TEI/XML tags. The TEI markup should eventually be used to identify features such as line breaks, titles and headers, but also names of people, organisations and places. Because it is assumed that most of the transcribers have no previous experience with TEI or XML, an editor with a toolbar is part of the transcription interface to guide the transcriber towards the correct TEI elements.
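To give an impression of what such markup looks like, here is a hypothetical snippet (the element names follow the TEI Guidelines; the letter content is invented):

```xml
<p>Dear <persName>Mary</persName>,<lb/>
I arrived safely in <placeName>Dublin</placeName> and called at the
offices of the <orgName>Irish Volunteers</orgName>.</p>
```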
One of my first tasks was to have a look at the crowdsourced transcriptions and find out to what extent they were marked up. It was interesting to find that there was a lot of markup in place. My replacement function counted 7395 start-tags, end-tags, and XML comments. Set against the 166149 word tokens of the letters, however, the amount of encoded text does not seem so much anymore. The numbers cannot be compared directly, but if we assume that there could be a line break at least every 10 words, we get a ratio of about 45% markup. Again, this is highly speculative, because close investigation of individual letters shows that some are encoded in great detail (using the TEI del element, add element, line-break element, hi element and others), while others contain no tags at all.
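As a rough illustration of how such a count can be produced without parsing, a simple regular expression will do (this is a sketch of my own, not the project's actual replacement function):

```python
import re

# Match XML comments first, then start-tags, end-tags and empty-element tags.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|</?[A-Za-z][^>]*>", re.DOTALL)

def count_markup(transcription):
    """Return the number of tags and XML comments found in the string."""
    return len(TAG_OR_COMMENT.findall(transcription))

sample = 'Dear John,<lb/> I <del>was</del><add>am</add> well. <!-- faded ink -->'
print(count_markup(sample))  # 6 (five tags and one comment)
```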
The next step was to test whether the transcriptions were well-formed XML and could be parsed with one of Python's XML libraries. I used the lxml library for this task and found that over 40% of the letters would throw an XMLSyntaxError. In most cases this was due to the use of a raw ‘&’ instead of the entity ‘&amp;’. After I had dealt with this problem by escaping all ‘&’ characters before trying to parse the transcription strings as XML, I still counted about 100 XMLSyntaxErrors out of 850 letters. In most of these cases the cause was XML that was simply not well-formed: opening tags without closing tags or (less commonly) overlapping elements.
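The ampersand escaping can be sketched with a regular expression like the following (my own sketch; the entity list is an assumption, covering the five predefined XML entities and numeric character references):

```python
import re

# Replace '&' only when it does not already begin an entity reference,
# e.g. leave '&amp;' and '&#38;' untouched but fix a bare 'Smith & Son'.
RAW_AMPERSAND = re.compile(r"&(?!amp;|lt;|gt;|quot;|apos;|#\d+;|#x[0-9a-fA-F]+;)")

def escape_raw_ampersands(text):
    return RAW_AMPERSAND.sub("&amp;", text)

print(escape_raw_ampersands("Smith & Son wrote to J. &amp; J. Ltd."))
# Smith &amp; Son wrote to J. &amp; J. Ltd.
```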

XML Processing with Python

As the Letters of 1916 is a crowdsourced project, the transcriptions of the letters contain irregular XML markup and, in some cases, XML that is not well-formed. At first I thought that I might be able to use the XML markup for my analysis, but the inconsistent quality of the encoding makes this impractical (see also my previous post).

Python has a number of libraries for XML processing (a list that also includes libraries outside the standard library is available here). The most popular ones are xml.etree, xml.dom, and xml.sax, which are all part of the Python standard library. I decided to use the lxml library, which has an API similar to xml.etree and was therefore easy enough to use. The library is pretty quick thanks to the underlying C libraries libxml2 and libxslt, and it supports XPath 1.0 and XSLT 1.0.

Thanks to the XPath support, getting all the text out of an XML-encoded document is as easy as:

from lxml import etree

for letter in letters:
    root = etree.fromstring(letter)
    text_lst = root.xpath(".//text()")
    # The result is a list of text nodes that can be combined with " ".join()
    text = " ".join(text_lst)

The problem, however, was that a good deal of what was supposed to be XML or plain text was in reality not well-formed XML (I discussed this in another post). To find out how many of the letters would not parse, I made the following changes:

syntaxErr = 0
for letter in letters:
    try:
        root = etree.fromstring(letter)
        text_lst = root.xpath(".//text()")
    except etree.XMLSyntaxError:
        syntaxErr += 1

Because of the numerous syntax errors in the transcriptions, it was not possible to remove the XML markup from all letters with an XML parser. In the end I found regular expressions to be the best solution for this task.
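A minimal sketch of such a regex-based approach, assuming we only need the plain text and can tolerate losing all markup, even from transcriptions that are not well-formed:

```python
import re

def strip_markup(transcription):
    """Strip tags and comments from a string that may not be well-formed XML."""
    # Remove comments first, so '<' inside a comment cannot confuse the tag pattern.
    text = re.sub(r"<!--.*?-->", " ", transcription, flags=re.DOTALL)
    # Remove tags, including unmatched opening or closing tags.
    text = re.sub(r"</?[^>]+>", " ", text)
    # Normalise the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(strip_markup("I <del>was<add>am</add> very well.<lb/>"))
# I was am very well.
```

Unlike a parser, this never raises an error on an opening tag without a closing tag, which is exactly what the crowdsourced transcriptions require.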