
Topic Modelling with Python: Gensim

One strand of my internship is an investigation into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship and because Python has good libraries for natural language processing and topic modelling. I tested the nltk and the gensim toolkits. The nltk is a well-known toolkit and I use parts of it occasionally; for an introduction I recommend the documentation and the O'Reilly book available via the NLTK website.

The gensim library describes itself as 'topic modelling for humans', so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip and it is built on NumPy and SciPy, which have to be installed in order to use it.
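As a first test, here is a minimal sketch of a gensim workflow with a toy corpus. The documents and the number of topics are made up purely for illustration; the real input will be the tokenized letter transcriptions.

from gensim import corpora, models

# toy documents standing in for cleaned, tokenized letter transcriptions
documents = [["dublin", "rising", "letter", "family"],
             ["letter", "family", "dublin", "home"],
             ["rising", "volunteers", "city", "dublin"]]

dictionary = corpora.Dictionary(documents)                 # maps each token to an id
corpus = [dictionary.doc2bow(doc) for doc in documents]    # bag-of-words vectors
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())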

Another Performance Test

Out of personal interest I ran another performance test. This time I targeted a more complex function that was written to import a large table of transcriptions and metadata about all the letters from an Excel file. Importing from Excel is very easy thanks to Python's xlrd module. Each row in the Excel file contains the transcription of one letter page together with the metadata of the letter, which means that if a letter has more than one page its metadata is duplicated. My function loops over the rows in the Excel table and creates Letter objects (my custom class to represent letters). The function merges multiple pages of the same letter into the same Letter object, tokenizes the transcriptions and cleans the text of XML-like markup and punctuation. The final step is to store all the Letter objects in a file using Python's shelve module, so that the objects do not all have to stay in memory.

My approach for the storage is to save all the Letter objects in a dictionary, with the letter ids as keys, and then push the whole dictionary to the shelve file. This worked quite well and was not too slow (considering the number of letters and what the function has to do); a minimal sketch of the idea follows.
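The sketch below assumes a stripped-down Letter class and made-up column positions in the spreadsheet; the real function also tokenizes and cleans the transcriptions.

import shelve
import xlrd

class Letter(object):                     # stripped-down stand-in for my Letter class
    def __init__(self, letter_id):
        self.id = letter_id
        self.pages = []

file_path = "letters.xls"                 # placeholder path to the Excel file
book = xlrd.open_workbook(filename=file_path, encoding_override="utf-8")
sheet = book.sheet_by_index(0)

letters = {}
for row_idx in range(1, sheet.nrows):     # skip the header row
    row = sheet.row_values(row_idx)
    letter_id, page_text = row[0], row[1] # made-up column positions
    if letter_id not in letters:
        letters[letter_id] = Letter(letter_id)
    letters[letter_id].pages.append(page_text)   # merge pages of the same letter

d = shelve.open("letters.shelve")
d["letters"] = letters                    # push the whole dictionary to the shelve file at once
d.close()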

[Figure: performance_3 – cProfile output for the dictionary-based import function]

The function took about 19–21 seconds (based on three runs) to iterate over roughly 2200 entries in the Excel file, create 850 Letter objects, add them to a dictionary, clean the transcriptions and merge pages belonging to the same letter, and finally store everything in a shelve file. This is not too bad considering that most of the time (13.3 seconds) was used by the xlrd.open_workbook(filename=file_path, encoding_override="utf-8") call, which retrieves the data from the Excel file and creates a convenient workbook object. The decoding in Python's cp1252.py also took quite long. Furthermore, a fair amount of time was spent creating the Letter objects (1.6 seconds) and cleaning and adding the transcriptions (2.6 seconds). There might be some opportunity for optimization – I might come back to revising this function later in the project.

The shelve file now contains a dictionary with all the Letter objects. Out of curiosity I wrote a second function that does all the steps the first one does, but instead of building a dictionary and then storing it, each Letter object is written to the shelve file as soon as it is created. The shelve file is, after all, dictionary-like. I assumed that it might take a bit longer, and was surprised how long it actually took.

[Figure: performance_4 – cProfile output when writing each letter to the shelve file directly]

Nearly 60 seconds!!! That is three times as long as the other function. After a look at the statistics returned by cProfile I found that most of the time was spent calling the keys() method on the shelve file in order to retrieve the keys (= letter ids) that were already stored. I am not sure why this happens. Is it because keys() returns a list instead of a faster data structure like a set? Forty extra seconds is still very significant. Could it have something to do with DeadlockWrap?


d = shelve.open("letters.shelve")

# Then follows the for loop that gets the data from the Excel file and prepares it for storage

if letter_id not in d.keys():    # This is the interesting part: if the letter id is not already used
    d[letter_id] = l             # as a key in the shelve file, add the new letter object

d.sync()    # the shelve file has to be synced to ensure the object is written to the file

d.close()   # and closed after the for-loop

Interestingly, by replacing the d.keys() lookup with a set 's' that stores all the keys of the letter objects written so far – s.update(letter_id) – the function took less than half the time, even though every object is still written to the shelve file directly:

[Figure: performance_5 – cProfile output for the set-based version]

It is still not as fast as my first solution, and probably never will be, because of the numerous calls to the shelve file and the constant syncing. It was nevertheless interesting to see how a small change like replacing d.keys() with the set 's' can have such a tremendous effect.
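For reference, a sketch of how the set-based check might look inside the loop. I use s.add() here, which adds a single id; the rest mirrors the snippet above.

d = shelve.open("letters.shelve")
s = set()                        # keeps track of the letter ids stored so far

# inside the for loop over the Excel rows:
if letter_id not in s:           # membership test against the set instead of d.keys()
    d[letter_id] = l             # write the new letter object straight to the shelve file
    s.add(letter_id)             # remember the id
    d.sync()

d.close()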

Getting Rid of XML markup with Regular Expressions

After I had found that the crowdsourced XML markup was not very helpful, and that over 10% of my 850 letters would not even parse with the lxml library, I experimented with regular expressions. I find the Python re library fairly easy to use and very handy. When it comes to regular expressions and Python there are many, many tutorials online. The documentation on the Python website is (as always) a good starting point, because it gives an overview of the module's functions and how to use them. I also found the Google Developers tutorial a good read. For a longer introduction with case studies, as well as critical remarks on when not to use regular expressions and on performance issues, see Dive into Python.

For my purposes the following code worked quite well:

pat = "<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)
for letter in letters:
    ''.join(expr.split(letter))
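Applied to a made-up snippet, the splitting looks like this:

sample = 'Dear <persName>John</persName>,<lb/> I hope this finds you well. <!-- page 1 -->'
print(''.join(expr.split(sample)))
# -> Dear John, I hope this finds you well.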

XML Encoding of Letters

The Letters of 1916 project is a crowdsourced Digital Scholarly Editing project, and the transcribers are encouraged to mark up the letter transcriptions with TEI/XML tags. The TEI markup should eventually be used to identify features such as line breaks, titles and headers, but also names of people, organisations and places. Because it is assumed that most of the transcribers have no previous experience with TEI or XML, an editor with a toolbar is part of the transcription interface to guide the transcriber towards the correct TEI elements.
One of my first tasks was to have a look at the crowdsourced transcriptions and find out to what extent they were marked up. It was interesting to find that there was already a lot of markup in place: my replacement function counted 7395 start-tags, end-tags and XML comments. Related to the 166149 word tokens of the letters, however, the amount of encoded text does not seem so large anymore. The numbers cannot be compared directly, but if we assume that a line break could occur roughly every 10 words, we would expect around 16,600 tags, so the 7395 actual tags amount to a rate of about 45% markup. Again, this is highly speculative, because close investigation of individual letters shows that some are encoded in great detail (using the TEI del, add, lb (line break), hi and other elements), while others contain no tags at all.
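A rough way to arrive at such counts, assuming the transcriptions are available in a list of strings called transcriptions (the pattern below is a simplification, not my actual replacement function):

import re

tag_pat = re.compile(r"<!--.*?-->|<[^>]+>", re.S)    # XML comments, start-tags and end-tags
tag_count = sum(len(tag_pat.findall(t)) for t in transcriptions)
token_count = sum(len(tag_pat.sub(" ", t).split()) for t in transcriptions)
print(tag_count, token_count)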
The next step was to test whether the transcriptions were well-formed XML and could be parsed with one of Python's libraries. I used the lxml library for this task, and found that over 40% of the letters would throw an XMLSyntaxError. In most cases this was due to the use of a bare '&' instead of the entity '&amp;'. After I had dealt with this problem by escaping the ampersands before trying to parse the transcription strings as XML, I still counted about 100 XMLSyntaxErrors out of 850 letters. In most of these cases the cause was XML that was simply not well-formed: opening tags without closing tags or (less commonly) overlapping elements.
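The ampersand replacement can be done along these lines (a sketch of my own, assuming the raw text is in a string called transcription; it escapes only bare ampersands so that anything already written as an entity is left alone):

import re

amp_pat = re.compile(r"&(?!\w+;|#\d+;)")        # a '&' that is not already part of an entity
fixed = amp_pat.sub("&amp;", transcription)     # escape it before handing the string to lxml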

XML Processing with Python

As the Letters of 1916 is a crowdsourced project, the transcriptions of the letters contain irregular XML markup and, in some cases, XML that is not well-formed. At first I thought that I might be able to use the XML markup for my analysis, but the inconsistent quality of the encoding makes this a rather futile attempt (see also my previous post).

Python has a number of libraries for XML processing (a list, including libraries that are not part of the standard library, is available here). The most popular ones are xml.etree, xml.dom and xml.sax, which are all part of the Python standard library. I decided to use the lxml library, which has an API similar to xml.etree and was therefore easy enough to use. The library is pretty fast thanks to the underlying C libraries libxml2 and libxslt, and it supports XPath 1.0 and XSLT 1.0.

Thanks to the XPath support, getting all the text out of an XML-encoded document is as easy as:

from lxml import etree

for letter in letters:
    root = etree.fromstring(letter)
    text_lst = root.xpath(".//text()")   # a list of all text nodes in the document
    text = " ".join(text_lst)            # combine them into a single string

The problem, however, was that a good deal of what was supposed to be XML or plain text was in reality not well-formed XML (I discussed this in another post). To find out how many of the letters would not parse, I made the following changes:

syntaxErr = 0                            # the counter has to be initialised outside the loop
for letter in letters:
    try:
        root = etree.fromstring(letter)
        text_lst = root.xpath(".//text()")
    except etree.XMLSyntaxError:
        syntaxErr += 1                   # count the letters that fail to parse

Removing the XML markup from the letter transcriptions with lxml was not possible because of the numerous syntax errors in the transcriptions. In the end I found regular expressions to be the best solution for this task.