Category Archives: High Performance

The Humanities Programmer

Following a comment by Alex O’Connor I pushed all my code up to GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody would be interested in studying how I am writing the code for this project. On closer reflection, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. The ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ was essential, but tasks like text formatting also required programming knowledge at that time. One of these articles also emphasizes that, in the future, programming languages for “complex pattern-matching problems that arise in musical or graphic art analysis” will become important too. A 1971 article in the same journal gives an overview of languages ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years for humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, while Python is becoming more popular as a language for Natural Language Processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars; examples are the 2013 DARIAH Summer School in Goettingen, this year’s Summer School in Goettingen, or the ESU in Leipzig. The Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, has also moved from teaching Java programming to Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for a single research task), it becomes relevant to explore how these scholars, often without a CS background, write code and create software. Such studies will contribute to future developments of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project progressed, and some might even be interested in contributing.


Better Performance – Text Streaming

One way to get faster performance when processing a big text corpus is to use streaming methods. Streaming basically means keeping the data stored in a file and accessing it when necessary, instead of keeping all the data in memory.

Recently I looked into gensim, a library for topic modelling with Python, which provides easy ways to save/load text corpora, dictionaries, etc. Its tutorial also suggests creating a corpus object that uses a streaming method:

class MyCorpus(object):
    def __iter__(self):
        # assume there's one document per line, tokens separated by whitespace
        for line in open('mycorpus.txt'):
            yield dictionary.doc2bow(line.lower().split())

This corpus class reads the lines directly from a text file instead of keeping the whole text in memory. A MyCorpus instance is fairly small, because it holds just a reference to ‘mycorpus.txt’, which makes it very memory efficient.
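
For context, the corpus is used together with a gensim dictionary and can then be iterated over without ever loading the whole file (roughly following the gensim tutorial; ‘mycorpus.txt’ is the tutorial’s sample file):

from gensim import corpora

# build the dictionary by streaming over the same file (one document per line)
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

corpus = MyCorpus()
for bow in corpus:   # each document is read and converted to a bag-of-words vector on the fly
    print(bow)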
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored. The TxtCorpus class retrieves them, or the data stored in them; in my example below, the method get_txt() returns the transcriptions:

class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
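
The item_from_pickle() helper is not shown here; a minimal version, assuming the dictionary of Letter objects was pickled to a single file, could look like this:

import pickle

def item_from_pickle(file_name):
    # load the pickled dictionary (letter id -> Letter object) from disk
    with open(file_name, 'rb') as f:
        return pickle.load(f)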

Another Performance Test

Out of personal interest I made another performance test. This time I targeted a more complex function, written to import a huge table of transcriptions and metadata about all the letters from an Excel file. Importing from Excel is very easy thanks to Python’s xlrd module. Each row in the Excel file contains the transcription of a letter page and the metadata of the letter, which means that if a letter has more than one page, the metadata is duplicated. My function loops over the rows in the Excel table and creates Letter objects (my custom class to represent letters). The function merges multiple pages of the same letter and adds them to the same Letter object, tokenizes the transcriptions, and cleans the text of XML-like markup and punctuation. The final step is to store all the Letter objects in a file using Python’s shelve module, so that these objects do not have to stay in memory.
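
For orientation, the Excel reading part with xlrd looks roughly like this (the file name and the column layout are assumptions, not the actual spreadsheet structure):

import xlrd

book = xlrd.open_workbook(filename="letters1916.xls", encoding_override="utf-8")
sheet = book.sheet_by_index(0)

for row_idx in range(1, sheet.nrows):                        # skip the header row
    row = sheet.row_values(row_idx)
    letter_id, page, transcription = row[0], row[1], row[2]  # assumed column order
    # ... create a new Letter object or add the page to an existing one ...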

My approach for the storage is to save all the Letter objects in a dictionary, with the letter ids as keys, and then push the whole dictionary to the shelve file. This worked quite well and was not too slow (considering the number of letters and what the function has to do).
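
In outline, that storage step boils down to something like this (the key under which the dictionary is stored is an assumption):

import shelve

letters = {}                      # letter id -> Letter object, filled by the import loop

d = shelve.open("letters.shelve")
d["letters"] = letters            # push the whole dictionary to the shelve file in one go
d.close()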

[Figure: performance_3 – profiling output for the importer function]

The function took about 19–21 seconds (based on three tests) to run over about 2200 entries in the Excel file, create 850 Letter objects, add the letters to a dictionary, clean the transcriptions and merge them if they are from the same letter, and finally store everything in a shelve file. This is not too bad considering that most of the time (13.3 seconds) was used by the xlrd.open_workbook(filename=file_path, encoding_override="utf-8") call, which retrieves the data from the Excel file and creates a handy-to-use workbook object. Python’s cp1252.py decoding also took quite long. Furthermore, a lot of time was spent creating the Letter objects (1.6 seconds) and cleaning and adding the transcriptions (2.6 seconds). There might be some opportunity for optimization – I might come back to revising this function later in the project.

The stored shelve file now contains a dictionary with all the Letter objects. Out of curiosity I wrote a second function that does all the steps the first one does, but instead of creating a dictionary and then storing it, each Letter object is stored in the shelve file as soon as it is created – the shelve file is, after all, dictionary-like. I assumed that it might take a bit longer, and was surprised how long it actually took.

[Figure: performance_4 – profiling output when each Letter object is written to the shelve file directly]

Nearly 60 seconds!!! That was three times as long as the other function. After I had a look at the statistics returned by cProfile, I found that most of the time was spent calling the keys() method on the shelve file in order to retrieve the keys (= letter ids) that were already stored. I am not sure why this happened. Is it because keys() returns a list instead of a faster data structure like a set? Either way, 40 extra seconds is very significant. Could it have something to do with DeadlockWrap?


import shelve

d = shelve.open("letters.shelve")

# Then follows the for loop that gets the data from the Excel file and prepares it for storage

if letter_id not in d.keys():   # the interesting part: if the letter id is not already used
    d[letter_id] = l            # as a key in the shelve file, add the new Letter object

d.sync()    # the shelve file has to be synced to ensure the object is written to the file

d.close()   # and closed after the for-loop

Interestingly, by replacing the d.keys() check with a set ‘s’ that keeps track of the letter ids already stored (each new id is added to the set when its Letter object is written), the function took less than half the time, even though every object is still written to the shelve file directly:

[Figure: performance_5 – profiling output with the set-based duplicate check]

It is still not as fast as my first solution, and probably never will be, because of the numerous calls to the shelve file and the constant syncing. It was, however, interesting to see how a small change like replacing d.keys() with the set ‘s’ can have such a tremendous effect.
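
For illustration, a minimal sketch of the set-based variant (the loop and the variable prepared_rows are placeholders; the real function does much more per row):

import shelve

d = shelve.open("letters.shelve")
s = set()                            # letter ids that have already been written

for letter_id, l in prepared_rows:   # placeholder for the loop over the prepared Excel rows
    if letter_id not in s:           # membership test on the set instead of d.keys()
        d[letter_id] = l             # write the Letter object directly to the shelve file
        s.add(letter_id)             # remember the id so it is not checked against the file again
    d.sync()

d.close()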

High Performance Text Processing: An example

It is absolutely amazing how much a program’s performance can be optimised – or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens, stripping punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the more than 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (over several attempts).
[Figure: performance_1 – profiling output for the first tokenizing function]
This is only one function, and given that the letter corpus will grow over the next years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained all the characters I did not want. I split the text on whitespace and looped over each item in the list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python’s re module). It worked okay, but as I said before, it was quite slow and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.

import re

def strip_punct_regex(strg):
    # split on whitespace and keep only the core of each token: word characters with
    # word-internal apostrophes, dots, slashes and hyphens, single characters, or '&'
    lst_words = strg.split()
    pat = r"\W*(\w+[\w'./-]*\w+|\w|&)\W*"
    regex = re.compile(pat)
    lst_clean = []
    for item in lst_words:
        mm = regex.match(item)
        if mm:
            lst_clean.append(mm.group(1).lower())
    return lst_clean
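
As a quick illustration (this example is not from the original post), the function lowercases the tokens and strips the surrounding punctuation:

>>> strip_punct_regex("Dear Sir, I received your letter & the parcel.")
['dear', 'sir', 'i', 'received', 'your', 'letter', '&', 'the', 'parcel']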

From about 30 lines of code, two for loops and several if-else statements, I came down to 10 lines, one for loop and one if statement, thanks to regular expressions. Although the extensive use of regular expressions is sometimes discouraged, and there can be performance issues as well (see Dive into Python), in my case I found that it made my code much simpler and also quicker.
[Figure: performance_2 – profiling output for the regex-based function]
Online resources:
On performance with regular expressions see also: Python – Performance Tests of Regular Expressions
On the Python re module: the Python documentation; many examples can also be found on the website Python Module of the Week
YouTube video on High Performance Text Processing

Storing the Letter Objects: Python’s Shelve

Recently I looked into ways to store my Python Letter objects to a file after they are created. This has two advantages:

  1. Increased performance, because the 850 objects do not have to be kept in memory.
  2. The importer function takes quite a bit of time – about 20 seconds. If I want to run the whole program several times for testing, it is very annoying to wait 20 seconds for the importer on each run (the reason why it takes so long is discussed in another post). My data won’t change (at least not during testing), and therefore it is handy to load it directly from a file instead of running the importer module.

I found that Python’s pickle and shelve libraries were useful tools to work with. A good tutorial on shelve can be found in O’Reilly’s book Programming Python, or on the Module of the Week blog. The shelve module is great because it allows storing objects in a dictionary-like way, where the objects can be fetched by keys.
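
A minimal sketch of this dictionary-like usage (the Letter class shown here is a simplified placeholder for the real one):

import shelve

class Letter(object):                    # simplified placeholder for the real Letter class
    def __init__(self, letter_id, text):
        self.letter_id = letter_id
        self.text = text

db = shelve.open("letters_demo.shelve")
db["L1916_42"] = Letter("L1916_42", "Dear Sir, ...")   # store an object under its id
db.close()

db = shelve.open("letters_demo.shelve")
letter = db["L1916_42"]                  # fetch it back by key, even in a later run
db.close()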

High Performance Text Processing

I realised that the program I wrote in the first week to import and process the letters and print a simple histogram-like visualisation of words and their distribution was pretty slow. It worked great with the unit tests I had set up, because they mostly used short texts to check whether the individual functions do what I want them to do. When I tried the program on the more than 850 letters, however, it took ‘forever to get a response’ (up to a minute). After searching online for a solution I found a few good guidelines on how to make the program faster.
On the Python homepage there is a section on performance tips, which was one of the starting places for me; there are many useful tips there. To monitor the performance of functions, the Python cProfile and profile modules are very good and easy to use, and they gave me a good idea of where I could optimise my program. I also found a video by Daniel Krasner on high performance text processing very useful, as it gives an easy and simple introduction to the problem and a few useful solutions.
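
To give an idea of how such profiling is set up with cProfile (a sketch; run_importer is a placeholder for the actual entry point of my program):

import cProfile
import pstats

cProfile.run("run_importer()", "importer.prof")    # profile the call and write the stats to a file
stats = pstats.Stats("importer.prof")
stats.sort_stats("cumulative").print_stats(10)     # show the ten most expensive calls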

lxml and shelve

As described in a previous post, I use shelve, a pickle-based Python module, to serialize my Letter objects so they do not have to stay in memory and can be cached on disk to improve performance. The problem I encountered is that pickle does not like lxml objects. The problem is also described by other people here: http://stackoverflow.com/questions/8274438/saving-an-lxml-etree-elementtree-object

So, the easiest solution for me was to switch back to strings: instead of pickling lxml objects, I pickle an XML string and add a method to my custom Letter object that transforms the string into an lxml _Element object.
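
A minimal sketch of that idea (the attribute and method names are assumptions, not the actual implementation):

from lxml import etree

class Letter(object):                     # simplified placeholder for the real Letter class
    def __init__(self, xml_string):
        self.xml_string = xml_string      # plain string, so the object can be pickled/shelved

    def get_element(self):
        # rebuild the lxml _Element on demand instead of storing it on the object
        return etree.fromstring(self.xml_string)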