Monthly Archives: May 2014

The Humanities Programmer

Following a comment by Alex O’Connor I pushed all my code up on GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody would be interested in studying how I am writing the code for this project. On closer reflection, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. The ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ was essential, but tasks like text formatting also required programming knowledge at that time. One of these articles also emphasizes that programming languages for the “complex pattern-matching problems that arise in musical or graphic art analysis” would become important in the future. A 1971 article in the same journal gives an overview of languages ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years among humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, while Python is becoming more popular as a language for natural language processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars, such as the 2013 DARIAH Summer School in Goettingen, this year’s Summer School in Goettingen, or the ESU in Leipzig. The Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, has also moved from teaching Java to Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for a single research task), it becomes relevant to explore how these scholars, often without a CS background, write code and build software. Such studies can contribute to future developments of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project progresses, and some might even be interested in contributing.

 

tf-idf – Term Frequency Inverse Document Frequency

Term Frequency Inverse Document Frequency, or tf-idf for short, is a way to measure how important a term is in the context of a document or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
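
In one common formulation, the weight of a term t in a document d is tf(t, d) * log(N / df(t)), where tf(t, d) is the number of times t occurs in d, df(t) is the number of documents containing t, and N is the total number of documents in the corpus. As far as I can tell, gensim’s default weighting is of this form, with a base-2 logarithm and the resulting document vectors normalised to unit length, which is why both weights in the example below come out as 0.7071.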

With gensim, tf-idf can be calculated using the gensim.models.tfidfmodel module:

from gensim import models

doc_bow = [(0, 1), (1, 1)]         # bag-of-words vector, for instance created with Dictionary.doc2bow()
tfidf = models.TfidfModel(corpus)  # corpus is the bag-of-words corpus built earlier in the tutorial
print(tfidf[doc_bow])
# Result: [(0, 0.70710678), (1, 0.70710678)]

This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words”, a list of tuples of word id and frequency, is used as the corpus, and the TfidfModel class transforms the values into “TfIdf real-valued weights”.
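
To give a slightly more complete picture, here is a minimal sketch of how a set of tokenized letters could be turned into tf-idf weights; the tiny example documents and variable names are placeholders, not my actual data:

from gensim import corpora, models

# two tiny placeholder "documents", already tokenized
documents = [["dublin", "castle", "letter"],
             ["letter", "from", "galway"]]

dictionary = corpora.Dictionary(documents)               # maps each word to an id
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                        # train the tf-idf model
for doc in tfidf[corpus]:                                # transform the whole corpus
    print(doc)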

Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured texts are for instance plain text files, such as the letters I am working on after the XML markup is removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project as it allows me to identify names and, if necessary, to build up a stopword list of names that need to be stripped in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter’s content.
Another problem is that at this stage a big part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon); for Lady Clonbrock’s correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature‘.

Like Jockers and Mimno I am also using the Stanford NLP software. It is a Java-based suite that includes different tools for natural language processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly and there is a lot of documentation online. There are also several interfaces to other programming languages available; I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file in the NLTK module. On line 72 I simply added the path to def config_java():

def config_java(bin="C:/Program Files/Java/jdk1.8.0_05/bin/java.exe", options=None, verbose=True):

One issue that kept me occupied for a while was a ‘Java command failed!’ error. Eventually I found that the problem was that config_java pointed to an older version of the JDK (1.7).
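
For illustration, tagging a sentence through the NLTK interface looks roughly like the sketch below; the paths to the classifier model and the jar file are placeholders for wherever the Stanford NER download lives, and depending on the NLTK version the class is called StanfordNERTagger or NERTagger:

from nltk.tag.stanford import StanfordNERTagger

# paths to the downloaded 3-class model and the Stanford NER jar (placeholders)
st = StanfordNERTagger('classifiers/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar')

tokens = "Lady Clonbrock wrote from Galway in May 1916".split()
print(st.tag(tokens))
# e.g. [('Lady', 'O'), ('Clonbrock', 'PERSON'), ..., ('Galway', 'LOCATION'), ...]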

Better Performance – Text Streaming

One way to get better performance when processing a big text corpus is to use streaming. Streaming basically means keeping the data stored in a file and accessing it only when necessary, instead of keeping all of it in memory.

Recently I looked into the gensim library, a library for topic modelling with Python, and it provides easy ways to save/load text corpora, dictionaries etc. Its tutorial also suggests creating a corpus object that uses a streaming method:

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

This corpus class reads the lines directly from a text file instead of keeping the whole text in memory; a MyCorpus instance is fairly small because it only holds a reference to ‘mycorpus.txt’. This is very memory efficient.
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored. The TxtCorpus class retrieves the objects, or data stored in them. In the example below the method get_txt() returns the transcriptions:

class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
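
The helper item_from_pickle() is not shown above; assuming the dictionary of Letter objects was written with pickle.dump(), a minimal sketch of it would be:

import pickle

def item_from_pickle(file_name):
    # load and return the pickled dictionary of Letter objects
    with open(file_name, 'rb') as f:
        return pickle.load(f)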

Latent Semantic Analysis (LSA)

“LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected usage of words in passages of discourse.” Sounds difficult, and I think it is. It seems to be one of the most used techniques for topic modelling in DH. Gensim supports it (as LSI), and Mallet, another great tool for topic modelling, implements the closely related Latent Dirichlet Allocation (LDA). I found a general introduction here: Introduction
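
In gensim the technique is available as the LsiModel class. A minimal sketch, assuming the dictionary and the tf-idf transformed corpus from the earlier posts, and an arbitrary number of topics chosen just for illustration:

from gensim import models

# corpus_tfidf and dictionary are assumed to come from the earlier tf-idf step;
# num_topics=10 is an arbitrary choice
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

for topic in lsi.print_topics(5):   # show the five strongest topics
    print(topic)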

 

Topic Modelling with Python: Gensim

One part of my internship is an investigation into topic modelling of the 1916 letters. I decided to use Python because I was already familiar with the language before I started the internship, and Python has good libraries for natural language processing and topic modelling. I tested the NLTK and the gensim toolkit. NLTK is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O’Reilly book available via the NLTK website.

The gensim library describes itself as a library for ‘topic modelling for humans’, so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip and is built on NumPy and SciPy, which have to be installed in order to use it.

Topic Modelling Tools

I had a look at a number of topic modelling tools. The first was Mallet, a tool frequently used for topic modelling. For instance, my colleague Emma Clarke, TCD and now NUIM, used Mallet to extract topics from the 19th-century transactions of the Royal Irish Academy (on JSTOR). Her related blog entry is available here. For a detailed description of how to set up and use Mallet I recommend the blog post on The Programming Historian.

Another piece of software that is quite popular for topic modelling in DH is the Topic Modelling Tool (TMT); its use is described, with examples, by Miriam on her DH blog.

After searching the internet for a while I also found a Python module, gensim, which claims to be ‘topic modelling for humans’. It is not as easy to use as the tools mentioned above, but its website offers a detailed tutorial, its developer Radim has answered questions in a number of online forums, Google groups etc., and the API is very well documented. At a later stage I will use Mallet in order to compare the results I get from gensim with another topic modelling tool.

Another Performance Test

Out of personal interest I made another performance test. This time I targeted a more complex function that was written to import a huge table of transcriptions and metadata about all the letters from an Excel file. The import from Excel is very easy thanks to Python’s xlrd module. Each row in the Excel file contains the transcription of a letter page and the metadata of the letter, which means that if a letter has more than one page, the metadata is duplicated. My function loops over the rows in the Excel table and creates Letter objects (my custom class to represent letters). The function merges multiple pages of the same letter into the same Letter object, tokenizes the transcriptions, and cleans the text of XML-like markup and punctuation. The final step is to store all the Letter objects in a file using Python’s shelve module. That way the objects do not have to stay in memory.
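
The core of the import, reading the rows with xlrd, looks roughly like the sketch below; the file name, the column positions and the stripped-down Letter class are placeholders for my actual code:

import xlrd

class Letter(object):
    # minimal stand-in for my actual Letter class
    def __init__(self, letter_id, transcription):
        self.letter_id = letter_id
        self.pages = [transcription]

    def add_page(self, transcription):
        self.pages.append(transcription)

book = xlrd.open_workbook(filename="letters1916.xls", encoding_override="utf-8")
sheet = book.sheet_by_index(0)

letters = {}
for row_idx in range(1, sheet.nrows):               # skip the header row
    row = sheet.row_values(row_idx)
    letter_id, transcription = row[0], row[1]       # placeholder column positions
    if letter_id in letters:
        letters[letter_id].add_page(transcription)  # merge pages of the same letter
    else:
        letters[letter_id] = Letter(letter_id, transcription)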

My approach for the storage is to save all the Letter objects in a dictionary, with the letter ids as keys, and then push the whole dictionary to the shelve file. This worked quite well and was not too slow (considering the number of letters and what the function has to do).

[Screenshot: cProfile statistics for the importer run (performance_3)]

The function took about 19 to 21 seconds (based on three tests) to run over roughly 2200 entries in the Excel file, create 850 letter objects, add the letters to a dictionary, clean the transcriptions and merge them if they are from the same letter, and finally store everything in a shelve file. This is not too bad considering that most of the time (13.3 seconds) was used by the xlrd.open_workbook(filename=file_path, encoding_override="utf-8") function, which retrieves the data from the Excel file and creates a handy-to-use workbook object. Python’s cp1252.py decode also took quite long. Furthermore, a lot of time was spent creating the letter objects (1.6 seconds) and cleaning and adding the transcriptions (2.6 seconds). There might be some opportunity for optimization; I might come back to revising this function later in the project.
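
The numbers come from profiling runs with cProfile; a minimal sketch of such a run, with the importer function name and file path as placeholders:

import cProfile
import pstats

# profile the importer and dump the statistics to a file
cProfile.run('import_letters("letters1916.xls")', 'importer.prof')

# print the ten most expensive calls, sorted by cumulative time
stats = pstats.Stats('importer.prof')
stats.sort_stats('cumulative').print_stats(10)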

The shelve file now contains a dictionary with all the letter objects. Out of curiosity I wrote a second function that does all the steps the first one does, but instead of creating a dictionary and then storing it, each letter object is stored in the shelve file as soon as it is created. The shelve file is, after all, dictionary-like. I assumed that it might take a bit longer, and was surprised how long it actually took.

[Screenshot: cProfile statistics for the per-object storage version (performance_4)]

Nearly 60 seconds!!! That was three times as long as the other function. After a look at the statistics returned by cProfile I found that most of the time was spent calling the keys() method on the shelve file in order to retrieve the keys (= letter ids) that were already stored. I am not sure why this happened. Is it because keys() returns a list instead of a faster data structure, like a set? Still, 40 seconds more is very significant. Could it have something to do with DeadlockWrap?


import shelve

d = shelve.open("letters.shelve")

# then follows the for loop that gets the data from the Excel file and prepares it for storage;
# inside that loop:
if letter_id not in d.keys():   # the interesting part: if the letter id is not already used
    d[letter_id] = l            # as a key in the shelve file, add the new letter object
    d.sync()                    # sync so the object is actually written to the file

d.close()                       # close the shelve file after the for loop

Interestingly, by replacing d.keys() with a set ‘s’ that stores all the keys of the letter objects (filled with s.update(letter_id)), the function took less than half the time, even though every object is still written to the shelve file directly:

[Screenshot: cProfile statistics for the set-based version (performance_5)]

It is still not as fast as my first solution, and probably never will be, because of the numerous calls to the shelve file and the constant syncing. It was however interesting to see how a small change like replacing d.keys() with the set ‘s’ can have such a tremendous effect.
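
For reference, a minimal sketch of the set-based variant; the row data and variable names are placeholders, and I use add() here to insert a single id:

import shelve

# placeholder rows standing in for the (letter id, Letter object) pairs from the importer
rows = [("L001", "letter object one"), ("L001", "letter object one, page 2"),
        ("L002", "letter object two")]

d = shelve.open("letters_test.shelve")
s = set()                            # keeps track of the letter ids already written

for letter_id, letter in rows:
    if letter_id not in s:           # membership test on the set instead of d.keys()
        d[letter_id] = letter        # write the new letter object to the shelve file
        d.sync()
        s.add(letter_id)

d.close()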

High Performance Text Processing: An example

It is absolutely amazing how much a program’s performance can be optimised. Or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens, stripped of punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the over 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (over several attempts).
[Screenshot: profiling output for the first approach (performance_1)]
This is only one function, and in anticipation that the letter corpus will grow over the next years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained everything I did not want. I split the text along whitespace and looped over each item in the list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python’s re module). It worked okay, but as I said, it was quite slow and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.

import re

def strip_punct_regex(strg):
    # split on whitespace and strip leading/trailing non-word characters from each token
    lst_words = strg.split()
    pat = r"[\W]*(\w+[\w'\-/.]*\w+|\w|&)[\W]*"
    regex = re.compile(pat)
    lst_clean = []
    for item in lst_words:
        mm = regex.match(item)
        if mm:
            lst_clean.append(mm.group(1).lower())
    return lst_clean

Thanks to regular expressions I came down from about 30 lines of code, two for loops and several if-else statements to around 10 lines, one for loop and one if statement. Although extensive use of regular expressions is sometimes discouraged, and there can be performance issues as well (see Dive into Python), in my case I found that they made my code much simpler and also quicker.
[Screenshot: profiling output for the regex version (performance_2)]
Online resources:
On performance with regular expressions see also: Python – Performance Tests of Regular Expressions
On the Python re module: the Python documentation; many examples are also on the website Python Module of the Week
YouTube video on High Performance Text Processing

Storing the Letter Objects: Python’s Shelve

Recently I looked into ways to store my Python Letter objects to a file after they are created. This has two advantages:

  1. Increased performance, because the 850 objects do not have to be kept in memory.
  2. The importer function takes quite a bit of time, about 20 seconds. If I want to run the whole program several times for testing, it is very annoying to wait 20 seconds for the importer on each run (the reason why it takes so long is discussed in another post). My data won’t change (at least not during testing), and therefore it is handy to load it directly from a file instead of running the importer module.

I found that Python’s pickle and shelve libraries were useful tools to work with. A good tutorial on shelve can be found in O’Reilly’s book Programming Python, or on the Module of the Week blog. The shelve module is great because it allows storing objects in a dictionary-like way, where the objects can be fetched by keys.
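
A minimal sketch of what this looks like, with a small placeholder dictionary standing in for the Letter objects:

import shelve

# placeholder dictionary standing in for the 850 Letter objects created by the importer
letters = {"L001": "first letter object", "L002": "second letter object"}

# store the whole dictionary under a single key
db = shelve.open("letters.shelve")
db["letters"] = letters
db.close()

# later, load it back without re-running the importer
db = shelve.open("letters.shelve")
letters = db["letters"]
db.close()
print(letters["L001"])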