Tag Archives: Text Processing

Close of my internship

My internship finished at the end of July, and I spent the final week wrapping everything up: creating a few nice visualisations of my dataset, cleaning up my Python scripts, and writing a report.

The Python scripts that I used for extracting the texts and metadata and for cleaning the texts are available on GitHub. For topic modelling I used Mallet and Gensim. Mallet is a Java tool, while Gensim is a Python library. My Gensim implementation can also be found on GitHub.

I started my internship knowing little about topic modelling and related tools. During my internship I learned about topic modelling as a tool for investigating a large corpus of text, and about the benefits and pitfalls of using this technique. I explored my data set using tools such as Mallet, Gensim, NLTK, and Gephi. I learned more about Python programming and about how to optimise programs and make them faster. Finally, I also learned a good deal about my data set, the Letters of 1916, and the issues involved in working with it. I wrote a short internship report for DAH focusing on the objectives of the internship and my learning outcomes.

My internship report as a PDF: ReportForDAHJuly2014


The Humanities Programmer

Following a comment by Alex O’Connor I pushed all my code up to GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody would be interested in studying how I write the code for this project. Thinking about it more closely, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. The ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ was considered essential, but tasks like text formatting also required programming knowledge at that time. One article also emphasizes that, in the future, programming languages for “complex pattern-matching problems that arise in musical or graphic art analysis” would become important too. A 1971 article in the same journal gives an overview of languages ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years among humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, while Python is becoming more popular as a language for Natural Language Processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars, such as the 2013 DARIAH Summer School in Goettingen, this year’s Summer School in Goettingen, or the ESU in Leipzig. The Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, has also moved from teaching Java to teaching Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for a single research task), it becomes relevant to explore how these scholars, often without a CS background, write code and build software. Such studies will contribute to future developments of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project progresses, and some might even be interested in contributing.


Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured texts are, for instance, plain text files, such as the letters I am working on after the XML markup has been removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project as it allows me to identify names and, if necessary, build up a stopword list of names to be stripped out in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter’s content.
Another problem is that at this stage a big part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon); for Lady Clonbrock’s correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature‘.

Like Jockers and Mimno, I am using the Stanford NLP software. It is a Java-based suite that includes different tools for Natural Language Processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly, and there is a lot of documentation online. There are also several interfaces to other programming languages available; I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file of the NLTK module. On line 72 I simply added the path to def config_java():

def config_java(bin="C:/Program Files/Java/jdk1.8.0_05/bin/java.exe", options=None, verbose=True):

One issue that kept me occupied for a while was a ‘Java command failed!’ error. After a while I found that the problem was that config_java pointed to an older version of the JDK (1.7).
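For illustration, here is a rough sketch of how the tagger can be called through NLTK and how person names could then be collected for a stopword list. The paths to the NER model and jar are placeholders for wherever the Stanford download is unpacked, and in older NLTK versions the class is called NERTagger rather than StanfordNERTagger:

from nltk.tag.stanford import StanfordNERTagger

# placeholder paths to the downloaded model and jar file
st = StanfordNERTagger(
    'stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner/stanford-ner.jar')

tokens = "Dear Lady Clonbrock I write to you from Galway".split()
tagged = st.tag(tokens)  # list of (token, tag) pairs, e.g. ('Clonbrock', 'PERSON')

# collect person names so they can later be added to a stopword list
name_stopwords = set(word.lower() for word, tag in tagged if tag == 'PERSON')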

Better Performance – Text Streaming

One way to get faster performance when processing a big text corpus is to use streaming methods. Streaming basically means keeping the data stored in a file and accessing it only when necessary, instead of keeping all the data in memory.

Recently I looked into the gensim library, a library for topic modelling with Python, which provides easy ways to save and load text corpora, dictionaries, etc. In their tutorial they also suggest creating a corpus object that uses a streaming method:

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

This corpus class reads the lines directly from a text file instead of keeping the whole text in memory. A MyCorpus instance is fairly small, because it holds just a reference to ‘mycorpus.txt’, which makes it very memory efficient.
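As a small usage sketch (following the gensim tutorial; ‘mycorpus.txt’ is the tutorial’s example file and the dictionary is built in the same streaming fashion), the corpus can be iterated lazily or serialized to disk:

from gensim import corpora

# build the dictionary by streaming over the same file, one line at a time
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

corpus_stream = MyCorpus()        # no documents are loaded at this point
for bow in corpus_stream:         # bag-of-words vectors are produced one document at a time
    print(bow)

# optionally write the streamed corpus to disk in Matrix Market format
corpora.MmCorpus.serialize('mycorpus.mm', corpus_stream)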
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored. The TxtCorpus class retrieves the objects, or data stored in them. In my example below the method get_txt() returns the transcriptions:

class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
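The item_from_pickle helper is part of my own code on GitHub; a hypothetical, simplified version might look like this (the file name below is just a placeholder):

import pickle

def item_from_pickle(file_name):
    # hypothetical helper: load the pickled dictionary of Letter objects from disk
    with open(file_name, 'rb') as f:
        return pickle.load(f)

# the corpus then streams one transcription at a time instead of
# keeping all Letter objects in memory
corpus = TxtCorpus('letters.pickle')   # placeholder file name
for transcription in corpus:
    print(transcription[:80])          # e.g. preview the first 80 characters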

Topic modelling Tools

I had a look at a number of topic modelling tools. The first was Mallet, a tool frequently used for topic modelling. For instance, my colleague Emma Clarke (TCD, now NUIM) used Mallet to extract topics from the 19th-century transactions of the Royal Irish Academy (on JSTOR). Her related blog entry is available here. For a detailed description of how to set up and use Mallet I recommend the blog post on The Programming Historian.

Another tool that is quite popular for topic modelling in DH is the Topic Modelling Tool (TMT); its use is described, with examples, by Miriam on her DH blog.

After searching the internet for a while I also found a Python module, “gensim”, which claims to be “topic modelling for humans”. It is not as easy to use as the tools mentioned above, but there is a detailed tutorial on its website, its developer Radim answers questions in a number of online forums, Google Groups, etc., and the API is very well documented. At a later stage I will use Mallet in order to compare the results I get from gensim with another topic modelling tool.
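To give an impression of what working with gensim looks like, here is a minimal sketch of training an LDA model on a few toy documents; the documents and parameter values are just placeholders, not my actual settings:

from gensim import corpora, models

# toy tokenised documents standing in for cleaned letter transcriptions
texts = [['dear', 'mother', 'dublin', 'rising'],
         ['rifle', 'barracks', 'soldier', 'dublin'],
         ['dear', 'sister', 'prison', 'release']]

dictionary = corpora.Dictionary(texts)                  # maps tokens to ids
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

# train an LDA topic model; num_topics and passes are arbitrary example values
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic in lda.print_topics():
    print(topic)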

High Performance Text Processing: An example

It is absolutely amazing how much a program’s performance can be optimised. Or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens, stripped of punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the over 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (over several attempts).
[Image: performance_1]
This is only one function, and since the letter corpus is expected to grow over the next years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained everything I did not want. I split the text on whitespace and looped over each item in the list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python’s re module). It worked okay, but as I said before, it was quite slow, and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.

import re

def strip_punct_regex(strg):
    # split on whitespace, then keep the word core of each token and
    # drop leading and trailing non-word characters
    lst_words = strg.split()
    pat = "[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*"
    regex = re.compile(pat)
    lst_clean = []
    for item in lst_words:
        mm = regex.match(item)
        if mm:
            lst_clean.append(mm.group(1).lower())
    return lst_clean

From about 30 lines of code, two for loops and several if-else statements, thanks to regular expressions I came down to 10 lines, one for loop and one if statement. Although extensive use of regular expressions is sometimes discouraged and can itself cause performance issues (see Dive Into Python), in my case I found that it made my code much simpler and also quicker.
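For anyone who wants to reproduce this kind of comparison, Python’s timeit module is a quick way to time a function; the sample string below is only a placeholder, not one of the letters:

import timeit

sample = "Dear Mr. O'Brien, I received your <lb/> letter yesterday."   # placeholder text
print(timeit.timeit(lambda: strip_punct_regex(sample), number=10000))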
[Image: performance_2]
Online resources:
On performance with regular expressions see also: Python – Performance Tests of Regular Expressions
On the Python re module: the Python documentation; many examples are also on the Python Module of the Week website
YouTube video on High Performance Text Processing

Storing the Letter Objects: Python’s Shelve

Recently I looked into ways to store my Python Letter objects to a file after they are created. This has two advantages:

  1. Increased performance, because the 850 objects do not have to be kept in memory.
  2. The importer function takes quite a bit of time – about 20 seconds. If I want to run the whole program several times for testing, it is very annoying to wait 20 seconds for the importer on each run (the reason why it takes so long is discussed in another post). My data won’t change (at least not during testing), and therefore it is handy to load it directly from a file instead of running the importer module.

I found that Python’s pickle and shelve modules were useful tools to work with. A good tutorial on shelve can be found in O’Reilly’s book Programming Python, or on the Module of the Week blog. The shelve module is great because it allows you to store objects in a dictionary-like way, where they can be fetched by keys.
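As a minimal sketch of the approach (the file name and key are placeholders, and the string value stands in for an actual Letter instance), storing and retrieving objects with shelve looks roughly like this:

import shelve

# 'letters' stands in for my dictionary of Letter objects keyed by id
letters = {'letter_1916_0001': 'placeholder for a Letter instance'}

# store each object under its key so it can be fetched individually later
db = shelve.open('letters_shelf')        # placeholder file name
for key, letter in letters.items():
    db[key] = letter
db.close()

# later: open the shelf again and fetch a single letter by key,
# without loading all 850 objects into memory
db = shelve.open('letters_shelf')
one_letter = db['letter_1916_0001']
db.close()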