I recently generated topics from the 1916 Letters project data using two different topic modelling tools: Mallet, a topic modelling program written in Java, and a script I wrote based on the Python topic modelling library Gensim. Mallet uses its own implementation of LDA, and so does Gensim, but Gensim also allows transformation to other models and has wrappers for other implementations. For instance, there is also a Mallet wrapper (since version 0.9.0), but I could not get it to work. Anyway, the point is that the standard Gensim implementation of LDA is different from Mallet's, and when I ran Gensim and Mallet on the 1916 Letters data I got different results. At first sight the computer-generated topics did not make much sense to me, but when I clustered the letters according to their relationships to the topics I found that similar letters would cluster together. That suggested that both Gensim and Mallet worked.
Here is a first attempt to generate 16 topics. I chose the number 16 because, at the moment, people who upload their letters to the Letters of 1916 website have to assign one of 16 predefined topics to their letter. Topics are, for instance, World War 1, Family life, Art and literature, etc. One of the research questions I am working on is whether the human-assigned topics and the computer-generated topics differ.
Here is my first Gensim and Mallet topic output:
In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another step that was suggested to me is to use a stemmer.
A stemmer is a program that reduces a word to its word stem or base form. For instance, in English a stemmer would remove suffixes such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’, ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. A problem, however, is that the stemmer can sometimes be too radical and produce changes such as ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not affect the analysis, but when presenting the results it might be worth restoring the correct spelling.
I implemented a stemmer from nltk.stem and saved a list of the original words and their stemmed forms to a file. This allowed me to spot stemming issues. The following is my stemming function:
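from nltk import stem
st = stem.PorterStemmer()
stem_words = []  # (original, stemmed) pairs
for w in wordlst:
    stem_words.append((w, st.stem(w)))  # assumed loop body: pair each word with its stem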
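The idea of pairing each word with its stem and saving the pairs for inspection could be sketched roughly as follows. This is a minimal sketch, not the project's actual function: the names `stem_pairs`, `write_stem_report` and `porter_stemmer`, as well as the tab-separated output format, are assumptions made for illustration. Only `PorterStemmer` from nltk.stem comes from the post. The stemmer is passed in as an argument, so any object with a `stem()` method will do.

```python
def stem_pairs(words, stemmer):
    """Return (original, stemmed) pairs, one per distinct word, in first-seen order."""
    seen = set()
    pairs = []
    for w in words:
        if w not in seen:
            seen.add(w)
            pairs.append((w, stemmer.stem(w)))
    return pairs


def write_stem_report(pairs, path):
    """Write the pairs the stemmer actually changed as 'original<TAB>stem' lines.

    Returns the list of changed pairs that was written.
    """
    changed = [(w, s) for w, s in pairs if w != s]
    with open(path, "w", encoding="utf-8") as f:
        for w, s in changed:
            f.write(f"{w}\t{s}\n")
    return changed


def porter_stemmer():
    """Build NLTK's Porter stemmer (imported lazily, so nltk is only needed here)."""
    from nltk.stem import PorterStemmer
    return PorterStemmer()
```

With nltk installed, a call such as `write_stem_report(stem_pairs(wordlst, porter_stemmer()), "stem_report.txt")` would produce the file (the file name is made up). Writing only the changed pairs is a design choice: it keeps the report short, so odd stems like ‘juli’ are easier to spot.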
PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and correct words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.
With regard to spelling, there are several issues that could have a negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters contain misspellings that the transcriber does not correct. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse levels of education and from different countries. Naturally, in some cases the spelling will differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.
My spell check function is included in the cleaner module and looks something like this at the moment:
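import enchant

with open(SPELL_CHECK_PWL, "r") as f:
    all_pwl = f.read().lower()
# Assumed step (missing from the excerpt): save the lower-cased PWL to temp_pwl_file
with open(temp_pwl_file, "w") as f:
    f.write(all_pwl)
d = enchant.DictWithPWL("en_US", temp_pwl_file)
err = []  # (word, first suggestion) pairs
for w in wordlst:
    if not d.check(w):
        suggestions = d.suggest(w)  # suggest() returns a list of candidates
        first_sug = suggestions[0] if suggestions else ""
        if w != first_sug.lower():
            err.append((w, first_sug))  # assumed: record the word and its suggestion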
The result will be a file that contains a list of suspected spelling errors and a guess for the correction. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker flags a word that is actually correct and that I do not want corrected.
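Maintaining the personal word list can itself be scripted. The helper below is a small sketch of that step, assuming the PWL is a plain text file with one word per line. The function name `add_to_pwl` is hypothetical and not part of my cleaner module. Words are lower-cased before they are stored, to match the lower-casing of the PWL in the spell check function.

```python
def add_to_pwl(pwl_path, word):
    """Append `word` (lower-cased) to the personal word list unless already listed.

    Returns True if the word was added, False if it was empty or already present.
    """
    word = word.strip().lower()
    if not word:
        return False
    try:
        with open(pwl_path, "r", encoding="utf-8") as f:
            content = f.read()
    except FileNotFoundError:
        content = ""  # no list yet: it is created on the first append
    known = {line.strip().lower() for line in content.splitlines() if line.strip()}
    if word in known:
        return False
    with open(pwl_path, "a", encoding="utf-8") as f:
        # Make sure the new word starts on its own line
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(word + "\n")
    return True
```

A call like `add_to_pwl(SPELL_CHECK_PWL, "Clonbrock")` would then stop the checker from flagging that name in later runs.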
A sample from the result file looks something like this:
Working with the 1916 data I found (as people with experience have always told me) that cleaning your data is an essential step. It could even be the most important step. Inconsistent, messy, and faulty data lead to problems and wrong results in the analysis and interpretation stages of your research.
With regard to the 1916 letters, wrong spelling, inconsistent markup and comments in the text, and inconsistent metadata are all sources of error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke, a former MPhil student here in TCD, recently did a topic modelling project, and talking to her and reading her MPhil thesis was very helpful. Furthermore, I found the O’Reilly Bad Data Handbook an interesting read.
Term Frequency Inverse Document Frequency, or tf-idf for short, is a way to measure how important a term is in the context of a document or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
With gensim the tf-idf can be calculated using gensim.models.tfidfmodel:
from gensim import models

# corpus: the bag-of-words corpus built earlier in the gensim tutorial
doc_bow = [(0, 1), (1, 1)]  # bag-of-words, for instance created by doc2bow
tfidf = models.TfidfModel(corpus)
print(tfidf[doc_bow])
# Result: [(0, 0.70710678), (1, 0.70710678)]
This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words” corpus, in which each document is a list of tuples of word id and frequency, is passed to the TfidfModel class, which transforms the values into “TfIdf real-valued weights”.
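To see what the transformation does, the weights can be reproduced by hand. The sketch below is not gensim's code. It assumes gensim's default scheme as far as I understand it: the raw term frequency is multiplied by log2(N/df), and each document vector is then scaled to unit length. The toy corpus and the function name `tfidf_weights` are made up for illustration.

```python
import math


def tfidf_weights(corpus, doc_bow):
    """Weight a bag-of-words document against a bag-of-words corpus.

    corpus  : list of documents, each a list of (word_id, count) tuples
    doc_bow : the document to transform, in the same format
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word id occur?
    df = {}
    for doc in corpus:
        for word_id, _count in doc:
            df[word_id] = df.get(word_id, 0) + 1

    # tf * idf with idf = log2(N / df); word ids unseen in the corpus are skipped
    weights = [
        (word_id, count * math.log2(n_docs / df[word_id]))
        for word_id, count in doc_bow
        if word_id in df
    ]
    # Drop zero weights, then scale the vector to unit (L2) length
    weights = [(word_id, w) for word_id, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _word_id, w in weights))
    return [(word_id, w / norm) for word_id, w in weights] if norm else []


# Toy corpus: word ids 0 and 1 each occur in two of the four documents
toy_corpus = [
    [(0, 1), (2, 1)],
    [(1, 2), (2, 1)],
    [(0, 1), (1, 1), (2, 3)],
    [(2, 1), (3, 1)],
]
print(tfidf_weights(toy_corpus, [(0, 1), (1, 1)]))
```

In this toy corpus both words occur once in the new document and in the same number of corpus documents, so they end up with equal weights of about 0.7071 each. A word that appears in every document, such as word id 2 here, gets an idf of zero and drops out altogether.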
Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured texts are, for instance, plain text files, such as the letters I am working on after the XML markup is removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project because it allows me to identify names and, if necessary, to build up a stopword list of names that need to be stripped in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter’s content.
Another problem is that at this stage a big part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon); for Lady Clonbrock’s correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
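Once tagger output is available, the name-stripping step could look roughly like this. It is a sketch under assumptions: the input is a list of (token, tag) pairs in which 'O' marks tokens that are not entities, which is the shape I would expect from a three-class NER model with PERSON, LOCATION and ORGANIZATION tags. The function names and the sample sentence are invented for illustration and are not project data.

```python
NAME_TAGS = {"PERSON", "LOCATION", "ORGANIZATION"}


def names_from_tagged(tagged_tokens, tags=NAME_TAGS):
    """Collect the lower-cased tokens that the tagger marked as named entities."""
    return {tok.lower() for tok, tag in tagged_tokens if tag in tags}


def strip_names(tokens, name_stopwords):
    """Remove every token whose lower-cased form is in the name stopword set."""
    return [tok for tok in tokens if tok.lower() not in name_stopwords]


# Hand-written example of tagger output (hypothetical, not real project data)
tagged = [
    ("Dear", "O"), ("Lady", "O"), ("Clonbrock", "PERSON"), ("thank", "O"),
    ("you", "O"), ("for", "O"), ("the", "O"), ("parcel", "O"),
    ("sent", "O"), ("from", "O"), ("Galway", "LOCATION"),
]
stopwords = names_from_tagged(tagged)
print(strip_names([tok for tok, _tag in tagged], stopwords))
```

Because the stopword set is applied to the whole corpus, a name that is also an ordinary word would be removed everywhere, so it is probably worth reviewing the generated list by hand before using it.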
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature’.
Like Jockers and Mimno, I am using the Stanford NLP software. It is a Java-based suite that includes different tools for Natural Language Processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly and there is a lot of documentation online. There are also several interfaces to other programming languages available. I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file in the NLTK module. On line 72 I simply added the path to def config_java():
def config_java(bin="C:/Program Files/Java/jdk1.8.0_05/bin/java.exe", options=None, verbose=True):
One issue that kept me occupied for a while was that I got a ‘Java command failed!’ error. Eventually I found that the problem was that I had config_java pointed to an older version of the JDK (1.7).
One of the investigations of my internship is into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship and Python has good libraries for natural language processing and topic modelling. I tested the nltk and gensim toolkits. The nltk is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O’Reilly book available via the NLTK website.
The gensim library is a library for ‘topic modelling for humans’, so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip and it is built on NumPy and SciPy, which have to be installed in order to use it.