When the Letters of 1916 corpus is clustered into the 16 topics generated with Gensim and Mallet, it seems that 16 topics might be too many. In one of my last posts I showed visualisations created with Gephi, in which I coloured the letter nodes based on the categories assigned by the person who uploaded the letter. Only letters assigned to four or five of these categories actually clustered together. So after talking with my internship supervisor Dermot, it was decided that I would reduce the number of topics to see what happens, creating visualisations for 4, 8 and 12 generated topics. I observed that with 4, 8 and 12 topics the clustering was still the same as with 16 topics. However, with fewer topics many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.
At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.
Mallet – 16 topics
Gensim – 16 topics
The Mallet output clearly shows a clustering of human-assigned topics (colours) around computer-generated topics (the black nodes, numbered Topic 0–15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: letters categorised as World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of letters in these two categories is quite close:
The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish Question are all over the place, and it is questionable whether these are useful categories. The remaining categories are not used much at the moment, and they are barely visible. However, they could become more important as the data set grows.
The visualisation of the Gensim topics is not so clear at first glance, because there are many more edges, but a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more to address information in the letters, and this influences the topic modelling process. This can be observed by looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI encoded, stripping out address information, salutations, etc. will be easier and much clearer topics can be generated.
Gephi is a suite for interactive visualisation of network data. It is very often used for topic modelling in the Digital Humanities. As an introduction I suggest just playing around with it; a good how-to read is Gephi for the historically inclined. The best approach, however, is to get a few data sets and just try to use Gephi. For examples see the following blogs:
Essentially, the challenge is to transform the output you get from Mallet or Gensim into useful input for Gephi (edges and nodes files). On his blog, Elijah goes into detail explaining how he visualised the Mallet output.
I wrote a function in my export/outputter module that converts Mallet output to Gephi edges data and saves it to a file. To view the module feel free to have a look at my project on GitHub.
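As a sketch of what such a conversion can look like: the function below assumes Mallet's --output-doc-topics format (one document per line: document number, source file, then alternating topic/proportion pairs) and writes a Gephi edges CSV. The column names and the weight threshold are my assumptions, not necessarily what the module on GitHub does.

```python
import csv

def mallet_to_gephi_edges(doc_topics_path, edges_path, threshold=0.1):
    # assumed Mallet --output-doc-topics layout:
    # <doc number> <source file> <topic> <proportion> <topic> <proportion> ...
    with open(doc_topics_path) as src, open(edges_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["Source", "Target", "Weight"])
        for line in src:
            if line.startswith("#"):  # skip the header line
                continue
            fields = line.split()
            doc_id, pairs = fields[1], fields[2:]
            for topic, weight in zip(pairs[::2], pairs[1::2]):
                # keep only reasonably strong document-topic links
                if float(weight) >= threshold:
                    writer.writerow([doc_id, "Topic" + topic, weight])
```

Filtering by a threshold keeps the edges file small; without it every document is connected to every topic, and the Gephi graph becomes unreadable.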
PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and "correct" words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.
With regard to spelling, there are several issues that could have a negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters themselves contain incorrect spelling that is not corrected by the transcriber. Furthermore, the letters were written at the beginning of the twentieth century, by people with very diverse educations and from different countries. Naturally, in some cases the spelling will differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.
My spell check function is included in the cleaner module and looks something like this at the moment:
import enchant

# read the personal word list (PWL) and normalise it to lower case
with open(SPELL_CHECK_PWL, "r") as f:
    all_pwl = f.read().lower()
with open(temp_pwl_file, "w") as f:
    f.write(all_pwl)

d = enchant.DictWithPWL("en_US", temp_pwl_file)
err = []
for w in wordlst:
    if not d.check(w) and d.suggest(w):
        first_sug = d.suggest(w)[0]  # take the first (best) suggestion
        if w != first_sug.lower():
            err.append((w, first_sug))
The result will be a file that contains a list of suspected spelling errors and a guess at a correction. The global variable SPELL_CHECK_PWL refers to a personal word list file. I add a word to the PWL every time the spell checker thinks a word is wrong but it is actually correct and I do not want it corrected.
A sample from the result file looks something like this:
Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured texts are, for instance, plain text files, such as the letters I am working on after the XML markup is removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project as it allows me to identify names and, if necessary, build up a stopword list of names that need to be stripped in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter's content.
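To illustrate how NER output can feed a name stopword list, here is a minimal sketch. The (token, label) pairs are hypothetical and stand in for the output of an NER tagger such as Stanford NER:

```python
# hypothetical NER output: (token, label) pairs; "O" marks a non-entity token
tagged = [("Dear", "O"), ("Peter", "PERSON"), ("I", "O"),
          ("write", "O"), ("from", "O"), ("Galway", "LOCATION")]

# collect the entity tokens into a stopword list ...
name_stopwords = {tok.lower() for tok, label in tagged if label != "O"}

# ... and strip them from the token stream before topic modelling
content_tokens = [tok for tok, _ in tagged if tok.lower() not in name_stopwords]
```

The stopword set can then be merged with the usual stopword list used during pre-processing, so that names never reach the topic modelling stage.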
Another problem is that at this stage a big part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon); for Lady Clonbrock's correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature’.
Like Jockers and Mimno, I am also using the Stanford NLP software. It is Java-based software that includes different tools for Natural Language Processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly and there is a lot of documentation online. There are also several interfaces to other programming languages available; I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file in the NLTK module: on line 72 I simply added the path to the JDK in def config_java():
One way to get faster performance when processing a big text corpus is to use streaming methods. Streaming basically means keeping the data stored in a file and accessing it when necessary, instead of keeping all data in memory.
Recently I looked into the gensim library, a library for topic modelling with Python, which provides easy ways to save/load text corpora, dictionaries etc. In their tutorial they also suggest creating a corpus object that uses a streaming method:
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
This corpus class reads the lines directly from a text file instead of keeping the whole text stored in memory. A MyCorpus instance is fairly small, because it holds just a reference to ‘mycorpus.txt’. This is very memory efficient.
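The same streaming idea works without gensim; below is a plain-Python sketch (class and file names are illustrative) where the object stores only the file path and tokenises one line at a time:

```python
class StreamingCorpus(object):
    """Streams one tokenised document per line instead of loading the file."""

    def __init__(self, path):
        self.path = path  # only the reference to the file is kept in memory

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                # one document per line, tokens separated by whitespace
                yield line.lower().split()
```

Because iteration re-reads the file each time, the instance can be passed around and iterated repeatedly without ever holding all documents in memory.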
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored. The TxtCorpus class retrieves these objects, or data stored in them. In my example below the method get_txt() returns the transcriptions:
class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def get_txt(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.transcription
It is absolutely amazing how much a program's performance can be optimised. Or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens, stripped of punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the over 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (over several attempts).
This is only one function, and in anticipation that the letter corpus will grow over the next years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained everything I did not want. I split the text along whitespace and looped over each item in the list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python's re module). It worked okay, but as I said before, it was quite slow and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.
import re

lst_words = strg.split()
# raw string; the hyphen inside the character class is escaped so it is
# not interpreted as a range
pat = r"\W*(\w+[\w'\-/.]*\w+|\w|&)\W*"
regex = re.compile(pat)
lst_clean = []
for item in lst_words:
    mm = regex.match(item)
    if mm:
        lst_clean.append(mm.group(1))
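Wrapped in a small function (names are illustrative, and the hyphen in the character class is escaped), the compiled pattern behaves like this:

```python
import re

pat = r"\W*(\w+[\w'\-/.]*\w+|\w|&)\W*"
regex = re.compile(pat)

def clean_word_tokens(strg):
    # keep only the word core of each whitespace-separated item
    lst_clean = []
    for item in strg.split():
        mm = regex.match(item)
        if mm:  # items that are pure punctuation produce no match
            lst_clean.append(mm.group(1))
    return lst_clean

tokens = clean_word_tokens('"Dear Peter," -- she wrote.')
# tokens is now ['Dear', 'Peter', 'she', 'wrote']
```

Note that an item consisting only of punctuation, such as '--', is silently dropped, which is exactly the desired behaviour for the word count.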
The Letters of 1916 project is a crowdsourced Digital Scholarly Editing project, and the transcribers are encouraged to mark up the letter transcriptions with TEI/XML tags. The TEI markup should eventually be used to identify features such as line breaks, titles and headers, but also names of people, organisations and places. Because it is assumed that most transcribers have no previous experience with TEI or XML, an editor with a toolbar is part of the transcription interface to guide transcribers towards the correct TEI elements.
One of my first tasks was to have a look at the crowdsourced transcriptions and find out to what extent they were marked up. It was interesting to find that a lot of markup was in place. My replacement function counted 7395 start-tags, end-tags and XML comments. However, when this is related to the 166149 word tokens in the letters, the amount of encoded text no longer seems so large. The numbers cannot be directly related, but if we assume that there could be a line break at least every 10 words, we get a ratio of about 45% markup. Again, this is highly speculative, because close investigation of individual letters shows that some are encoded in great detail (using the TEI del element, add element, line-break element, hi element and others), while others contain no tags at all.
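A counting function along these lines can be sketched with a regular expression; the pattern below (matching start-tags, end-tags and XML comments) is my reconstruction, not necessarily the replacement function used in the project:

```python
import re

# matches XML comments, and start-/end-tags including attributes
TAG_RE = re.compile(r"<!--.*?-->|</?[A-Za-z][^>]*>", re.DOTALL)

def count_markup(text):
    # counts start-tags, end-tags and XML comments in a transcription
    return len(TAG_RE.findall(text))
```

The comment alternative comes first so that tags quoted inside a comment are not counted separately.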
The next step was to test whether the transcriptions were well-formed XML and could be parsed with one of Python's libraries. I used the lxml library for this task, and found that over 40% of the letters would throw an XMLSyntaxError. In most cases this was due to the use of ‘&’ instead of the entity ‘&amp;’. After I had dealt with this problem by replacing all ‘&’ before trying to parse the transcription strings to XML, I still counted about 100 XMLSyntaxErrors out of 850 letters. In most cases this was due to not-well-formed XML: opening tags without closing tags or (less commonly) overlapping elements.
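The check itself is short. The project uses lxml, but the same idea can be sketched with the standard library's ElementTree, wrapping the transcription in a dummy root element since a letter fragment is not a single-rooted XML document:

```python
import xml.etree.ElementTree as ET

def is_well_formed(transcription):
    # escape bare ampersands first, mirroring the pre-processing step above
    text = transcription.replace("&", "&amp;")
    try:
        ET.fromstring("<letter>" + text + "</letter>")
        return True
    except ET.ParseError:
        return False
```

Counting the False results over all 850 transcriptions gives the error rate described above.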
The Letters of 1916 project was officially launched on Research Night, 27 September 2013. Recently the project moved from Trinity College Dublin to its next phase at An Foras Feasa, the Digital Humanities centre at the National University of Ireland Maynooth (NUIM). Following this move, the ‘Kildare Launch’ of the project took place on 8 May 2014 at NUIM. The evening started with an encoding and digital imaging lab. This was a great chance for the audience to get an introduction to how everyone can contribute to the Letters project by transcribing or uploading letters. The labs were followed by talks by Professor Susan Schreibman, Robert Doyle, Dr Brian Hughes and Lar Joye. Videos of the presentations should be available soon on the Letters 1916 homepage.
During the internship at the Center of High Performance Computing and the Letters of 1916 project I will build a text analysis tool for the online letter collection. The structure of this analysis tool can be roughly divided into three phases/steps: import of data – text analysis – visual output.
Each of these steps is a challenging task and already from the beginning a number of issues are apparent:
Data Import: The letters are all encoded in some form of TEI/XML. But because this is a crowdsourcing project, the data is certainly messy and it is not clear what is encoded and how consistently. The same goes for the metadata. It will therefore be interesting to see how helpful the TEI markup will be in the final text analysis.
Text Analysis and Visual Output: As a first step, the text analysis tool will just produce a histogram-like word count and frequency distribution. For the text processing part it will be important to clean the text of punctuation and markup to allow proper tokenisation into words.
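That first step can be sketched in a few lines (the function name and the punctuation set are my choices):

```python
from collections import Counter

def word_frequencies(text):
    # strip surrounding punctuation and lower-case each token before counting
    tokens = (w.strip('.,;:!?"()').lower() for w in text.split())
    return Counter(t for t in tokens if t)

freq = word_frequencies("Dear Peter, the rising has begun. The city is quiet.")
# freq.most_common(1) is [('the', 2)]
```

A Counter like this is directly usable for a histogram: most_common() returns the (word, count) pairs in descending order of frequency.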
…and there will be more challenges ahead as the internship progresses.