When the Letters of 1916 corpus is clustered into the 16 topics generated with Gensim and Mallet, 16 topics seem to be too many. In one of my last posts I showed visualisations created with Gephi, in which I colored the letter nodes according to the category assigned by the person who uploaded the letter. Only letters assigned to four or five of these categories actually clustered together. After I talked with my internship supervisor Dermot, we decided that I would reduce the number of topics to see what happens, and create visualisations for 4, 8 and 12 generated topics. I could observe that with 4, 8 and 12 topics the clustering was still the same as with 16 topics. However, with fewer topics many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.
At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.
Mallet – 16 topics
Gensim – 16 topics
The Mallet output clearly shows a clustering of human-assigned topics (colors) around computer-generated topics (the black nodes, numbered Topic 0 to 15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of letters in these two categories is quite close:
The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish question are all over the place, and it is questionable whether these are useful categories. The remaining categories are not used much at the moment and are barely visible. However, they could become more important when the data set grows.
The visualisation of the Gensim topics is not so clear at first glance, because there are many more edges. But a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more strongly to address information in the letters, and this influences the topic modelling process. This can be observed when looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI encoded, stripping out address information, salutations, etc. will be easier and much clearer topics can be generated.
Gephi is a suite for interactive visualisation of network data. It is very often used for topic modelling in the Digital Humanities. As an introduction I suggest just playing around with it; a how-to reading would be Gephi for the historically inclined. The best way, however, is to get a few data sets and just try to use Gephi. For examples see the following blogs:
The main challenge is to transform the output you get from Mallet or Gensim into useful input for Gephi (edges and nodes files). On his blog Elijah goes into detail explaining how he visualised the Mallet output.
I wrote a function in my export/outputter module that converts Mallet output to Gephi edges data and saves it to a file. To view the module feel free to have a look at my project on GitHub.
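Roughly, such a conversion could look like the sketch below. It is not the actual function from my module. It assumes a tab-separated Mallet document-topics file in which each line holds a document index, a document name and then one proportion per topic in topic order (older Mallet versions write topic/proportion pairs instead). The function name, the column names and the 0.1 threshold are chosen for illustration only.

```python
import csv


def doc_topics_to_edges(doc_topics_path, edges_path, min_weight=0.1):
    """Write a Gephi edges CSV (Source, Target, Weight) from Mallet doc-topics output.

    Assumed input layout (tab-separated): doc index, doc name, then one
    proportion per topic in topic order. Lines starting with '#' are skipped.
    Returns the number of edges written.
    """
    n_edges = 0
    with open(doc_topics_path, encoding="utf-8") as src, \
            open(edges_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["Source", "Target", "Weight"])
        for line in src:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            doc_name = fields[1]
            for topic_no, value in enumerate(fields[2:]):
                weight = float(value)
                # Keep only reasonably strong document-topic links,
                # otherwise every letter is connected to every topic.
                if weight >= min_weight:
                    writer.writerow([doc_name, "Topic {}".format(topic_no), weight])
                    n_edges += 1
    return n_edges
```

The threshold matters for readability of the graph: with a very low value the visualisation gets as many edges as there are letter and topic combinations.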
PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and "correct" words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.
With regard to spelling there are several issues that could have a negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters contain misspellings that the transcriber does not correct. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse educational backgrounds and from different countries. Naturally, in some cases the spelling will differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.
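My spell check function is included in the cleaner module and looks something like this at the moment:
import enchant

def spell_check(wordlst, temp_pwl_file):  # name and signature assumed
    with open(SPELL_CHECK_PWL, "r") as f:
        all_pwl = f.read().lower()
    with open(temp_pwl_file, "w") as f:  # assumed step: lower-cased copy of the PWL
        f.write(all_pwl)
    d = enchant.DictWithPWL("en_US", temp_pwl_file)
    err = []
    for w in wordlst:
        if not d.check(w):
            suggestions = d.suggest(w)
            first_sug = suggestions[0] if suggestions else ""
            if w != first_sug.lower():
                err.append((w, first_sug))  # misspelling and best guess
    return err  # written to the result file afterwards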
The result will be a file that contains a list of suspected spelling errors and a guess at the correct spelling for each. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker flags a word as wrong that is actually correct and that I do not want corrected.
A sample from the result file looks something like this:
Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured texts are, for instance, plain text files such as the letters I am working on after the XML markup is removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project because it allows me to identify names and, if necessary, to build up a stopword list of names that need to be stripped in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter's content.
Another problem is that at this stage a big part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon). For Lady Clonbrock’s correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
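How a name stopword list could be derived from NER output is sketched below. The sketch assumes that the tagger returns (token, label) pairs with labels such as PERSON and LOCATION, which is the form NER output usually takes. The two function names and the hand-written sample are hypothetical and only serve as illustration; the sample is not real tagger output.

```python
def names_from_tagged(tagged_tokens, labels=("PERSON", "LOCATION")):
    """Collect the lower-cased tokens whose NER label is one of `labels`."""
    return {token.lower() for token, label in tagged_tokens if label in labels}


def strip_names(tokens, name_stopwords):
    """Drop every token that appears in the name stopword set."""
    return [t for t in tokens if t.lower() not in name_stopwords]


# Hand-written stand-in for tagger output (not produced by a real tagger run)
tagged = [("Dear", "O"), ("Peter", "PERSON"), ("the", "O"), ("regiment", "O"),
          ("left", "O"), ("Galway", "LOCATION"), ("yesterday", "O")]

stopwords = names_from_tagged(tagged)
content = strip_names([token for token, _ in tagged], stopwords)
print(content)
```

Building the stopword list once over the whole corpus and saving it would also allow a human check before any name is actually removed.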
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature’.
Like Jockers and Mimno I am also using the Stanford NLP software. It is a Java-based software suite that includes different tools for Natural Language Processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly and there is a lot of documentation online. There are also several interfaces to other programming languages available. I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file in the NLTK module. On line 72 I simply added the path to def config_java():
One way to get better performance when processing a big text corpus is to use streaming methods. Streaming basically means keeping the data stored in a file and accessing it only when necessary, instead of keeping all data in memory.
Recently I looked into the gensim library, a library for topic modelling with Python, which provides easy ways to save and load text corpora, dictionaries etc. In its tutorial the authors also suggest creating a corpus object that uses a streaming method:
class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):  # one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
This corpus class reads the lines directly from a text file instead of keeping the whole text in memory. A MyCorpus instance is fairly small, because it holds just a reference to ‘mycorpus.txt’, which makes it very memory efficient.
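The same idea can be tried out without gensim. The following is a minimal sketch; the class name LineCorpus and the throwaway sample file are made up for the demonstration. It yields token lists instead of bag-of-words vectors, but the streaming behaviour is the same.

```python
import os
import tempfile


class LineCorpus:
    """Yield one tokenised document per line without loading the whole file."""

    def __init__(self, path):
        self.path = path  # only the path is kept in memory

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Small demonstration with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Dear Mother\nI received your letter\n")

corpus = LineCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)  # a new file handle is opened for every pass
os.remove(tmp.name)
```

Using a class with an __iter__ method rather than a plain generator function means the corpus can be iterated more than once, which matters when a model makes several passes over the data.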
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file. Instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored. The TxtCorpus class retrieves the objects, or data stored in them. In my example below the method get_txt() returns the transcriptions:
class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name
    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
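One caveat: if item_from_pickle() unpickles the whole dictionary at once, all letters are in memory while the loop runs, so this is less of a stream than the text file version. A possible alternative is sketched below, under the assumption that the letters can be pickled one after another into a single file. The function names are hypothetical and plain dicts stand in for Letter instances.

```python
import os
import pickle
import tempfile


def dump_items(items, path):
    """Pickle each item separately, one after another, into a single file."""
    with open(path, "wb") as handle:
        for item in items:
            pickle.dump(item, handle)


def iter_items(path):
    """Yield the pickled items one at a time until the end of the file."""
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:  # raised by pickle.load() when nothing is left
                break


# Plain dicts stand in for Letter instances here
letters = [{"id": 1, "txt": "Dear Mother"}, {"id": 2, "txt": "Dear Sir"}]
path = os.path.join(tempfile.mkdtemp(), "letters.pickle")
dump_items(letters, path)
texts = [letter["txt"] for letter in iter_items(path)]
```

With this layout only one letter object is unpickled at a time, at the price of losing direct lookup by key.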
It is absolutely amazing how much a program's performance can be optimised. Or rather, how much slower a badly written function can be.
My first attempt at creating a list of clean word tokens, stripped of punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the more than 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (over several attempts).
This is only one function, and given that the letter corpus will grow over the next years, 8 seconds is too long.
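Run times like these can be measured with Python's timeit module. A minimal sketch follows, in which tokenise() and the sample text are placeholders and not my actual cleaning function or corpus:

```python
import timeit


def tokenise(text):
    """Stand-in for the cleaning function whose run time is measured."""
    return text.split()


sample = "Dear Mother, I received your letter yesterday. " * 1000

# repeat() returns one total per run; the minimum is the least disturbed one
timings = timeit.repeat(lambda: tokenise(sample), number=10, repeat=3)
best = min(timings) / 10  # seconds per call
print("best of 3: {:.6f} s per call".format(best))
```

Taking the best of several runs reduces the influence of whatever else the machine happens to be doing at the time.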
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained all the characters I did not want. I split the text up along whitespace and looped over each item in the list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python re module). It worked okay, but as I said before, it was quite slow and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.
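import re

lst_words = strg.split()  # strg is the text of one letter
pat = r"[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*"
regex = re.compile(pat)  # compile once, reuse for every word
lst_clean = []
for item in lst_words:
    mm = regex.match(item)
    if mm:  # assumed continuation: keep the captured word
        lst_clean.append(mm.group(1))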
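To see what the pattern keeps and what it drops, here is a small self-contained sketch with a made-up sample string. The function name clean_tokens and the choice to keep group 1 of each match are assumptions for illustration. Note that inside the character class `'-/` is a range from `'` to `/`, so characters such as `(`, `)`, `*`, `+` and `,` are also accepted in the middle of a token.

```python
import re

# Same pattern as in the post, written as a raw string
TOKEN = re.compile(r"[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*")


def clean_tokens(text):
    """Return the core of each whitespace-separated item; skip items without one."""
    cleaned = []
    for item in text.split():
        match = TOKEN.match(item)
        if match:
            cleaned.append(match.group(1))
    return cleaned


tokens = clean_tokens('"Dear Mother, I don\'t know... -- & a,b well-known.')
print(tokens)
# ['Dear', 'Mother', 'I', "don't", 'know', '&', 'a,b', 'well-known']
```

Leading and trailing punctuation is stripped, items that consist only of punctuation are skipped, and 'a,b' survives because of the range mentioned above.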