When the Letters of 1916 corpus is clustered into the 16 topics generated with Gensim and Mallet, it seems that 16 topics might be too many. In one of my last posts I showed visualisations created with Gephi, in which I colored the letter nodes according to the category assigned by the person who uploaded the letter. Only letters assigned to four or five of these categories actually clustered together. So after I talked with my internship supervisor Dermot, it was decided that I would try to reduce the number of topics to see what happens, and create visualisations for 4, 8 and 12 generated topics. I could observe that with 4, 8 and 12 topics the clustering was still the same as with 16 topics. However, with fewer topics it becomes visible that many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.
At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.
Mallet – 16 topics
Gensim – 16 topics
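Graphs like these, with letters and topics as nodes, can be prepared for Gephi as a simple edge list. The following is only a minimal sketch under stated assumptions, not the script used for the figures: the function names, the letter ids and the 0.1 weight threshold are hypothetical. The input is assumed to be a per-letter list of (topic id, weight) pairs, as both Gensim and Mallet can provide in some form. The CSV columns Source, Target and Weight are the ones Gephi's spreadsheet import recognises for edges.

```python
import csv
import io


def topic_edges(doc_topics, min_weight=0.1):
    """Yield (letter, topic node, weight) edges above a weight threshold.

    doc_topics maps a letter id to a list of (topic_id, weight) pairs.
    The threshold is an assumption; it only keeps the graph readable.
    """
    for letter_id, topics in doc_topics.items():
        for topic_id, weight in topics:
            if weight >= min_weight:
                yield (letter_id, f"Topic {topic_id}", round(weight, 3))


def write_edge_list(doc_topics, handle, min_weight=0.1):
    """Write the edges as CSV with the header Gephi expects for edge tables."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(topic_edges(doc_topics, min_weight))


# Invented example data: two letters and their topic weights.
example = {
    "letter_1": [(0, 0.8), (3, 0.05)],
    "letter_2": [(0, 0.4), (5, 0.6)],
}
buffer = io.StringIO()
write_edge_list(example, buffer)
print(buffer.getvalue())
```

Raising the threshold removes weak letter-topic edges, which is one possible way to make a dense graph easier to read.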
The Mallet output clearly shows a clustering of human-assigned categories (colors) around computer-generated topics (the black nodes, numbered Topic 0 – 15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of letters in these two categories is quite close:
The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish question are all over the place, and it is questionable whether these are useful categories. The remaining categories are not used much at the moment, and they are not really visible. However, they could become more important when the data set grows.
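The visual impression that some categories cluster and others scatter could also be checked with a simple cross-tabulation of human-assigned category against the strongest computer-generated topic of each letter. This is a rough sketch only: the function names and the sample letters are invented for illustration, and "strongest topic" is an assumption about how to reduce a topic distribution to one label.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_weights):
    """Return the id of the topic with the highest weight for one letter."""
    return max(topic_weights, key=lambda pair: pair[1])[0]


def crosstab(letters):
    """Count, per category, how often each topic is the dominant one.

    letters is an iterable of (category, [(topic_id, weight), ...]).
    """
    table = defaultdict(Counter)
    for category, topic_weights in letters:
        table[category][dominant_topic(topic_weights)] += 1
    return table


def purity(topic_counts):
    """Share of a category's letters that fall into its most common topic."""
    return max(topic_counts.values()) / sum(topic_counts.values())


# Invented sample: categories as assigned by uploaders, weights made up.
letters = [
    ("World War 1", [(0, 0.7), (1, 0.3)]),
    ("World War 1", [(0, 0.6), (2, 0.4)]),
    ("Family life", [(0, 0.55), (1, 0.45)]),
    ("Easter Rising", [(1, 0.6), (2, 0.4)]),
    ("Easter Rising", [(2, 0.9), (0, 0.1)]),
]
table = crosstab(letters)
for category, counts in table.items():
    print(category, dict(counts), round(purity(counts), 2))
```

A category whose letters are spread evenly over many topics gets a low value, which would match the nodes that appear "all over the place" in the graph.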
The visualisation of the Gensim topics is not so clear at first glance, because there are many more edges. But a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more strongly to address information in the letters, and this influences the topic modelling process. This can be observed when looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI encoded, stripping out address information, salutations, etc. will be easier, and much clearer topics can be generated.
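Once the letters are TEI encoded, stripping these parts could look roughly like the sketch below. It is an assumption about the future encoding, not the project's actual schema: it supposes that addresses, salutations and signatures are marked up with the standard TEI elements opener, closer, address, salute, dateline and signed. The function name and the sample letter are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Elements assumed to hold address and salutation material.
SKIP = {"opener", "closer", "address", "salute", "dateline", "signed"}


def body_text(elem):
    """Collect the text below elem, leaving out everything inside SKIP elements."""
    local_name = elem.tag.split("}")[-1]  # drop the namespace prefix
    if local_name in SKIP:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(body_text(child))
        parts.append(child.tail or "")  # text after the child belongs to the parent
    return "".join(parts)


# Invented sample letter, only to show the effect.
sample = f"""<text xmlns="{TEI_NS}"><body><div type="letter">
<opener><address><addrLine>12 Example Street</addrLine><addrLine>Dublin</addrLine></address>
<salute>Dear Mary,</salute></opener>
<p>The post has been slow since Easter.</p>
<closer><salute>Yours,</salute><signed>John</signed></closer>
</div></body></text>"""

root = ET.fromstring(sample)
clean = " ".join(body_text(root).split())  # normalise whitespace
print(clean)
```

Only the cleaned text would then be passed on to the topic modelling step.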
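from gensim import models
# corpus: the bag-of-words corpus built earlier in the gensim tutorial
tfidf = models.TfidfModel(corpus)  # fit the model on the corpus
doc_bow = [(0, 1), (1, 1)]  # bag-of-words, for instance created by doc2bow
print(tfidf[doc_bow])  # apply the transformation to one document
# Result: [(0, 0.70710678), (1, 0.70710678)]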
This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words” representation, a list of tuples of word id and frequency, is used as the corpus, and the TfidfModel class transforms the values into “TfIdf real-valued weights”.
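To make the arithmetic behind such weights visible, here is a small pure-Python sketch that does not use gensim at all. It assumes the weighting that, as far as I understand, gensim applies by default: raw term frequency multiplied by log2(number of documents / document frequency), followed by scaling each document vector to unit length. The function name tfidf_weights and the tiny corpus are invented for illustration, and details such as gensim dropping near-zero weights are left out.

```python
import math
from collections import Counter


def tfidf_weights(corpus, doc_bow):
    """Return unit-length TF-IDF weights for one bag-of-words document.

    corpus and doc_bow use (word_id, frequency) tuples, as in the example above.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word id occur?
    doc_freq = Counter(word_id for doc in corpus for word_id, _ in doc)
    weighted = [
        (word_id, freq * math.log2(n_docs / doc_freq[word_id]))
        for word_id, freq in doc_bow
        if doc_freq[word_id] > 0
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0:
        return []
    return [(word_id, weight / norm) for word_id, weight in weighted]


# Invented corpus of four documents; word ids 0, 1 and 2 each occur in two of them.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(1, 1), (2, 1)],
    [(3, 2)],
]
print(tfidf_weights(corpus, [(0, 1), (1, 1)]))
```

Because both words occur equally often in the document and in the same number of documents, they receive the same weight, and after scaling to unit length each ends up at 1/sqrt(2), about 0.7071.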
One strand of my internship is an investigation into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship and Python has good libraries for natural language processing and topic modelling. I tested the nltk and the gensim toolkits. The nltk is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O’Reilly book available via the NLTK website.
The gensim library describes itself as ‘topic modelling for humans’, so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip, and it is built on NumPy and SciPy, which have to be installed in order to use it.
I had a look at a number of topic modelling tools. The first was Mallet, a tool frequently used for topic modelling. For instance, my colleague Emma Clarke, TCD and now NUIM, used Mallet to extract topics from the 19th century transactions of the Royal Irish Academy (on JSTOR). Her related blog entry is available here. For a detailed description of how to set up and use Mallet I recommend the blog post on the Programming Historian.
Another piece of software that is quite popular for topic modelling in DH is the Topic Modelling Tool (TMT); its use is described, with examples, by Miriam on her DH blog.
After searching the internet for a while I also found a Python module, “gensim”, which claims to be “topic modelling for humans”. It is not as easy to use as the above-mentioned tools, but on its website there is a detailed tutorial, its developer Radim has answered questions in a number of online forums, Google Groups etc., and the API is also very well documented. At a later stage I will use Mallet in order to compare the results that I get from gensim with another topic modelling tool.