My internship finished last week (end of July). I spent that final week wrapping everything up: creating a few nice visualisations of my dataset, cleaning up my Python scripts and writing a report.
The Python scripts that I used for extracting the texts and metadata and for cleaning the texts are available on GitHub. For topic modelling I used Mallet and Gensim. Mallet is a Java tool, while Gensim is a Python library. My Gensim implementation can also be found on GitHub.
I started my internship knowing little about topic modelling and related tools. During the internship I learned about topic modelling as a tool for investigating a large corpus of text, and about the benefits and pitfalls of the technique. I explored my data set using tools such as Mallet, Gensim, nltk, and Gephi. I learned more about Python programming, including how to optimise programs and make them faster. Finally, I also learned a good bit about my data set, the letters of 1916, and about the issues involved in working with it. I wrote a short internship report for DAH focusing on the objectives of the internship and my learning outcomes.
At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.
Mallet – 16 topics
Gensim – 16 topics
The Mallet output clearly shows human-assigned topics (colors) clustering around computer-generated topics (the black nodes, numbered Topic 0 to 15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of letters in these two categories is quite close:
The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish question are scattered all over the graph, so it is questionable whether this is a useful categorisation. The remaining categories are not used much at the moment and are barely visible. However, they could become more important as the data set grows.
The visualisation of the Gensim topics is less clear at first glance because there are many more edges, but a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more strongly to address information in the letters, which influences the topic modelling process. This can be observed when looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI-encoded, stripping out address information, salutations, etc. will be easier, and much clearer topics can be generated.
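To illustrate why TEI encoding would make this stripping easier, here is a minimal sketch using only the Python standard library. It is not part of the project: the letters are not TEI-encoded yet, so the sample document, the choice of elements to skip (`opener`, `closer`, `address`, `salute`, `dateline`, `signed`) and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, in the {uri}tag form that ElementTree uses.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Assumption: these TEI elements hold the address, salutation, etc.
SKIP = {TEI_NS + name for name in
        ("opener", "closer", "address", "salute", "dateline", "signed")}


def text_without(element):
    """Collect the text of an element, leaving out any skipped element."""
    if element.tag in SKIP:
        return ""
    pieces = [element.text or ""]
    for child in element:
        pieces.append(text_without(child))
        # The tail is the text after the child, so it belongs to this element.
        pieces.append(child.tail or "")
    return "".join(pieces)


def letter_body_text(tei_xml):
    """Return the body text of a TEI letter with whitespace normalised."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    return " ".join(text_without(body).split())


# Hypothetical, heavily simplified letter.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<opener><address><addrLine>12 Example Street</addrLine></address>
<salute>Dear Mary,</salute></opener>
<p>The weather has been fine.</p>
<closer><signed>John</signed></closer>
</body></text></TEI>"""

cleaned = letter_body_text(sample)
print(cleaned)
```

With the structure marked up like this, the text passed to Mallet or Gensim would no longer contain street names or salutations, so they could not dominate a topic.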
Gephi is a suite for interactive visualisation of network data. It is very often used for visualising topic modelling results in the Digital Humanities. As an introduction I suggest just playing around with it; a how-to reading would be Gephi for the historically inclined. The best approach, however, is to get a few data sets and just try Gephi out. For examples see the following blogs:
Essentially, the challenge is to transform the output you get from Mallet or Gensim into useful input for Gephi (edges and nodes files). On his blog Elijah goes into detail explaining how he visualized the Mallet output.
I wrote a function in my export/outputter module that converts Mallet output to Gephi edges data and saves it to a file. To view the module feel free to have a look at my project on GitHub.
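For readers who just want the general idea, here is a minimal sketch of such a conversion. It is not the actual function from the module. It assumes a tab-separated Mallet doc-topics file in which each line holds a document index, a document name and then one proportion per topic in topic order (older Mallet versions write topic/proportion pairs instead, which would need different parsing). The function names, the sample data and the 0.1 threshold are made up for the example.

```python
import csv
import io


def doc_topics_to_edges(lines, threshold=0.1):
    """Yield (source, target, weight) tuples from Mallet doc-topics lines.

    Assumed line format: index <TAB> name <TAB> one proportion per topic.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a possible header line
        parts = line.split("\t")
        doc_name = parts[1]
        for topic_id, value in enumerate(parts[2:]):
            weight = float(value)
            # Drop weak links so the graph does not drown in edges.
            if weight >= threshold:
                yield (doc_name, f"Topic {topic_id}", weight)


def write_edges_csv(edges, handle):
    """Write edges as CSV with Source, Target and Weight columns."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Hypothetical doc-topics lines for two letters and three topics.
sample_lines = [
    "0\tletter_1.txt\t0.7\t0.05\t0.25",
    "1\tletter_2.txt\t0.02\t0.9\t0.08",
]
edges = list(doc_topics_to_edges(sample_lines))

buffer = io.StringIO()  # stands in for an open file
write_edges_csv(edges, buffer)
csv_text = buffer.getvalue()
```

Gephi's spreadsheet import reads the `Source`, `Target` and `Weight` columns as an edge list. The threshold is a design choice: a low value keeps more edges and gives a denser graph, a high value keeps only the strongest document-topic links.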