New topic model created in January 2015 with about 1344 letters of the Letters of 1916 corpus. The workflow and results were presented at DHd 2015 in Graz.
The following two images show Gephi visualisations of this new topic model. It has to be emphasised that, because the Letters of 1916 corpus is constantly growing, this topic model can only be a snapshot of the corpus at the time.
The image on the left shows the topic model colored following the categories assigned by the people who uploaded the letters. When a new letter is uploaded to the Letters of 1916 website, the uploader is asked to choose a category for it. This is a bit tricky, as many letters might fit into several categories. However, the topic model shows clear clustering of some of these human-assigned categories. For instance, the red nodes all belong to the category ‘Love letter’, the blue nodes to ‘World War I’, and the yellow nodes to ‘Official documents’ and ‘Business letters’. The orange cluster in the center (Topic 6 – or T6) corresponds to a category called ‘Faith’: letters written during 1916 by students of the St. Patrick’s College Maynooth seminary.
The image on the right shows the same topic model visualisation, but this time the nodes are colored based on the gender of the author. Like the ‘category’, the ‘gender of the author’ is part of the metadata of each letter, and the value is assigned during the uploading process. On its own, the image shows the gender balance within the corpus. If the image is compared to the previous topic model visualisation (on the left), it is interesting to see that currently letters by women fall mostly into the categories ‘Love letters’ and ‘World War I’.
The visualisation below is another attempt to show the number of letters written by men and women. However, this time the number of pages of each letter is visualised too: a large node means many pages. What is interesting in this visualisation is that the women letter writers clustering around T4 (‘World War I’) have mostly written letters with multiple pages. ‘Official documents’ and ‘Business letters’, on the other hand, clustering around T7, are predominantly one-page letters. This topic model can be misleading and has to be used with care, because a single-page letter might actually contain more words than a two- or three-page letter written by someone with large handwriting. In the future I hope to have time to make a more accurate topic model with the actual word counts of the letters as a parameter for the size.
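Sizing the nodes by word count rather than page count would only need a small preprocessing step. A hypothetical sketch (the function name and the column layout of the Gephi nodes file are my own choices, not part of the project scripts):

```python
import csv

def write_nodes_with_sizes(letters, path):
    """Write a Gephi nodes CSV where node size is the letter's word count.

    `letters` maps a letter id to its transcription text.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label", "Size"])
        for letter_id, text in letters.items():
            # A crude word count: split on whitespace.
            writer.writerow([letter_id, letter_id, len(text.split())])
```

In Gephi the ‘Size’ column could then drive the node size via the ranking panel.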
About every month the Letters of 1916 project organises a Twitter chat. Different topics related to the letters project have been discussed in the past – Women in the Great War, Crowdsourcing, etc. Tonight the discussion was about Text Analysis and Topic Modelling of the 1916 letters.
My internship finished at the end of July, and I used the last week to wrap everything up: create a few nice visualisations of my dataset, clean up my Python scripts and write a report.
The Python scripts that I used for extracting the texts and metadata and for cleaning the texts are available on GitHub. For topic modelling I used Mallet and Gensim. Mallet is a Java tool, while Gensim is a Python library. My implementation of Gensim can also be found on GitHub.
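The actual scripts are on GitHub; as an illustration, the core of the text cleaning amounts to something like this (a simplified sketch with my own function name and rules, not the exact code from the repository):

```python
import re

def clean_text(text, stopwords=frozenset()):
    """Lowercase, strip punctuation and digits, drop stopwords and short tokens."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Keep tokens longer than two characters that are not stopwords.
    return [t for t in text.split() if len(t) > 2 and t not in stopwords]
```

The resulting token lists can then be fed straight into Mallet or Gensim.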
I started my internship knowing little about topic modelling and related tools. During the internship I learned about topic modelling as a tool to investigate a large corpus of text, and about the benefits and pitfalls of using this technique. I explored my data set using tools such as Mallet, Gensim, nltk, and Gephi. I learned more about Python programming, and how to optimise programs and make them faster. Finally, I also learned a good bit about my data set, the letters of 1916, and the issues involved in working with it. I wrote a short internship report for DAH focusing on the objectives of the internship and my learning outcomes.
When the Letters of 1916 corpus is clustered into the 16 topics generated with Gensim and Mallet, it seems that 16 topics might be too many. In one of my last posts I showed visualisations created with Gephi, in which I colored the letter nodes based on the categories assigned by the person who uploaded each letter. Only letters assigned to four or five of these categories actually clustered together. So after I talked with my internship supervisor Dermot, it was decided that I would try reducing the number of topics to see what happens, creating visualisations for 4, 8 and 12 generated topics. I could observe that with 4, 8 and 12 topics the clustering was still the same as with 16 topics. However, fewer topics show that many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.
At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.
Mallet – 16 topics
Gensim – 16 topics
The Mallet output clearly shows a clustering of human-assigned topics (colors) around computer-generated topics (the black nodes, numbered Topic 0–15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: letters categorised as World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of letters with these two categories is quite close:
The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish question are all over the place, and it is questionable whether these are useful categories. The remaining categories are not used much at the moment, and they are not really visible. However, they could become more important as the data set grows.
The visualisation of the Gensim topics is not so clear at first glance, because there are many more edges. But a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more to address information in the letters, and this influences the topic modelling process. This can be observed by looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI-encoded, stripping out address information, salutations, etc. will be easier and much clearer topics can be generated.
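Once the letters are TEI-encoded, dropping the address and salutation could be as simple as skipping the corresponding elements when extracting the text. A sketch with the standard library, assuming the material to be stripped sits in elements such as `<opener>`, `<address>`, `<salute>` and `<closer>` (element names taken from the TEI guidelines; the exact encoding scheme for the project is not fixed yet):

```python
import xml.etree.ElementTree as ET

SKIP = {"opener", "address", "salute", "closer"}

def body_text(tei_xml):
    """Return the letter text with opener/address/salutation elements removed."""
    root = ET.fromstring(tei_xml)
    parts = []

    def walk(el):
        tag = el.tag.rsplit("}", 1)[-1]  # ignore a TEI namespace prefix
        if tag in SKIP:
            return
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:  # tail text belongs to the parent, keep it
                parts.append(child.tail)

    walk(root)
    return " ".join(" ".join(parts).split())
```

The cleaned body text could then go through the normal tokenising and topic modelling pipeline.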
I recently generated topics from the 1916 Letters project data using two different topic modelling tools: Mallet, a topic modelling program written in Java, and a script I wrote based on the Python topic modelling library Gensim. Mallet uses an implementation of LDA; Gensim uses its own implementation of LDA, but also allows transformation to other models and has wrappers for other implementations. For instance, there is also a Mallet wrapper (since version 0.9.0), but I could not get it to work. Anyway, the point is that the standard Gensim implementation of LDA is different from Mallet’s, and when I ran Gensim and Mallet on the 1916 Letters data I got different results. At first sight the computer-generated topics did not make much sense to me, but when I clustered the letters according to their relationships to the topics, I found that similar letters would cluster together. So that showed both Gensim and Mallet worked.
Here is a first attempt to generate 16 topics. I chose the number 16 because, at the moment, when people upload their letters to the Letters of 1916 website they have to assign one of 16 predefined topics to their letter. Topics are, for instance: World War 1, Family life, Art and literature, etc. One of the research questions I am working on is whether the human-assigned topics and the computer-generated topics differ.
Gephi is a suite for interactive visualisation of network data. It is very often used for topic modelling in the Digital Humanities. As an introduction I suggest just playing around with it; a good how-to read is Gephi for the historically inclined. The best approach, however, is to get a few data sets and just try to use Gephi. For examples see the following blogs:
Essentially, a challenge is to transform the output you get from Mallet or Gensim into useful input for Gephi (edges and nodes files). On his blog, Elijah explains in detail how he visualised the Mallet output.
I wrote a function in my export/outputter module that converts Mallet output to Gephi edges data and saves it to a file. To view the module feel free to have a look at my project on GitHub.
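For illustration, a minimal sketch of such a conversion, not the exact function from my repository. It assumes the older Mallet `--output-doc-topics` format, where each row lists the document number, its name, and then alternating topic/proportion pairs:

```python
def mallet_to_edges(doc_topics_text, threshold=0.05):
    """Turn a Mallet doc-topics file into Gephi edge tuples.

    Each edge links a letter to a topic, weighted by the topic proportion;
    weak associations below `threshold` are dropped to keep the graph readable.
    """
    edges = []
    for line in doc_topics_text.splitlines():
        if not line or line.startswith("#"):  # skip the header comment
            continue
        fields = line.split()
        doc_name = fields[1]
        pairs = fields[2:]  # alternating topic id / proportion
        for topic, proportion in zip(pairs[0::2], pairs[1::2]):
            weight = float(proportion)
            if weight >= threshold:
                edges.append((doc_name, "T" + topic, weight))
    return edges
```

The resulting tuples map directly onto the Source, Target and Weight columns of a Gephi edges file.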
There are two summer schools on text analysis using Python this year. From the 22nd of July to the 1st of August is the Joint Culture & Technology and CLARIN-D Summer School in Leipzig. I attended this summer school a few years ago. It was great: many people, a great atmosphere, and Leipzig is a lovely place. Anyway, this year they have a module on Python for text analysis: Advanced Topics in Humanities Programming with Python.
The workshop was kicked off by Micheál Ó Siochrú’s lecture on his personal experience with Big Data as a historian through his work on the 1641 Depositions project. Title: Digitising History – why should historians care?
I do not want to reproduce the whole workshop program, because it is available with abstracts and bio notes on the KDEG website. It was a very diverse workshop bringing together humanities, science and computer science people. I found all the papers quite fascinating. For instance, Stanford’s attempt to model the Roman road system: http://orbis.stanford.edu/
In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another suggested processing step was to use a stemmer.
A stemmer is a program that reduces a word to its stem or base form. For instance, in English a stemmer would remove suffixes such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’ and ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. A problem, however, is that the stemmer can sometimes be too radical and produce ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not affect the analysis, but when presenting the results it might be worth checking the correct spelling.
I implemented a stemmer from nltk.stem and saved a list of each original word and its stemmed form to a file. This allowed me to spot stemming issues. Following is my stemming function:
from nltk import stem

st = stem.PorterStemmer()
stem_words = []
for w in wordlst:
    stem_words.append((w, st.stem(w)))
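A few standalone examples of the Porter stemmer’s behaviour, including the over-stemming cases mentioned above:

```python
from nltk.stem import PorterStemmer

st = PorterStemmer()
# 'walking' family collapses to 'walk'; 'july', 'county' and 'enclose'
# show the over-stemmed forms discussed in the text.
for word in ["walking", "walked", "walks", "july", "county", "enclose"]:
    print(word, "->", st.stem(word))
```

Printing such word/stem pairs, or writing them to a file as above, is a quick way to audit what the stemmer does to your vocabulary.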