Topics of the 1916 Letters

I recently generated topics from the 1916 Letters project data using two different topic modelling tools: Mallet, a topic modelling program written in Java, and a script I wrote based on the Python topic modelling library Gensim. Both use LDA, but Gensim has its own implementation of it, also allows transformation to other models, and has wrappers for other implementations. For instance, there is also a Mallet wrapper (since version 0.9.0), but I could not get it to work. Anyway, the point is that the standard Gensim implementation of LDA differs from Mallet's (Mallet relies on Gibbs sampling, while Gensim uses online variational Bayes), and when I ran Gensim and Mallet on the 1916 Letters data I got different results. At first sight the computer-generated topics did not make much sense to me, but when I clustered the letters according to their relationships to the topics I found that similar letters clustered together. That suggests both Gensim and Mallet worked.
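To make the clustering step more concrete, here is a minimal sketch in plain Python. It is not the script I used for the project. It assumes each letter has already been reduced to a list of topic weights, which is the kind of per-document output both Gensim and Mallet can give you. The letter ids and the numbers are invented for illustration, and grouping by dominant topic is only a simple stand-in for a real clustering method.

```python
import math

# Hypothetical document-topic weights: one entry per letter, one weight per topic.
# Ids and numbers are invented for illustration, not taken from the 1916 data.
doc_topics = {
    "letter_a": [0.80, 0.15, 0.05],
    "letter_b": [0.75, 0.20, 0.05],
    "letter_c": [0.05, 0.10, 0.85],
    "letter_d": [0.10, 0.05, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(letter_id, doc_topics):
    """Return the id of the other letter whose topic weights are closest."""
    target = doc_topics[letter_id]
    others = (
        (other, cosine(target, weights))
        for other, weights in doc_topics.items()
        if other != letter_id
    )
    return max(others, key=lambda pair: pair[1])[0]


def cluster_by_dominant_topic(doc_topics):
    """Group letters by the index of their highest-weighted topic."""
    clusters = {}
    for letter_id, weights in doc_topics.items():
        dominant = max(range(len(weights)), key=weights.__getitem__)
        clusters.setdefault(dominant, []).append(letter_id)
    return clusters


clusters = cluster_by_dominant_topic(doc_topics)
```

With these made-up weights, letters a and b end up in one group and letters c and d in another, which is the kind of pattern I looked for in the real data.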

Here is a first attempt to generate 16 topics. I chose the number 16 because at the moment, people who upload their letters to the Letters of 1916 website have to assign one of 16 predefined topics to each letter. Examples of these topics are World War 1, Family life, and Art and literature. One of the research questions I am working on is whether the human-assigned topics and the computer-generated topics differ.
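One possible way to approach that comparison is a simple cross-tabulation of human-assigned topics against computer-generated ones. The sketch below is only an illustration under assumptions: it supposes each letter has been given one dominant computer-generated topic id, and the pairs listed are invented, not real project data. The `purity` function is a name I made up for the example; it reports the share of letters whose human label is the most common one within their computer-generated topic.

```python
from collections import Counter, defaultdict

# Hypothetical pairs: (human-assigned topic, dominant computer-generated topic id).
# Invented for illustration, not taken from the 1916 data.
assignments = [
    ("World War 1", 3),
    ("World War 1", 3),
    ("Family life", 7),
    ("Family life", 3),
    ("Art and literature", 11),
]


def cross_tabulate(pairs):
    """Count how often each human topic co-occurs with each computer topic."""
    table = defaultdict(Counter)
    for human, topic in pairs:
        table[human][topic] += 1
    return table


def purity(pairs):
    """Share of letters carrying the majority human label of their computer topic."""
    by_topic = defaultdict(Counter)
    for human, topic in pairs:
        by_topic[topic][human] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in by_topic.values())
    return matched / len(pairs)


table = cross_tabulate(assignments)
score = purity(assignments)
```

A score close to 1 would mean the computer-generated topics line up with the human categories; a low score would mean the two ways of grouping the letters differ a lot.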

Here is my first Gensim and Mallet topic output:

Gensim_Mallet_16_topics

 
