Second topic model of the Letters of 1916 corpus, January 2015

This new topic model was created in January 2015 from about 1344 letters of the Letters of 1916 corpus. The workflow and results were presented at DHd 2015 in Graz.

The following two images show Gephi visualisations of this new topic model. It has to be emphasised that, because the Letters of 1916 corpus is constantly growing, this topic model can only be a snapshot of the corpus at the time it was created.

The image on the left shows the topic model colored by the categories assigned by the people who uploaded the letters. When a new letter is uploaded to the Letters of 1916 website, the uploader is asked to choose a category for it. This is a bit tricky, as many letters might fit into several categories. However, the topic model shows clear clustering of some of these human-assigned categories. For instance, the red nodes are all of the category ‘Love letter’, the blue nodes are all of the category ‘World War I’, and the yellow nodes are of the categories ‘Official documents’ and ‘Business letters’. The orange cluster in the center (Topic 6, or T6) corresponds to a category called ‘Faith’. These are letters written during 1916 by students of the seminary at St. Patrick’s College, Maynooth.

The image on the right shows the same topic model visualisation, but this time the nodes are colored by the gender of the author. Like the category, the gender of the author is part of the metadata of each letter, and its value is assigned during the uploading process. The image on its own shows the gender balance within the corpus. If the image is compared to the previous visualisation (on the left), it is interesting to see that letters by women are currently mostly in the categories ‘Love letters’ and ‘World War I’.

Topic model of the Letters 1916 corpus, colored by human-assigned topics

Topic model of the Letters 1916 corpus, colored by gender

The visualisation below is another attempt to show the number of letters written by men and women. This time, however, the number of pages of each letter is visualised too: a large node means many pages. It is interesting that the women writers clustering around T4 (‘World War I’) have mostly written letters of multiple pages. On the other hand, ‘Official documents’ and ‘Business letters’, clustering around T7, are predominantly one-page letters. This topic model visualisation can be misleading and has to be used with care, because a single-page letter might actually contain more words than a two- or three-page letter written by someone with large handwriting. In the future I hope to have time to make a more accurate visualisation that uses the actual word count of each letter as the parameter for node size.

Topic model of the Letters of 1916 corpus, colored by gender and sized by number of pages
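The caveat about page counts can be made concrete with a small sketch. It is not the workflow used for the visualisations here; it only illustrates how a word count per letter could be computed and written to a node table that Gephi's spreadsheet import should accept. The sample letters, the field names (`id`, `gender`, `pages`, `text`) and the helper functions are all hypothetical, and counting whitespace-separated tokens is just one simple way to define a "word".

```python
import csv
import sys

# Hypothetical sample records; the field names are assumptions and do not
# reflect the actual Letters of 1916 metadata schema.
letters = [
    {"id": "L001", "gender": "Female", "pages": 3,
     "text": "Dear Tom, I hope this finds you well."},
    {"id": "L002", "gender": "Male", "pages": 1,
     "text": "Sir, with reference to your letter of the 4th inst. I beg to "
             "state that the account has been settled in full."},
]


def word_count(text):
    """Count whitespace-separated tokens in the transcription."""
    return len(text.split())


def node_rows(records):
    """Build one node row per letter, keeping pages and adding a word count."""
    return [
        {
            "Id": rec["id"],
            "Label": rec["id"],
            "gender": rec["gender"],
            "pages": rec["pages"],
            "words": word_count(rec["text"]),
        }
        for rec in records
    ]


def write_nodes_csv(rows, fh):
    """Write the node rows as CSV with a header line."""
    writer = csv.DictWriter(
        fh, fieldnames=["Id", "Label", "gender", "pages", "words"]
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    write_nodes_csv(node_rows(letters), sys.stdout)
```

In this made-up sample the one-page letter contains more words than the three-page letter, which is exactly the case where sizing nodes by pages would mislead. After importing such a table, the numeric `words` column could be used instead of `pages` when ranking node size in Gephi.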

Link to slides of the DHd paper on the GAMS: slides (pdf)
