A new topic model was created in January 2015 from about 1344 letters of the Letters of 1916 corpus. The workflow and results were presented at DHd 2015 in Graz.
The following two images show Gephi visualisations of this new topic model. It has to be emphasised that, because the Letters of 1916 corpus is constantly growing, this topic model can only be a snapshot of the corpus at that time.
The image on the left shows the topic model coloured according to the categories assigned by the people who uploaded the letters. When a new letter is uploaded to the Letters of 1916 website, the uploader is asked to choose a category for the letter. This is a bit tricky, as many letters might fit into several categories. However, the topic model shows clear clustering of some of these human-assigned categories. For instance, the red nodes are all of the category ‘Love letter’, the blue nodes are all of the category ‘World War I’, and the yellow nodes are of the categories ‘Official documents’ and ‘Business letters’. The orange cluster in the center (Topic 6, or T6) is a category that was called Faith. These letters come from Maynooth and were written during 1916 by students of the St. Patrick’s College Maynooth seminary.
The image on the right shows the same topic model visualisation, but this time the nodes are coloured based on the gender of the author. Like the ‘category’, the ‘gender of the author’ is part of the metadata of each letter, and its value is assigned during the uploading process. The image on its own shows the gender balance within the corpus. If the image is compared to the previous topic model visualisation (on the left), it is interesting to see that letters by women currently fall mostly into the categories ‘Love letters’ and ‘World War I’.
The visualisation below is another attempt to show the number of letters written by men and women. This time, however, the number of pages of each letter was visualised too. A large node means many pages. What is interesting in this visualisation is that the women letter writers clustering around T4 (‘World War I’) have mostly written letters with multiple pages. On the other hand, ‘Official documents’ and ‘Business letters’, clustering around T7, are predominantly one-page letters. This topic model can be misleading and has to be used with care, because a single-page letter might actually contain more words than a two- or three-page letter written by someone with large handwriting. In the future I hope to have time to make a more accurate topic model that uses the actual word count of each letter as the parameter for the node size.
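The word-count idea from the previous paragraph could be sketched roughly as follows. This is only an illustration and not the method used for the visualisations above: the two letters are invented, and `node_size` with its size range is a hypothetical helper, not part of Gephi or any library.

```python
def node_size(word_count, max_words, min_size=10.0, max_size=60.0):
    """Scale a word count linearly into a node size between min_size and max_size."""
    if max_words <= 0:
        return min_size
    share = min(word_count, max_words) / max_words
    return min_size + share * (max_size - min_size)


# Invented transcriptions, for illustration only (not taken from the corpus).
letters = {
    "one_page_dense": {"pages": 1, "text": "word " * 300},
    "three_pages_large_hand": {"pages": 3, "text": "word " * 120},
}

# Count whitespace-separated words in each transcription.
word_counts = {name: len(letter["text"].split()) for name, letter in letters.items()}
max_words = max(word_counts.values())

# Node sizes now follow the amount of text, not the number of pages.
sizes = {name: node_size(count, max_words) for name, count in word_counts.items()}

for name, size in sizes.items():
    print(name, letters[name]["pages"], "page(s)", word_counts[name], "words", size)
```

With sizes derived this way, the dense one-page letter gets a larger node than the three-page letter in large handwriting, which is the opposite of what a page-based size would show.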
The workshop was kicked off by Micheál Ó Siochrú’s lecture on his personal experience with Big Data as a historian through his work on the 1641 Depositions project. Title: Digitising History – why should historians care?
I do not want to reproduce the whole workshop program, because it is available with abstracts and bio notes on the KDEG website. It was a very diverse workshop, bringing together humanities, science and computer science people. I found all the papers quite fascinating. One example is Stanford’s attempt to model the Roman road system: http://orbis.stanford.edu/
“LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected usage of words in passages of discourse.” Sounds difficult, and I think it is. It seems to be the most used technique for topic modelling in DH. Gensim, a great tool for topic modelling, implements it (under the name LSI); Mallet, another great tool, builds its topic models with LDA instead. I found a general introduction here: Introduction
One investigation during my internship is into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship and because Python has good libraries for natural language processing and topic modelling. I tested the nltk and gensim toolkits. The nltk is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O’Reilly book available via the NLTK website.
The gensim library describes itself as ‘topic modelling for humans’, so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip, and it is built on NumPy and SciPy, which have to be installed in order to use it.
The Letters of 1916 project was officially launched on Research Night, 27 September 2013. Recently the project moved from Trinity College Dublin to its next phase at An Foras Feasa, the Digital Humanities center at the National University of Ireland Maynooth (NUIM). Following this move, the ‘Kildare Launch’ of the project took place on 8 May 2014 at NUIM. The evening started with an encoding and digital imaging lab. This was a great chance for the audience to get an introduction to how everyone can contribute to the Letters project by transcribing or uploading letters. The labs were followed by talks by Professor Susan Schreibman, Robert Doyle, Dr Brian Hughes, and Lar Joye. Videos of the presentations should be available soon on the Letters 1916 homepage.
During the internship at the Center of High Performance Computing and the Letters of 1916 project I will build a text analysis tool for the online letter collection. The structure of this analysis tool can be roughly divided into three phases/steps: import of data – text analysis – visual output.
Each of these steps is a challenging task, and a number of issues are apparent from the beginning:
Data Import: The letters are all encoded in some form of TEI/XML. But because this is a crowd-sourcing project, the data is certainly messy, and it is not clear what is encoded and how consistently. The same goes for the metadata. It will therefore be interesting to see how helpful the TEI markup will be in the final text analysis.
Text Analysis and Visual Output: As a first step the text analysis tool will just produce a histogram-like word count and frequency distribution. For the text processing part it will be important to clean the text of punctuation and markup to allow proper tokenization into words.
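The three steps described above (import, text analysis, visual output) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the tool itself: the sample letter is invented, the element layout (`text/body` inside a TEI document) is assumed and may not match the actual encoding of the corpus, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI namespace is the standard one; the element layout is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample letter, not taken from the corpus.
SAMPLE_LETTER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>Dear Mary, the war goes on and the post is slow.</p>
      <p>Don't worry: the news is <hi rend="underline">not</hi> all bad.</p>
    </body>
  </text>
</TEI>"""

# A word is a run of letters, optionally joined by straight apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def import_text(xml_string):
    """Step 1: return the body text of a TEI letter with all markup removed."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields only the text nodes; split/join normalises whitespace.
    return " ".join("".join(body.itertext()).split())


def word_frequencies(text):
    """Step 2: lowercase the text, drop punctuation and digits, count the words."""
    return Counter(TOKEN_RE.findall(text.lower()))


def histogram_lines(freqs, top=5):
    """Step 3: format the most frequent words as a plain-text histogram."""
    return [f"{word:<8} {'#' * count} {count}" for word, count in freqs.most_common(top)]


text = import_text(SAMPLE_LETTER)
freqs = word_frequencies(text)
print("\n".join(histogram_lines(freqs)))
```

Note that `ET.fromstring` raises a `ParseError` on XML that is not well-formed, so messy crowd-sourced files would need to be caught and handled separately. The tokenizer also splits hyphenated words and does not recognise curly apostrophes, which may or may not be what the analysis needs.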
…and there will be more challenges ahead as the internship progresses.
The Letters of 1916 project is a crowd-sourcing project and follows an approach very similar to that of the famous Transcribe Bentham project at UCL. The website is based on a number of different technologies. A WordPress blog is used for the homepage, description and project detail pages. For the letter upload, display and transcription functionality, Omeka is used with the Scripto plugin and MediaWiki at the back-end. The transcription interface is based on the DIYHistory project of the University of Iowa and uses the Transcription Toolbar from the Transcribe Bentham project. The website was developed and maintained by the team of Trinity College Dublin High Performance Computing, especially Juliusz Filipowski, Paddy Doyle and Dermot Frost, and designed by Karolina Badzmierowska, PhD candidate in the Digital Arts and Humanities at Trinity College Dublin.