Roughly once a month the Letters of 1916 project organises a Twitter chat. Past chats have covered different topics related to the letters project, such as Women in the Great War and Crowdsourcing. Tonight the discussion was about Text Analysis and Topic Modelling of the 1916 letters.
My internship finished last week (end of July). I used that week to wrap everything up: I created a few nice visualisations of my dataset, cleaned up my Python scripts and wrote a report.
The Python scripts that I used for extracting the texts and metadata and for cleaning the texts are available on GitHub. For Topic Modelling I used Mallet and Gensim. Mallet is a Java tool, while Gensim is a Python library. My implementation of Gensim can also be found on GitHub.
I started my internship knowing little about topic modelling and related tools. During my internship I learned about topic modelling as a tool to investigate a large corpus of text, and about the benefits and pitfalls of using this technique. I explored my data set using tools such as Mallet, Gensim, nltk, and Gephi. I learned more about Python programming, including how to optimise programs and make them faster. Finally, I also learned a good deal about my data set, the letters of 1916, and about the issues involved in working with it. I wrote a short internship report for DAH focusing on the objectives of the internship and my learning outcomes.
The Letters of 1916 project was officially launched on Research Night, 27 September 2013. Recently the project moved from Trinity College Dublin to its next phase at An Foras Feasa, the Digital Humanities center at the National University of Ireland Maynooth (NUIM). Following this move, the ‘Kildare Launch’ of the project took place on 8 May 2014 at NUIM. The evening started with an encoding and digital imaging lab. This was a great chance for the audience to get an introduction to how everyone can contribute to the Letters project by transcribing or uploading letters. The labs were followed by talks by Professor Susan Schreibman, Robert Doyle, Dr Brian Hughes, and Lar Joye. Videos of the presentations should be available soon on the Letters 1916 homepage.
During the internship at the Center of High Performance Computing and the Letters of 1916 project I will build a text analysis tool for the online letter collection. The structure of this analysis tool can be roughly divided into three phases/steps: import of data – text analysis – visual output.
Each of these steps is a challenging task, and a number of issues are apparent from the outset:
Data Import: The letters are all encoded in some form of TEI/XML. But because this is a crowd-sourcing project, the data is certainly messy, and it is not clear what is encoded and how consistently. The same is true of the metadata. It will therefore be interesting to see how helpful the TEI markup will be in the final text analysis.
Text Analysis and Visual Output: As a first step the text analysis tool will just produce a histogram-like word count and frequency distribution. For the text processing part it will be important to clean the text of punctuation and markup to allow proper tokenization into words.
…and there will be more challenges ahead as the internship progresses.
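The three phases described above (import, analysis, visual output) could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the tool built during the internship. The sample letter and the function names extract_text, tokenize, word_frequencies and text_histogram are invented for this example, and it assumes that the letters use the standard TEI namespace and keep their text inside a body element.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; assumed here, the real letters may differ.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample letter, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample letter</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Dear Mother, the post is late again.</p>
    <p>The post, dear Mother, is always late!</p>
  </body></text>
</TEI>"""


def extract_text(tei_xml):
    """Phase 1 (import): return the letter text without markup."""
    try:
        root = ET.fromstring(tei_xml)
    except ET.ParseError:
        # Crowd-sourced markup may not be well-formed: strip tags crudely.
        return re.sub(r"<[^>]+>", " ", tei_xml)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        body = root  # no TEI body found: fall back to the whole document
    # Joining with spaces keeps neighbouring paragraphs apart, but it also
    # splits a word that has markup in the middle of it.
    return " ".join(body.itertext())


def tokenize(text):
    """Phase 2a: lower-case the text and keep runs of letters as words."""
    # [^\W\d_] matches any Unicode letter; an internal apostrophe is allowed.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def word_frequencies(text):
    """Phase 2b: count how often each word occurs."""
    return Counter(tokenize(text))


def text_histogram(counts, top=10, width=40):
    """Phase 3 (output): draw the most frequent words as a text bar chart."""
    most_common = counts.most_common(top)
    if not most_common:
        return ""
    peak = most_common[0][1]
    lines = []
    for word, n in most_common:
        bar = "#" * max(1, round(n * width / peak))
        lines.append(f"{word:<15} {bar} {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_histogram(word_frequencies(extract_text(SAMPLE))))
```

Because only the body is read, words from the TEI header (such as the title) do not end up in the counts. Whether that is desirable depends on how much of the metadata should be part of the analysis.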
The Letters of 1916 project is a crowd-sourcing project and follows a very similar approach to the famous UCL Transcribe Bentham project. The website is based on a number of different technologies. A WordPress blog is used for the homepage, description and project detail pages. For the letter upload, display and transcription functionality, Omeka is used with the Scripto plugin and MediaWiki at the back-end. The transcription interface is based on the DIYHistory project of the University of Iowa and uses the Transcription Toolbar from the Transcribe Bentham project. The website was developed and maintained by the team of Trinity College Dublin High Performance Computing, especially Juliusz Filipowski, Paddy Doyle and Dermot Frost, and designed by Karolina Badzmierowska, PhD candidate in the Digital Arts and Humanities at Trinity College Dublin.
The Letters of 1916: Creating History is the first crowd-sourced humanities project in Ireland. The project was launched on Friday September 27th 2013 at Discover Research Night and invites people all over the world to share letters written during and related to the Easter Rising of 1916. Images of letters can be uploaded, read online and transcribed. This project focuses especially on private collections and on the letters and voices of people who were less well known or even forgotten.
In the words of the principal investigator of the project, Susan Schreibman, Professor of Digital Humanities at NUIM:
“Allowing letters from personal collections to be read alongside official letters and letters contributed by institutions will add new perspectives to the events of the period and allow us to understand what it was like to live an ordinary life through what were extraordinary times…All too often our emphasis is on the grand narrative focusing on key political figures. But as we approach the centenary of the Easter Rising we want to try to get a sense of how ordinary people coped with one of the most disruptive periods in contemporary Irish history…” (press release)