Category Archives: Digital Humanities

Twitter chat on #AskLetters1916

Roughly once a month the Letters of 1916 project organises a Twitter chat. Different topics related to the letters project have been discussed in the past, such as Women in the Great War and Crowdsourcing. Tonight the discussion was about Text Analysis and Topic Modelling of the 1916 letters.

Here is a link to the Twitter page: Link

Close of my internship

My internship finished at the end of July, and I spent the final week wrapping everything up: creating a few nice visualisations of my dataset, cleaning up my Python scripts, and writing a report.

The Python scripts that I used for extracting the texts and metadata and for cleaning the texts are available on GitHub. For Topic Modelling I used Mallet and Gensim. Mallet is a Java tool, while Gensim is a Python library. My Gensim implementation can also be found on GitHub.
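For readers curious what "cleaning" means here, a minimal sketch of typical preprocessing for topic modelling (illustrative only; the actual scripts are on GitHub, and the function name and example below are hypothetical):

```python
import re
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

STOPWORDS = set(stopwords.words('english'))

def clean_text(raw):
    """Turn one raw letter transcription into a list of cleaned tokens."""
    text = raw.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # drop digits and punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

# e.g. clean_text("Dublin, 24th April 1916. My dear mother ...")
# -> ['dublin', 'april', 'dear', 'mother']
```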

I started my internship knowing little about topic modelling and related tools. During my internship I learned about topic modelling as a technique for investigating a large corpus of text, and about the benefits and pitfalls of using it. I explored my dataset using tools such as Mallet, Gensim, nltk, and Gephi. I learned more about Python programming, including how to optimise programs and make them run faster. Finally, I also learned a good bit about my dataset, the letters of 1916, and the issues involved in working with it. I wrote a short internship report for DAH focusing on the objectives of the internship and my learning outcomes.

My internship report as pdf: ReportForDAHJuly2014

Generating 4, 8, 12, 16 topics

When the Letters of 1916 corpus is clustered into the 16 topics generated with Gensim and Mallet, it seems that 16 topics might be too many. In one of my last posts I showed visualisations created with Gephi, in which I coloured the letter nodes based on the category assigned by the person who uploaded the letter. Only letters assigned to four or five of these categories actually clustered together. So after talking with my internship supervisor, Dermot, we decided that I would reduce the number of topics to see what happens, creating visualisations for 4, 8, and 12 generated topics. I observed that with 4, 8, and 12 topics the clustering stayed essentially the same as with 16 topics. However, with fewer topics many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.

4 topics Mallet:

[figure: letters_T4_030_with_lab2]

4 topics Gensim:

[figure: letters_gensim_T4_01_lab]
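For reference, the models behind these visualisations were generated with varying topic counts. A minimal Gensim sketch of that loop, using a toy stand-in corpus (the real letters corpus is prepared by my scripts on GitHub):

```python
from gensim import corpora, models

# Toy stand-in for the cleaned letters: one token list per letter.
documents = [
    ['rising', 'dublin', 'rebels', 'city'],
    ['dear', 'mother', 'love', 'home'],
    ['war', 'front', 'trenches', 'soldiers'],
    ['dear', 'mother', 'home', 'letter'],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train one LDA model per topic count and inspect the top words per topic.
for num_topics in (4, 8, 12, 16):
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, passes=20)
    print('--- %d topics ---' % num_topics)
    for topic in lda.print_topics(num_topics):
        print(topic)
```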

Topics of the 1916 Letters

I recently generated topics from the 1916 Letters project data using two different topic modelling tools: Mallet, a topic modelling program written in Java, and a script I wrote based on the Python topic modelling library Gensim. Mallet uses an implementation of LDA; Gensim uses its own implementation of LDA, but also allows transformations to other models and has wrappers for other implementations. For instance, there is a Mallet wrapper (since version 0.9.0), but I could not get it to work. Anyway, the point is that the standard Gensim implementation of LDA is different from Mallet's, and when I ran Gensim and Mallet on the 1916 Letters data I got different results. At first sight the computer-generated topics did not make much sense to me, but when I clustered the letters according to their relationships to the topics, I found that similar letters clustered together. That showed that both Gensim and Mallet worked.
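To make the two routes concrete, here is a sketch of both in Gensim. Note that the Mallet wrapper needs a local Mallet installation ('path/to/mallet' is a placeholder), and that the wrapper's import path has moved between Gensim versions:

```python
from gensim import corpora, models
from gensim.models.wrappers import LdaMallet  # location in later Gensim releases

# Tiny stand-in corpus: one token list per letter.
docs = [['rising', 'dublin', 'city'],
        ['dear', 'mother', 'home'],
        ['war', 'front', 'trenches']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Route 1: Gensim's own LDA implementation (online variational Bayes).
lda_gensim = models.LdaModel(corpus, id2word=dictionary, num_topics=16)

# Route 2: the Mallet wrapper, which shells out to Mallet's Gibbs sampler.
# 'path/to/mallet' is a placeholder for a local Mallet binary.
lda_mallet = LdaMallet('path/to/mallet', corpus=corpus,
                       id2word=dictionary, num_topics=16)
```

The different inference methods, variational Bayes versus Gibbs sampling, are one reason the two tools produce different topics on the same data.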

Here is a first attempt to generate 16 topics. I chose the number 16 because, at the moment, when people upload their letters to the Letters of 1916 website they have to assign one of 16 predefined topics to their letter, for instance World War 1, Family life, or Art and literature. One of the research questions I am working on is whether the human-assigned topics and the computer-generated topics differ.
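The comparison itself can be kept simple: once every letter has both an uploader-assigned category and a dominant generated topic, a cross-tabulation shows how well they line up. A sketch of one way to do this, with made-up values (all names and numbers below are purely illustrative):

```python
from collections import Counter

# Hypothetical inputs: uploader category and dominant LDA topic per letter.
human_topics   = {'letter1': 'World War 1', 'letter2': 'Family life',
                  'letter3': 'World War 1'}
machine_topics = {'letter1': 3, 'letter2': 7, 'letter3': 3}

# Count how often each category co-occurs with each generated topic.
# If categories map cleanly onto topics, a few pairs dominate the counts.
pairs = Counter((human_topics[d], machine_topics[d]) for d in human_topics)
for (category, topic), count in pairs.most_common():
    print('%-15s topic %-2d  %d letters' % (category, topic, count))
```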

Here is my first Gensim and Mallet topic output:

[figure: Gensim_Mallet_16_topics]

Gephi for the 1916 Letters

Gephi is a suite for interactive visualisation of network data. It is very often used in the Digital Humanities to visualise topic modelling results. As an introduction I suggest just playing around with it; a good how-to read is Gephi for the historically inclined. The best approach, however, is to get a few datasets and just try out Gephi. For examples, see the blogs mentioned below.

The essential challenge is to transform the output you get from Mallet or Gensim into useful input for Gephi (edges and nodes files). On his blog, Elijah goes into detail explaining how he visualised the Mallet output.

I wrote a function in my export/outputter module that converts Mallet output to Gephi edges data and saves it to a file; a minimal sketch of the idea follows below. To view the actual module, feel free to have a look at my project on GitHub.
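This sketch assumes the older Mallet --output-doc-topics layout (doc id, doc name, then alternating topic/proportion pairs); the threshold is an arbitrary example value, and the real function on GitHub differs in detail:

```python
import csv

def mallet_doc_topics_to_gephi_edges(doc_topics_path, edges_path, threshold=0.1):
    """Write a Gephi edges CSV linking each letter to its weightier topics."""
    with open(doc_topics_path) as infile, open(edges_path, 'w') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['Source', 'Target', 'Weight'])  # Gephi's edge headers
        for line in infile:
            if line.startswith('#'):   # skip Mallet's comment/header line
                continue
            fields = line.split()
            doc_name = fields[1]
            # Walk the alternating (topic, proportion) pairs after the name.
            pairs = fields[2:]
            for topic, proportion in zip(pairs[::2], pairs[1::2]):
                if float(proportion) >= threshold:
                    writer.writerow([doc_name, 'topic_' + topic, proportion])
```

Loaded through Gephi's Data Laboratory, this produces a bipartite letter-topic network, and letters that share topics then cluster together under a force-directed layout.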

Summer schools on Python for text analysis

There are two summer schools on text analysis using Python this year. From 22nd July to 1st August, the Joint Culture & Technology and CLARIN-D Summer School takes place in Leipzig. I attended this summer school a few years ago. It was great: many people, a great atmosphere, and Leipzig is a lovely place. Anyway, this year they have a module on Python for text analysis: Advanced Topics in Humanities Programming with Python.

The second summer school is the DARIAH International Digital Humanities Summer School in Göttingen, from 17th to 30th August. They also run a module on Python for text analysis. I was there last year and it was great. The instructors were fantastic and we learned a lot. I would definitely recommend it.

The Humanities Programmer

Following a comment by Alex O’Connor, I pushed all my code up to GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody might be interested in studying how I write the code for this project. Thinking about it more closely, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. The ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ was essential, but tasks like text formatting also required programming knowledge at that time. One of these articles also emphasises that in the future programming languages for ‘complex pattern-matching problems that arise in musical or graphic art analysis’ would become important too. A 1971 article in the same journal gives an overview of languages ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years among humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, while Python is becoming more popular as a language for Natural Language Processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars, such as the 2013 DARIAH Summer School in Göttingen, this year’s summer school in Göttingen, and the ESU in Leipzig. Also, the Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, moved from teaching Java to teaching Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for a single research task), it becomes relevant to explore how these scholars, who often have no CS background, write code and build software. Such studies will contribute to the future development of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project progressed, and some might even be interested in contributing.