Tag Archives: Digital Humanities

Stemming words in the transcriptions

In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested was to use a stemmer.

A stemmer is a program that reduces a word to its word stem or base form. For instance, in English a stemmer would remove suffixes such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’, and ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. One problem, however, is that the stemmer can sometimes be too radical and produce changes such as ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not affect the analysis, but when presenting the results it might be worth checking the correct spelling.

I implemented a stemmer from nltk.stem and saved a list of each original word and its stemmed form to a file. This allowed me to spot stemming issues. The following is my stemming function:


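from nltk import stem

def stemmer(wordlst):
    """Return a list of (original word, stemmed form) pairs."""
    st = stem.PorterStemmer()  # Porter stemmer from nltk.stem
    stem_words = []
    for w in wordlst:
        # keep the original next to its stem so odd stems are easy to spot
        stem_words.append((w, st.stem(w)))
    return stem_words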
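The post does not show how the list was saved, so the following is only a minimal sketch of that step. It assumes a tab-separated file with one pair per line; the helper names `save_stem_pairs` and `changed_pairs` are hypothetical, and the example pairs are the ones mentioned above rather than real output.

```python
import csv

def save_stem_pairs(pairs, path):
    """Write (word, stem) pairs to a tab-separated file, one pair per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(pairs)

def changed_pairs(pairs):
    """Keep only the pairs where stemming actually altered the word."""
    return [(w, s) for w, s in pairs if w != s]

# Example pairs taken from the text above; in practice they would come
# from stemmer(wordlst).
pairs = [
    ("walking", "walk"),
    ("walk", "walk"),
    ("july", "juli"),
    ("county", "counti"),
    ("enclose", "enclos"),
]

# Only the altered words need a manual check before presenting results.
review = changed_pairs(pairs)
```

Calling `save_stem_pairs(review, "stem_review.tsv")` (the file name is an assumption) would then give a short list to read through by eye.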


The Humanities Programmer

Following a comment by Alex O’Connor, I pushed all my code up to GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody would be interested in studying how I am writing the code for this project. On reflection, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. One of these articles describes the ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ as essential, and notes that tasks like text formatting also required programming knowledge at that time. The article also emphasizes that in the future programming languages for “complex pattern-matching problems that arise in musical or graphic art analysis” will become important too. A 1971 article in the same journal gives an overview of languages that are ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years for humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, and Python because it is becoming more popular as a language for Natural Language Processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars. Examples are the 2013 DARIAH Summer School in Goettingen, this year's Summer School in Goettingen, and the ESU in Leipzig. The Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, also moved from teaching Java programming to Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for one research task), it becomes relevant to explore how these scholars, who often have no CS background, write code and generate software. Such studies will contribute to future developments of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project has progressed, and some might even be interested in contributing.

 

XML Encoding of Letters

The Letters of 1916 project is a crowdsourced Digital Scholarly Editing project, and the transcribers are encouraged to mark up the letter transcriptions with TEI/XML tags. The TEI markup should eventually be used to identify features such as line breaks, titles and headers, but also names of people, organisations and places. Because it is assumed that most of the transcribers have no previous experience with TEI or XML, an editor with a toolbar is part of the transcription interface to guide the transcriber towards the correct TEI elements.
One of my first tasks was to look at the crowdsourced transcriptions and find out to what extent they had been encoded. It was interesting to find that there was a lot of markup in place. My replacement function counted 7395 start-tags, end-tags, and XML comments. However, when this is set against the 166149 word tokens of the letters, the amount of encoded text no longer seems so large. The numbers cannot be compared directly, but if we assume that there could be a line break at least every 10 words, we get a rate of about 45% markup. Again, this is highly speculative, because close investigation of individual letters shows that some are encoded in great detail (using the TEI del element, add element, line-break element, hi element and others), while others contain no tags at all.
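The rough estimate above can be reproduced in a few lines. This is only the arithmetic from the paragraph, using its own figures and its assumption of one possible line break every 10 words; the variable names are made up for illustration.

```python
tags = 7395          # start-tags, end-tags and XML comments counted in the post
tokens = 166149      # word tokens in the letters
words_per_line = 10  # assumption from the post: a possible line break every 10 words

# Number of places where a line-break tag could be expected under that assumption.
expected_line_breaks = tokens / words_per_line

# Share of those places that the counted markup would cover.
markup_rate = tags / expected_line_breaks

print(f"{expected_line_breaks:.0f} possible line breaks, markup rate {markup_rate:.1%}")
# prints: 16615 possible line breaks, markup rate 44.5%
```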
The next step was to test whether the transcriptions were well-formed XML and could be parsed with one of Python's libraries. I used the lxml library for this task, and found that over 40% of the letters would throw an XMLSyntaxError. In most cases this was due to the use of ‘&’ instead of the entity ‘&amp;’. After I had dealt with this problem by replacing all ‘&’ before trying to parse the transcription strings as XML, I still counted about 100 XMLSyntaxErrors out of 850 letters. In most of these cases the XML was not well-formed: opening tags without closing tags or (less commonly) overlapping elements.
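A check of this kind can be sketched as follows. This is not the project's code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (so the exception is ParseError rather than XMLSyntaxError), wraps each transcription in a hypothetical `<letter>` root element, and escapes only ampersands that do not already start an entity, which is an assumption about how the replacement might be done.

```python
import re
import xml.etree.ElementTree as ET

# An '&' that is not already the start of a named or numeric entity.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace bare '&' with '&amp;' and leave existing entities alone."""
    return BARE_AMP.sub("&amp;", text)

def is_well_formed(fragment):
    """Return True if the fragment parses once wrapped in a single root element."""
    try:
        # <letter> is a hypothetical wrapper so fragments with several
        # top-level elements or plain text can still be parsed.
        ET.fromstring("<letter>" + fragment + "</letter>")
        return True
    except ET.ParseError:
        return False

# Made-up sample transcriptions illustrating the three problems described above.
samples = {
    "bare ampersand": "Tom & Mary send their love",
    "missing end-tag": "<hi>Dear Sir",
    "overlapping elements": "<hi>Dear <del>Sir</hi></del>",
}

# Check each sample after the ampersand fix has been applied.
results = {name: is_well_formed(escape_bare_ampersands(text))
           for name, text in samples.items()}
```

In this sketch only the first sample parses after the ampersand fix; the other two still fail, which mirrors the remaining errors described above.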

Kildare Launch

The Letters of 1916 project was officially launched on Research Night, 27 September 2013. Recently the project moved from Trinity College Dublin to its next phase at An Foras Feasa, the Digital Humanities centre at the National University of Ireland Maynooth (NUIM). Following this move, the ‘Kildare Launch’ of the project took place on 8 May 2014 at NUIM. The evening started with an encoding and digital imaging lab. This was a great chance for the audience to get an introduction to how anyone can contribute to the Letters project by transcribing or uploading letters. The labs were followed by talks by Professor Susan Schreibman, Robert Doyle, Dr Brian Hughes, and Lar Joye. Videos of the presentations should be available soon on the Letters 1916 homepage.

The Letters of 1916: Creating History

The Letters of 1916: Creating History is the first crowd-sourced humanities project in Ireland. The project was launched on Friday, 27 September 2013 at Discover Research Night and invites people all over the world to share letters written during and related to the Easter Rising of 1916. Images of letters can be uploaded, read online and transcribed. The project focuses especially on private collections and on the letters and voices of people who were less well known or even forgotten.

In the words of the principal investigator of the project, Susan Schreibman, Professor of Digital Humanities at NUIM:

“Allowing letters from personal collections to be read alongside official letters and letters contributed by institutions will add new perspectives to the events of the period and allow us to understand what it was like to live an ordinary life through what were extraordinary times…All too often our emphasis is on the grand narrative focusing on key political figures. But as we approach the centenary of the Easter Rising we want to try to get a sense of how ordinary people coped with one of the most disruptive periods in contemporary Irish history…” (press release)