Summer schools on Python for text analysis

There are two summer schools on text analysis using Python this year. From 22nd July to 1st August, the Joint Culture & Technology and CLARIN-D Summer School takes place in Leipzig. I attended this summer school a few years ago and it was great: many people, a great atmosphere, and Leipzig is a lovely place. This year they offer a module on Python for text analysis: Advanced Topics in Humanities Programming with Python.

The second summer school is the DARIAH International Digital Humanities Summer School in Göttingen, from 17th to 30th August. They also offer a module on Python for text analysis. I was there last year and it was great; the instructors were fantastic and we learned a lot. I would definitely recommend it.


1st international workshop on computational history

Today the first Irish international workshop on computational history was held at the Royal Irish Academy. The workshop was organised by the Knowledge and Data Engineering Group (KDEG) in TCD.

The workshop was kicked off by Micheál Ó Siochrú’s lecture, “Digitising History – why should historians care?”, on his personal experience with Big Data as a historian through his work on the 1641 Depositions project.

I do not want to reproduce the whole workshop programme here, because it is available with abstracts and bio notes on the KDEG website. It was a very diverse workshop, bringing together humanities, science, and computer science people. I found all papers quite fascinating, for instance Stanford’s attempt to model the Roman road system: http://orbis.stanford.edu/

Stemming words in the transcriptions

In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested was to use a stemmer.

A stemmer is a program that reduces a word to its word stem or base form. For instance, in English a stemmer removes suffix endings such as -ed, -ly, -s, and -ing. Hence, ‘walking’, ‘walked’, and ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. One problem, however, is that the stemmer can sometimes be too radical and produce stems like ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not affect the analysis, but when presenting the results it might be worth checking the correct spelling.

I implemented a stemmer from nltk.stem and saved a list of the original words and their stemmed forms to a file. This allowed me to spot stemming issues. The following is my stemming function:


from nltk import stem


def stemmer(wordlst):
    # Return a list of (original word, stemmed word) tuples.
    st = stem.PorterStemmer()
    stem_words = []
    for w in wordlst:
        stem_words.append((w, st.stem(w)))
    return stem_words
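
The saving step mentioned above is straightforward; a minimal sketch could look like the following. The helper name save_stems and the file name stem_review.txt are my own placeholders for illustration, not part of the project code.


def save_stems(stem_words, path="stem_review.txt"):
    # Write each pair as "original:stem", one per line, so overly
    # radical stems such as "county:counti" are easy to spot.
    with open(path, "w") as f:
        for original, stemmed in stem_words:
            f.write("%s:%s\n" % (original, stemmed))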

Spell checking with PyEnchant

PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in the 1916 Letters corpus. The cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and correct words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.

With regard to spelling, there are several issues that could negatively influence the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters themselves contain wrong spelling that is not corrected by the transcriber. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse educations and from different countries, so naturally the spelling will differ in some cases. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.

My spell check function is included in the cleaner module and looks something like this at the moment:


import os

import enchant


def spell_checking(wordlst):
    # Lower-case the personal word list and write it to a temporary file,
    # so it matches the lower-cased transcription tokens.
    with open(SPELL_CHECK_PWL, "r") as f:
        all_pwl = f.read().lower()
    temp_pwl_file = SPELL_CHECK_PWL + ".tmp"
    with open(temp_pwl_file, "w") as f:
        f.write(all_pwl)
    d = enchant.DictWithPWL("en_US", temp_pwl_file)
    err = []
    for w in wordlst:
        if not d.check(w):
            try:
                # Keep the word and the first suggestion as a candidate error.
                first_sug = d.suggest(w)[0]
                if w != first_sug.lower():
                    err.append((w, first_sug))
            except IndexError:
                # The dictionary has no suggestion for this word.
                err.append((w, None))
    os.remove(temp_pwl_file)
    return err

The result will be a file that contains a list of suspected spelling errors and a guess at a correction. The global variable SPELL_CHECK_PWL refers to a personal word list file. I add a word to the PWL every time the spell checker thinks a word is wrong, but it is actually correct and I do not want it corrected.

A sample from the result file looks something like this:

1000.0.txt:
barrington:Harrington
oct:cot
preists:priests
glendalough:Glendale
glenlough:unploughed
irelands:ire lands

1004.0.txt:
clonbrook:cloakroom

1006.0.txt:
organisation:organization
belfort:Belfast
hanly:manly
chau:char
organisation:organization
wallpole:wall pole
especally:especially
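
For completeness, here is a minimal sketch of how the results for one document could be written out in the per-document format shown above. The function name write_spell_report and the way the document name and output file are passed in are my own assumptions for illustration, not the actual cleaner module code.


def write_spell_report(doc_name, wordlst, outfile):
    # Append the suspected errors of one document in the
    # "word:suggestion" format shown above.
    err = spell_checking(wordlst)
    if err:
        outfile.write("%s:\n" % doc_name)
        for word, suggestion in err:
            # suggestion is None when the dictionary had nothing to offer
            outfile.write("%s:%s\n" % (word, suggestion or ""))
        outfile.write("\n")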

Cleaning a messy corpus

Working with the 1916 data I found (what people with experience have always told me) that cleaning your data is an essential step. It could even be the most important step. Inconsistent, messy, and faulty data leads to problems and wrong results in the analysis and interpretation stages of your research.

With regard to the 1916 letters, wrong spelling, inconsistent markup and comments in the text, and inconsistent metadata are all sources of error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke, a former MPhil student here in TCD, recently did a topic modelling project, and talking to her and reading her MPhil thesis was very helpful. Furthermore, I found the O’Reilly Bad Data Handbook an interesting read.

The Humanities Programmer

Following a comment by Alex O’Connor I pushed all my code up on GitHub. I had planned to do this at some stage, but it never crossed my mind that somebody would be interested in studying how I am writing the code for this project. Thinking about it more closely, it is actually a fascinating topic. More and more humanities researchers with little or no CS background learn programming languages in order to have another tool in their toolbox for text processing, online publishing, etc.

The interest in and use of programming languages by humanities scholars goes back to the 1960s and 1970s, when concordance and collation software was developed. The use of this software required at least some knowledge of a programming language. From 1966 on, a number of articles about programming languages for humanities research appeared in the journal Computers and the Humanities. The ability of a language to allow the humanities scholar ‘to split, scan, measure, compare, and join strings’ was essential, but tasks like text formatting also required programming knowledge at that time. One of these articles also emphasizes that in the future programming languages for “complex pattern-matching problems that arise in musical or graphic art analysis” would become important too. A 1971 article in the same journal gives an overview of languages ‘easy to learn’ for humanities scholars (ALGOL, APL/360, BASIC, COBOL, FORTRAN, PL/I, SNAP, SNOBOL).

The most popular languages of recent years among humanities scholars are probably JavaScript, PHP, and Python: JavaScript and PHP because of their frequent use in web development, while Python is becoming more popular as a language for Natural Language Processing. This is demonstrated, for instance, by the many courses and summer schools addressing Python programming for humanities scholars, such as the 2013 DARIAH Summer School in Göttingen, this year's Summer School in Göttingen, or the ESU in Leipzig. Also, the Austrian Centre for Digital Humanities in Graz, where I studied DH before coming to Dublin, moved from teaching Java programming to Python. Python is certainly a much more accessible language for humanities scholars and very useful for text processing. With more and more humanities scholars using programming languages (sometimes only as a tool for a single research task), it becomes relevant to explore how these scholars, often without a CS background, write code and develop software. Such studies will contribute to future developments of programming languages.

Long story short, I uploaded the latest version of my Python code to GitHub, so interested people can observe how my project progressed, and some might even be interested in contributing.


tf-idf – Term Frequency Inverse Document Frequency

Term Frequency Inverse Document Frequency, or tf-idf for short, is a way to measure how important a term is in the context of a document or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
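
As a rough illustration, the classic formula can be written in a few lines of Python. The function and parameter names here are just for this sketch, and libraries typically add their own weighting and normalisation on top of this basic scheme.


import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    # Term frequency: how often the term occurs in this document,
    # relative to the document's length.
    tf = term_count / float(doc_length)
    # Inverse document frequency: terms that appear in fewer
    # documents of the corpus get a higher weight.
    idf = math.log(num_docs / float(docs_with_term))
    return tf * idf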

With gensim, tf-idf can be calculated using the gensim.models.TfidfModel class:

from gensim import models

tfidf = models.TfidfModel(corpus)  # train the model on a bag-of-words corpus
doc_bow = [(0, 1), (1, 1)]  # bag-of-words vector, for instance created by doc2bow
print(tfidf[doc_bow])
# Result: [(0, 0.70710678), (1, 0.70710678)]

This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words”, i.e. a list of tuples of word id and frequency, is used as the corpus, and the TfidfModel class transforms the raw counts into “TfIdf real-valued weights”.
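
The same trained model can also be applied to the whole corpus at once, as in the gensim tutorial; this assumes corpus is the same bag-of-words corpus used to train the model above.


corpus_tfidf = tfidf[corpus]  # wrap the whole corpus in the transformation
for doc in corpus_tfidf:
    print(doc)  # each document as a list of (word id, tf-idf weight) tuples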