Category Archives: Text Processing

Twitter chat on #AskLetters1916

About every month the Letters of 1916 project organises a Twitter chat. Different topics related to the letters project have been discussed in the past – Women in the Great War, Crowdsourcing, etc. Tonight the discussion was about Text Analysis and Topic Modelling of the 1916 letters.

Here is a link to the Twitter page: Link

Generating 4, 8, 12, 16 topics

When the Letters of 1916 corpus is clustered around the 16 topics generated with Gensim and Mallet, 16 topics seem to be too many. In one of my recent posts I showed visualisations created with Gephi, in which I colored the letter nodes according to the category assigned by the person who uploaded the letter. Only letters assigned to four or five of these categories actually clustered together. After talking with my internship supervisor Dermot, we decided that I should reduce the number of topics to see what happens, and create visualisations for 4, 8, and 12 generated topics. I could observe that with 4, 8, and 12 topics the clustering was still the same as with 16 topics. However, with fewer topics it becomes apparent that many letters from generic categories such as 1916 Rising or The Irish Question cluster with one of the four distinct topics.
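One way to check such an observation without a graph is to count, for each computer-generated topic, which human-assigned categories its letters carry. The following is only a minimal sketch: the function names and the toy data are invented for illustration, and it assumes each letter comes with a dictionary of topic weights, which is not necessarily how the project stores its output.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_weights):
    """Return the topic id with the highest weight for one letter."""
    return max(topic_weights, key=topic_weights.get)


def categories_per_topic(letters):
    """Count human-assigned categories among the letters of each topic.

    `letters` is a list of (category, {topic_id: weight}) pairs.
    """
    table = defaultdict(Counter)
    for category, topic_weights in letters:
        table[dominant_topic(topic_weights)][category] += 1
    return table


# Invented toy data, not taken from the Letters of 1916 corpus.
letters = [
    ("World War 1", {0: 0.7, 1: 0.2, 2: 0.1}),
    ("Family life", {0: 0.6, 1: 0.3, 2: 0.1}),
    ("Love letters", {0: 0.1, 1: 0.8, 2: 0.1}),
    ("1916 Rising", {0: 0.5, 1: 0.1, 2: 0.4}),
    ("Official documents", {0: 0.1, 1: 0.1, 2: 0.8}),
]

for topic, counts in sorted(categories_per_topic(letters).items()):
    print(topic, dict(counts))
```

A topic whose counter is dominated by one or two categories corresponds to a tight, single-colored cluster in the graph, while a generic category that appears under several topics is one that does not cluster on its own.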

4 topics Mallet:

letters_T4_030_with_lab2

4 topics Gensim:

letters_gensim_T4_01_lab

Letters of 1916: Visualising 16 Topics

At first I generated 16 topics (the reason is explained in a previous post) with Gensim and Mallet. When I visualised my data with Gephi I got an interesting result.

Mallet – 16 topics

letters_T16_01_lab

Gensim – 16 topics

letters_gensim_T16_01_lab

The Mallet output clearly shows a clustering of human-assigned topics (colors) around computer-generated topics (the black nodes, numbered Topic 0 – 15). Letters assigned to at least four categories also seem to cluster together based on the computer-generated topics: letters categorised as World War 1, Family life, Official documents and Love letters. See, for instance, the clustering of letters assigned to the categories WW1 and Family life. It seems that the language of the letters in these two categories is quite close:

topic5_mallet_16T

The above-mentioned categories cluster quite nicely. Another observation is that the green nodes for the categories Easter Rising and Irish question are all over the place, and it is questionable whether these are useful categories. The remaining categories are not used much at the moment, and they are not really visible. However, they could become more important as the data set grows.

The visualisation of the Gensim topics is not so clear at first glance, because there are many more edges. But a similar red, blue and yellow clustering can be observed. One issue with the Gensim algorithm, however, was that it responded much more strongly to address information in the letters, and this influences the topic modelling process. This can be observed when looking at the generated topics, the clustering of the letters and the transcriptions of the individual letters. Address information is currently part of the transcription. The plan for the future is to encode the letters in TEI. Once they are TEI encoded, stripping out address information, salutations, etc. will be easier, and much clearer topics can be generated.
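A letter-to-topic graph like the ones above can be fed to Gephi as a plain edge table. The following is a minimal sketch of that step. It assumes the topic model output has already been read into a dictionary of per-letter topic weights, and the function names, the `threshold` parameter and the sample weights are all invented for illustration. The post does not say how the original graphs were built, so this is not a description of that process.

```python
import csv
import io


def topic_edges(doc_topics, threshold=0.1):
    """Yield (letter, topic_label, weight) edges above a weight threshold."""
    for letter_id, weights in doc_topics.items():
        for topic_id, weight in sorted(weights.items()):
            if weight >= threshold:
                yield letter_id, f"Topic {topic_id}", weight


def write_edge_csv(doc_topics, handle, threshold=0.1):
    """Write an edge table with Source, Target and Weight columns."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for edge in topic_edges(doc_topics, threshold):
        writer.writerow(edge)


# Invented weights for two letters, for illustration only.
doc_topics = {
    "1000.0.txt": {0: 0.85, 5: 0.12, 7: 0.03},
    "1004.0.txt": {5: 0.95, 7: 0.05},
}
buffer = io.StringIO()
write_edge_csv(doc_topics, buffer)
print(buffer.getvalue())
```

Raising the threshold drops weak letter-to-topic links, which is one way to thin out a graph that has too many edges to read.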

 

Topics of the 1916 Letters

I recently generated topics from the 1916 Letters project data using two different topic modelling tools: Mallet, a topic modelling program written in Java, and a script I wrote based on the Python topic modelling library Gensim. Mallet uses its own implementation of LDA. Gensim also has its own implementation of LDA, but additionally allows transformation to other models and has wrappers for other implementations. For instance, there is also a Mallet wrapper (since version 0.9.0), but I could not get it to work. Anyway, the point is that the standard Gensim implementation of LDA differs from Mallet's, and when I ran Gensim and Mallet on the 1916 Letters data I got different results. At first sight the computer-generated topics did not make much sense to me, but when I clustered the letters according to their relationships to the topics I found that similar letters would cluster together. This suggests that both Gensim and Mallet worked.

Here is a first attempt to generate 16 topics. I chose the number 16 because at the moment, when people upload their letters to the Letters of 1916 website, they have to assign one of 16 predefined topics to their letter. Topics are, for instance: World War 1, Family life, Art and literature, etc. One of the research questions I am working on is whether the human-assigned topics and the computer-generated topics differ.

Here is my first Gensim and Mallet topic output:

Gensim_Mallet_16_topics

 

Stemming words in the transcriptions

In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested was to use a stemmer.

A stemmer is a program that reduces a word to its word stem or base form. For instance, in English a stemmer would remove suffix endings such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’, ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. A problem, however, is that the stemmer can sometimes be too radical and change ‘july’ to ‘juli’, ‘county’ to ‘counti’, or ‘enclose’ to ‘enclos’. This does not affect the analysis, but when presenting the results it might be worth checking the spelling.

I implemented a stemmer from nltk.stem and saved a list of the original word and stemmed form to a file. This allowed me to spot stemming issues. Following is my stemming function:


from nltk import stem


def stemmer(wordlst):
    # Pair every word with its Porter stem so stemming issues are easy to spot.
    st = stem.PorterStemmer()
    stem_words = []
    for w in wordlst:
        stem_words.append((w, st.stem(w)))
    return stem_words
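Because stems such as ‘counti’ are not pleasant to read in a result table, the (word, stem) pairs can also be used to map each stem back to a readable form. This is only a sketch of one possible approach, not something that is part of my module. The helper name `display_forms` is made up, the pairs are written by hand in the shape that `stemmer()` returns, and picking the most frequent original word is an assumption about what makes a good label.

```python
from collections import Counter, defaultdict


def display_forms(pairs):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for word, stemmed in pairs:
        forms[stemmed][word] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


# Hand-written (word, stem) pairs in the shape returned by stemmer().
pairs = [
    ("county", "counti"), ("county", "counti"), ("counties", "counti"),
    ("enclose", "enclos"), ("enclosed", "enclos"), ("enclosed", "enclos"),
]
print(display_forms(pairs))
```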

Spell checking with PyEnchant

PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and correct words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.

With regard to spelling, there are several issues that could have a negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters may contain wrong spellings that the transcriber did not correct. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse educational backgrounds and from different countries. Naturally, in some cases the spelling will differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.

My spell check function is included in the cleaner module and looks something like this at the moment:


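import os
import tempfile

import enchant


def spell_checking(wordlst):
    # Lower-case the personal word list (PWL) into a temporary file.
    # Writing the temporary file is an assumed reconstruction of a missing step.
    with open(SPELL_CHECK_PWL, "r") as f:
        all_pwl = f.read().lower()
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(all_pwl)
        temp_pwl_file = tmp.name
    d = enchant.DictWithPWL("en_US", temp_pwl_file)
    err = []
    for w in wordlst:
        if not d.check(w):
            try:
                first_sug = d.suggest(w)[0]
                # Skip words whose only difference is capitalisation.
                if w != first_sug.lower():
                    err.append((w, first_sug))
            except IndexError:
                # The dictionary has no suggestion for this word.
                err.append((w, None))
    os.remove(temp_pwl_file)
    return err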

The result will be a file that contains a list of suspected spelling errors and a guess at a correction for each. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker flags a word as wrong that is actually correct and that I do not want corrected.

A sample from the result file looks something like this:

1000.0.txt:
barrington:Harrington
oct:cot
preists:priests
glendalough:Glendale
glenlough:unploughed
irelands:ire lands

1004.0.txt:
clonbrook:cloakroom

1006.0.txt:
organisation:organization
belfort:Belfast
hanly:manly
chau:char
organisation:organization
wallpole:wall pole
especally:especially
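A report in the format shown above is easy to produce from the per-file error lists and to read back in after a human has reviewed it. The following is a minimal sketch: the function names are made up for illustration, and the parser assumes that every block starts with a file name line and that blocks are separated by a blank line, as in the sample.

```python
def format_report(results):
    """Render {filename: [(word, suggestion), ...]} as a plain-text report."""
    blocks = []
    for filename, errors in results.items():
        lines = [f"{filename}:"]
        lines += [f"{word}:{suggestion or ''}" for word, suggestion in errors]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def parse_report(text):
    """Read a report back into {filename: {word: suggestion or None}}."""
    results, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current file block
        elif current is None:
            current = results.setdefault(line.rstrip(":"), {})
        else:
            word, _, suggestion = line.partition(":")
            current[word] = suggestion or None
    return results


# "xyzzy" is an invented word that stands for a case without a suggestion.
sample = {
    "1000.0.txt": [("preists", "priests"), ("oct", "cot")],
    "1004.0.txt": [("clonbrook", "cloakroom"), ("xyzzy", None)],
}
report = format_report(sample)
print(report)
print(parse_report(report))
```

Entries such as ‘oct:cot’ or ‘clonbrook:cloakroom’ show why the report needs a human reviewer: the flagged words should go into the PWL rather than be replaced.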

Cleaning a messy corpus

Working with the 1916 data I found (as people with experience have always told me) that cleaning your data is an essential step. It may even be the most important step. Inconsistent, messy, and faulty data lead to problems and wrong results in the analysis and interpretation stages of your research.

With regard to the 1916 letters, wrong spelling, inconsistent markup and comments in the text, and inconsistent metadata are all sources of error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke, a former MPhil student here at TCD, recently did a topic modelling project, and talking to her and reading her MPhil thesis was very helpful. Furthermore, I found the O’Reilly Bad Data Handbook an interesting read.
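To give an idea of what a first cleaning pass can look like, here is a minimal sketch that strips markup and transcriber comments before tokenising. It rests on assumptions that may not match the real transcriptions: that markup appears as angle-bracket tags and that comments are written in square brackets. The function name and the sample sentence are invented for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")             # assumed: angle-bracket markup
COMMENT_RE = re.compile(r"\[[^\]]*\]")      # assumed: comments in square brackets
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # words, optionally with an apostrophe


def clean_text(raw):
    """Strip markup and comments, lower-case, and return a list of words."""
    text = TAG_RE.sub(" ", raw)
    text = COMMENT_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


# Invented sample sentence, not taken from the corpus.
sample = "Dear <hi>Mother</hi>, I can't write much [illegible] to-day."
print(clean_text(sample))
```

The resulting word list is the kind of input that the spell checking and stemming steps expect.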