Tag Archives: Python

Summer school Python for text analysis

There are two summer schools on text analysis using Python this year. From 22nd July to 1st August the Joint Culture & Technology and CLARIN-D Summer School takes place in Leipzig. I attended this summer school a few years ago. It was great: many people, a great atmosphere, and Leipzig is a lovely place. Anyway, this year they offer a module on Python for text analysis: Advanced Topics in Humanities Programming with Python.

The second summer school is the DARIAH International Digital Humanities Summer School in Göttingen, from 17th to 30th August. They also offer a module on Python for text analysis. I was there last year and it was great: the instructors were fantastic and we learned a lot. I would definitely recommend it.


Spell checking with PyEnchant

PyEnchant is a Python library for spell checking. As part of my text cleaning process I use PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and "correct" words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.

With regard to spelling, there are several issues that could have a negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters themselves contain misspellings that the transcriber does not correct. Furthermore, the letters were written at the beginning of the twentieth century, by people with very diverse levels of education and from different countries, so in some cases the spelling will naturally differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.

My spell check function is included in the cleaner module and looks something like this at the moment:


import os
import enchant

def spell_checking(wordlst):
    # SPELL_CHECK_PWL is a global constant pointing to the personal word list file.
    # Build a lower-cased temporary PWL from it for the dictionary to use.
    with open(SPELL_CHECK_PWL, "r") as f:
        all_pwl = f.read().lower()
    temp_pwl_file = SPELL_CHECK_PWL + ".tmp"
    with open(temp_pwl_file, "w") as f:
        f.write(all_pwl)
    d = enchant.DictWithPWL("en_US", temp_pwl_file)
    err = []
    for w in wordlst:
        if not d.check(w):
            try:
                first_sug = d.suggest(w)[0]
                if w != first_sug.lower():
                    err.append((w, first_sug))
            except IndexError:
                err.append((w, None))
    os.remove(temp_pwl_file)
    return err

The result is a file that contains, per letter, a list of suspected spelling errors together with a guessed correction for each. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker thinks a word is wrong but it is actually correct and I do not want it changed.
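
To show how a result file like the one below can be produced, here is a minimal usage sketch; the token list, the file names and the output handling are invented for illustration and are not my actual cleaner code:

# Minimal usage sketch (hypothetical token list and file names)
tokens = ["the", "preists", "of", "glendalough"]    # tokens of one transcription
errors = spell_checking(tokens)                     # [(word, first_suggestion), ...]

with open("spell_check_results.txt", "a") as out:
    out.write("1000.0.txt:\n")                      # which transcription the errors belong to
    for word, suggestion in errors:
        out.write("{}:{}\n".format(word, suggestion))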

A sample from the result file looks something like this:

1000.0.txt:
barrington:Harrington
oct:cot
preists:priests
glendalough:Glendale
glenlough:unploughed
irelands:ire lands

1004.0.txt:
clonbrook:cloakroom

1006.0.txt:
organisation:organization
belfort:Belfast
hanly:manly
chau:char
organisation:organization
wallpole:wall pole
especally:especially

tf-idf – Term Frequency Inverse Document Frequency

Term Frequency Inverse Document Frequency, or tf-idf for short, is a way to measure how important a term is in the context of a document or corpus. The importance increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
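
As a rough sketch of the idea, one common formulation is:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents that contain t, and N is the total number of documents in the corpus. As far as I understand, gensim's default weighting is a variant of this (with a base-2 logarithm), and the resulting document vectors are normalised to unit length, which is where the 0.7071 values in the example below come from.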

With gensim, tf-idf can be calculated using the gensim.models.tfidfmodel module:

from gensim import models

tfidf = models.TfidfModel(corpus)   # corpus: an iterable of bag-of-words vectors
doc_bow = [(0, 1), (1, 1)]          # bag-of-words, for instance created by doc2bow
print(tfidf[doc_bow])
# Result: [(0, 0.70710678), (1, 0.70710678)]

This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A "bag-of-words" – a list of tuples of word id and frequency – is used as the corpus, and the TfidfModel class transforms the raw counts into "TfIdf real-valued weights".
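
For context, the corpus in the snippet above needs to be built first. A minimal end-to-end sketch with a tiny made-up corpus could look like this:

from gensim import corpora, models

# two tiny made-up documents, already tokenised
documents = [["dublin", "letter"], ["dublin", "rising"]]

dictionary = corpora.Dictionary(documents)               # maps each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                        # train the tf-idf model on the corpus
print(tfidf[dictionary.doc2bow(["dublin", "letter"])])   # tf-idf weights for one document

With the default settings, "dublin" appears in every document and so its tf-idf weight drops to zero, while the rarer words "letter" and "rising" keep their weight.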

Better Performance – Text Streaming

One way to get better performance when processing a big text corpus is to use streaming. Streaming basically means keeping the data stored in a file and accessing it only when necessary, instead of holding everything in memory.

Recently I looked into gensim, a library for topic modelling with Python, which provides easy ways to save and load text corpora, dictionaries etc. In their tutorial they also suggest creating a corpus object that uses a streaming approach:

class MyCorpus(object):
    def __iter__(self):
        # assume there's one document per line, tokens separated by whitespace
        for line in open('mycorpus.txt'):
            yield dictionary.doc2bow(line.lower().split())

This corpus class reads the lines directly from a text file instead of keeping the whole text in memory; a MyCorpus instance is fairly small, because it holds just a reference to 'mycorpus.txt'. This is very memory efficient.
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 letter is an object that gets pickled and stored, and the TxtCorpus class retrieves these objects, or data stored in them. In the example below the method get_txt() returns the transcriptions:

class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
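
The item_from_pickle() helper is not shown here; presumably it just loads the pickled dictionary of Letter objects back from disk. A minimal sketch of what it might look like (the actual helper in my code may differ):

import pickle

def item_from_pickle(file_name):
    # load the pickled dictionary of Letter objects back from disk
    with open(file_name, "rb") as f:
        return pickle.load(f)

A TxtCorpus instance can then be passed to gensim wherever an iterable of documents is expected, without ever holding all transcriptions in memory at once.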

Topic Modelling with Python: Gensim

One strand of my internship is an investigation into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship, and Python has good libraries for natural language processing and topic modelling. I tested the nltk and gensim toolkits. The nltk is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O'Reilly book available via the NLTK website.

The gensim library advertises itself as 'topic modelling for humans', so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip, and it is built on NumPy and SciPy, which have to be installed in order to use it.

Topic Modelling Tools

I had a look at a number of topic modelling tools. The first was Mallet, a tool frequently used for topic modelling. For instance, my colleague Emma Clarke (TCD, now NUIM) used Mallet to extract topics from the 19th-century transactions of the Royal Irish Academy (on JSTOR). Her related blog entry is available here. For a detailed description of how to set up and use Mallet I recommend the blog post on The Programming Historian.

Another tool that is quite popular for topic modelling in DH is the Topic Modelling Tool (TMT); its use and examples are described by Miriam on her DH blog.

After searching the internet for a while I also found a Python module, gensim, which claims to offer "topic modelling for humans". It is not as easy to use as the tools mentioned above, but there is a detailed tutorial on its website, its developer Radim has answered questions in a number of online forums, Google Groups etc., and the API is very well documented. At a later stage I will use Mallet in order to compare the results I get from gensim with another topic modelling tool.

Another Performance Test

Out of personal interest I ran another performance test. This time I targeted a more complex function that imports a huge table of transcriptions and metadata about all the letters from an Excel file. Importing from Excel is very easy thanks to Python's xlrd module. Each row in the Excel file contains the transcription of one letter page plus the metadata of the letter, which means that if a letter has more than one page its metadata is duplicated. My function loops over the rows in the Excel table and creates Letter objects (my custom class to represent letters). It merges multiple pages of the same letter into one Letter object, tokenises the transcriptions, and cleans the text of XML-like markup and punctuation. The final step is to store all the Letter objects in a file using Python's shelve module, so that the objects do not have to stay in memory.

My approach for the storage is to save all the Letter objects in a dictionary, with the letter IDs as keys, and then push the whole dictionary to the shelve file. This worked quite well and was not too slow (considering the number of letters and what the function has to do).
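
A stripped-down sketch of that approach is below; the column layout, the Letter class interface (add_page), the function and file names are placeholders rather than my actual code, and the cleaning and tokenising steps are left out:

import shelve
import xlrd

def import_letters(file_path):
    # Open the Excel workbook; encoding_override works around the cp1252 decoding issue
    book = xlrd.open_workbook(filename=file_path, encoding_override="utf-8")
    sheet = book.sheet_by_index(0)

    letters = {}
    for row_idx in range(1, sheet.nrows):            # skip the header row
        row = sheet.row_values(row_idx)
        letter_id = str(row[0])                      # placeholder column layout
        transcription = row[1]
        if letter_id not in letters:
            letters[letter_id] = Letter(letter_id)   # Letter: my custom class
        letters[letter_id].add_page(transcription)   # merge pages of the same letter

    # Push the whole dictionary to the shelve file in one go, letter ids as keys
    d = shelve.open("letters.shelve")
    d.update(letters)
    d.close()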

[cProfile output: performance_3]

The function took roughly 19–21 seconds (based on three tests) to run over about 2200 entries in the Excel file, create 850 Letter objects, add the letters to a dictionary, clean the transcriptions and merge the ones belonging to the same letter, and finally store everything in a shelve file. This is not too bad, considering that most of the time (13.3 seconds) was spent in xlrd.open_workbook(filename=file_path, encoding_override="utf-8"), which retrieves the data from the Excel file and creates a handy-to-use workbook object. Python's cp1252.py decode also took quite long. Furthermore, a fair amount of time was spent creating the Letter objects (1.6 seconds) and cleaning and adding the transcriptions (2.6 seconds). There might be some opportunity for optimisation here – I might come back and revise this function later in the project.
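
The timings come from running the import function under cProfile; a minimal sketch of such a profiling run, with placeholder names for the function and the file, looks like this:

import cProfile

# Profile the Excel import; import_letters and the file name are placeholders
cProfile.run('import_letters("1916_letters.xls")', sort="cumtime")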

The shelve file now contains a dictionary with all the Letter objects. Out of curiosity I wrote a second function that does all the same steps as the first one, but instead of building a dictionary and then storing it, each Letter object is stored in the shelve file as soon as it is created – the shelve file is, after all, dictionary-like. I assumed that it might take a bit longer, and was surprised by how long it actually took.

[cProfile output: performance_4]

Nearly 60 seconds! That is three times as long as the other function. After having a look at the statistics returned by cProfile I found that most of the time was spent calling the keys() method on the shelve file in order to retrieve the keys (= letter IDs) that were already stored. I am not sure why this happens. Is it because keys() returns a list instead of a faster data structure, like a set? Still, 40 extra seconds is very significant. Could it have something to do with DeadlockWrap?


import shelve

d = shelve.open("letters.shelve")

# Then follows the for-loop that gets the data from the Excel file and prepares it for storage

if letter_id not in d.keys():   # this is the interesting part: if the letter id is not already used
    d[letter_id] = l            # as a key in the shelve file, add the new letter object

d.sync()    # the shelve file has to be synced to ensure the object is written to the file

d.close()   # and closed after the for-loop

Interestingly, replacing d.keys() with a set 's' that stores the keys of all stored letter objects – s.add(letter_id) – cut the function's runtime to less than half, even though every object is still written to the shelve file directly:

[cProfile output: performance_5]

It is still not as fast as my first solution, and probably never will be, because of the numerous calls to the shelve file and the constant syncing. It was interesting, however, to see how a small change like replacing d.keys() with the set 's' can have such a dramatic effect.
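
For completeness, a minimal sketch of the faster variant; as above, the for-loop over the Excel rows is omitted and the names are placeholders:

import shelve

d = shelve.open("letters.shelve")
s = set()                        # keeps track of the letter ids already stored

# ... for-loop over the Excel rows, creating a Letter object l with id letter_id ...

if letter_id not in s:           # membership test on the set instead of d.keys()
    d[letter_id] = l             # store the new letter object in the shelve file
    s.add(letter_id)             # remember its id

d.sync()
d.close()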