High Performance Text Processing: An example

It is absolutely amazing how much a program's performance can be optimised. Or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens, stripped of punctuation characters, whitespace and TEI markup, resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings; when I tried it on the more than 850 letters in the 1916 Letters corpus, it took about 6 to 8 seconds to run (measured over several attempts).
This is only one function, and since I expect the letter corpus to grow over the next few years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained all the characters I did not want. I split the text on whitespace and looped over each item in the resulting list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (the Python re module). It worked okay but, as I said before, it was quite slow, and the function was about 30 lines long!
When I started looking for a better and faster solution to my problem, I found out that precompiled regular expressions in Python are pretty fast, because the matching is done by a C-based library, and they also make the code shorter.
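To give an idea of what such a loop-based cleanup looks like, here is a rough sketch. It is not the original 30-line function: the name strip_punct_loop, the UNWANTED list and the tag pattern are all assumptions made for illustration.

```python
import re
import string

# Hypothetical list of characters to peel off the ends of each word.
UNWANTED = list(string.punctuation + string.whitespace)


def strip_punct_loop(text):
    """Loop-based cleanup: drop XML-like tags, then trim each word by hand."""
    # Replace anything that looks like a tag, e.g. <p> or </hi>, with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    clean = []
    for word in text.split():
        changed = True
        # Keep trimming until neither end is an unwanted character.
        while changed and word:
            changed = False
            for char in UNWANTED:
                if word.startswith(char):
                    word = word[1:]
                    changed = True
                if word.endswith(char):
                    word = word[:-1]
                    changed = True
        # Items that consisted only of punctuation end up empty and are skipped.
        if word:
            clean.append(word)
    return clean
```

Every word is compared against every unwanted character, possibly several times, and all of that work happens in Python-level loops. That is the kind of overhead that adds up over a whole corpus.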

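import re

def strip_punct_regex(strg):
    # Split on whitespace, then keep only the word-like core of each item.
    lst_words = strg.split()
    # Group 1 is the token; leading and trailing non-word characters are dropped.
    pat = r"[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*"
    regex = re.compile(pat)
    lst_clean = []
    for item in lst_words:
        mm = regex.match(item)
        if mm:
            lst_clean.append(mm.group(1))
    return lst_clean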
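
To see what the pattern in strip_punct_regex captures, it can help to try it on a few single items. The sketch below uses made-up sample strings, and the helper name core is an assumption for illustration only. One detail worth knowing: inside the character class, `\'-/` is read as a range from `'` to `/`, so besides the apostrophe, hyphen, dot and slash it also admits `(`, `)`, `*`, `+` and `,` in the middle of a token.

```python
import re

# The pattern from the function above, written as a raw string.
regex = re.compile(r"[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*")


def core(item):
    """Return the captured token for one whitespace-separated item, or None."""
    mm = regex.match(item)
    return mm.group(1) if mm else None


# Made-up sample items, not taken from the letters.
for item in ['"Hello,', "don't", "(well-known)", "&", "...", "a,b"]:
    print(repr(item), "->", repr(core(item)))
```

Items made up entirely of punctuation, such as "...", produce no match and are therefore dropped, while "a,b" is kept as one token because the comma falls inside the range described above.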

From about 30 lines of code, two for loops and several if-else statements, regular expressions brought me down to 10 lines, one for loop and one if statement. The extensive use of regular expressions is not generally recommended, and they can have performance issues of their own (see Dive into Python). In my case, however, I found that they made my code much simpler and also quicker.
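Whether a regular expression is quicker in a given case is best checked by measuring. A minimal sketch with the standard timeit module follows, assuming a made-up stand-in text rather than the 1916 Letters corpus. The names tokens_compiled and tokens_uncompiled are hypothetical. It compares a pattern compiled once with calling re.match on the pattern string for every item.

```python
import re
import timeit

PATTERN = r"[\W]*(\w+[\w\'-/.]*\w+|\w|&)[\W]*"
COMPILED = re.compile(PATTERN)

# Made-up stand-in text; not the 1916 Letters corpus.
TEXT = 'Dear "Mary", I hope you are well-rested & happy. ' * 2000


def tokens_compiled(text):
    """Match each whitespace-separated item with the precompiled pattern."""
    return [m.group(1) for m in map(COMPILED.match, text.split()) if m]


def tokens_uncompiled(text):
    """Same result, but passing the pattern string to re.match every time."""
    result = []
    for item in text.split():
        m = re.match(PATTERN, item)
        if m:
            result.append(m.group(1))
    return result


if __name__ == "__main__":
    for func in (tokens_compiled, tokens_uncompiled):
        seconds = timeit.timeit(lambda: func(TEXT), number=5)
        print(f"{func.__name__}: {seconds:.3f}s for 5 runs")
```

The absolute numbers depend on the machine and the Python version, so only the relative difference between the two functions is meaningful.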
Online resources:
On performance with regular expressions, see also: Python – Performance Tests of Regular Expressions
On the Python re module: the Python documentation; many examples are also on the website Python Module of the Week
YouTube video on High Performance Text Processing

