
High Performance Text Processing: An example

It is absolutely amazing how much a program's performance can be optimised. Or rather, how much slower a badly written function can be.
My first approach to creating a list of clean word tokens and stripping punctuation characters, whitespace and TEI markup resulted in a function that worked fine in my tests and returned the results I wanted. In my unit tests the function only had to process small strings, but when I tried it on the over 850 letters in the 1916 Letters corpus it took about 6 to 8 seconds to run (across several attempts).
[Screenshot performance_1: run time of the first version]
This is only one function, and in anticipation that the letter corpus will grow over the next years, 8 seconds is too much.
My first approach to deleting punctuation, spaces and markup was to loop over a list that contained all the characters I did not want. I split the text on whitespace and looped over each item in the resulting list of words, removing leading and trailing punctuation and spaces. To identify XML-like markup I used regular expressions (Python's re module). It worked okay, but as I said before, it was quite slow and the function was about 30 lines long!
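For comparison, a rough sketch of what such a loop-based approach can look like (the names and details here are illustrative, not my original 30-line function):

import string

def strip_punct_loop(strg):
    # illustrative sketch of the slow approach: peel unwanted characters
    # off both ends of every whitespace-separated token, one at a time
    unwanted = set(string.punctuation + string.whitespace)
    lst_clean = []
    for item in strg.split():
        while item and item[0] in unwanted:
            item = item[1:]
        while item and item[-1] in unwanted:
            item = item[:-1]
        if item:
            lst_clean.append(item.lower())
    return lst_clean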
When I started looking for a better and faster solution to my problem, I found out that pre-compiled regular expressions in Python are pretty fast, because the re module is implemented in C, and they also make the code shorter.

import re

def strip_punct_regex(strg):
    lst_words = strg.split()
    # raw string, with the hyphen escaped so it matches a literal "-"
    # instead of defining a character range
    pat = r"\W*(\w+[\w'\-/.]*\w+|\w|&)\W*"
    regex = re.compile(pat)
    lst_clean = []
    for item in lst_words:
        mm = regex.match(item)
        if mm:
            lst_clean.append(mm.group(1).lower())
    return lst_clean
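
Called on a short test string (invented here for illustration), the function returns the clean, lower-cased tokens:

>>> strip_punct_regex('"Dear Mary," he wrote -- what a day!')
['dear', 'mary', 'he', 'wrote', 'what', 'a', 'day']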

From about 30 lines of code, two for loops and several if-else statements, thanks to regular expressions I came down to 10 lines, one for loop and one if statement. Although extensive use of regular expressions is sometimes discouraged, and there can be performance issues as well (see Dive Into Python), in my case I found that they made my code much simpler and also quicker.
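To compare the two versions, a rough timing sketch with the timeit module (assuming the two functions above are defined and letters holds the letter texts) could look like this:

import timeit

# run each version once over the whole corpus and print the seconds taken
setup = "from __main__ import strip_punct_loop, strip_punct_regex, letters"
print(timeit.timeit("[strip_punct_loop(l) for l in letters]", setup=setup, number=1))
print(timeit.timeit("[strip_punct_regex(l) for l in letters]", setup=setup, number=1))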
[Screenshot performance_2: run time of the regex version]
Online resources:
On performance with regular expressions see also: Python – Performance Tests of Regular Expressions
On the Python re module: the Python documentation; many examples can also be found on the website Python Module of the Week
YouTube video on High Performance Text Processing


Getting Rid of XML markup with Regular Expressions

After I had found that the crowdsourced XML markup was not very helpful and even prevented over 10% of my 850 letters from parsing with the lxml library, I experimented with regular expressions. I find the Python re library fairly easy to use and very handy. When it comes to regular expressions and Python there are many, many tutorials online. The documentation on the Python website is (as always) a good starting point, because it gives an overview of the module's functions and how to use them. I also found the Google Developers tutorial a good read. For a longer introduction with case studies, and also critical remarks on when not to use regular expressions and on performance issues, see Dive Into Python.
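A simple way to find the letters that fail to parse is to try each one and collect the errors. A minimal sketch, assuming the letters are available as XML strings in a list called letters:

from lxml import etree

# count the letters whose markup lxml cannot parse
broken = []
for i, letter in enumerate(letters):
    try:
        etree.fromstring(letter)
    except etree.XMLSyntaxError:
        broken.append(i)
print(len(broken), "of", len(letters), "letters failed to parse")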

For my purposes the following code worked quite well:

pat = "<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)
for letter in letters:
    ''.join(expr.split(letter))
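
On an invented sample string the pattern strips both the tags and the comments:

>>> sample = '<p rend="indent">Dear Mary,</p><!-- transcribed 1916 -->'
>>> ''.join(expr.split(sample))
'Dear Mary,'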