Getting Rid of XML Markup with Regular Expressions

After I found that the crowdsourced XML markup was not very helpful, and that it even kept over 10% of my 850 letters from parsing with the lxml library, I experimented with regular expressions. I find the Python re library fairly easy to use and very handy. There are many, many tutorials online on regular expressions in Python. The documentation on the Python website is (as always) a good starting point, because it gives an overview of the module's functions and how to use them. I also found the Google Developers tutorial a good read. For a longer introduction with case studies, critical remarks on when not to use regular expressions, and notes on performance issues, see Dive into Python.

For my purposes the following code worked quite well:

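import re

# Match simple tags such as <p>, </p>, <p n="1"> and simple comments <!-- ... -->
pat = r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)
# letters is the list of raw letter strings loaded earlier.
# Splitting on the markup and rejoining the pieces leaves only the text.
cleaned = [''.join(expr.split(letter)) for letter in letters]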
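
To see what the pattern does and does not catch, here is a minimal sketch on two made-up strings (both samples are invented for illustration and are not taken from my letters). It also shows that `expr.sub('', text)` gives the same result as the split-and-join approach, which may read a little more directly:

```python
import re

# Same pattern as above: simple tags and simple comments only
pat = r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)

# Hypothetical sample letter, invented for this example
sample = '<letter id="42"><p>Dear Sir,</p><!-- checked, ok --><p>Yours truly</p></letter>'

via_split = ''.join(expr.split(sample))  # approach used above
via_sub = expr.sub('', sample)           # equivalent: replace every match with ''

print(via_split)  # Dear Sir,Yours truly
print(via_sub)    # Dear Sir,Yours truly

# Limitation: ':' and '.' are not in the tag character class,
# so a tag with a URL in an attribute value is left untouched.
tricky = '<a href="http://example.org">link</a>'
print(expr.sub('', tricky))  # <a href="http://example.org">link
```

Because the character classes are deliberately narrow, tags whose attributes contain other punctuation survive the cleanup, so it is worth searching the cleaned text for leftover `<` characters afterwards.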