Getting Rid of XML markup with Regular Expressions

After I found that the crowdsourced XML markup was not very helpful, and that it even kept over 10% of my 850 letters from parsing with the lxml library, I experimented with regular expressions. I find the Python re library fairly easy to use and very handy. There are many, many tutorials online on regular expressions in Python. The documentation on the Python website is (as always) a good starting point, because it gives an overview of the module's functions and how to use them. I also found the Google Developers tutorial a good read. For a longer introduction with case studies, critical remarks on when not to use regular expressions, and notes on performance issues, see Dive into Python.

For my purposes the following code worked quite well:

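import re
pat = r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)
cleaned = []
for letter in letters:
    # split on every tag or comment and glue the remaining text back together
    cleaned.append(''.join(expr.split(letter)))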
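To see what the pattern catches and what it misses, here is a minimal sketch run on two made-up strings. The sample text, the name `TAG_OR_COMMENT`, and the helper `strip_markup` are my own illustrations and not part of the real corpus or the code above. I use `sub` here instead of split and join; for a pattern without capturing groups the two should give the same result.

```python
import re

# Same pattern as in the post, written as a raw string
TAG_OR_COMMENT = re.compile(r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->")

def strip_markup(text):
    """Remove every substring matched by TAG_OR_COMMENT from text."""
    return TAG_OR_COMMENT.sub("", text)

# Hypothetical sample letter, not taken from the real corpus
sample = '<p>Dear <hi rend="italic">Anna</hi>,<!-- checked, ok --> thank you.</p>'
cleaned = strip_markup(sample)
print(cleaned)

# An attribute value with characters outside the class (here ":" and ".")
# keeps the opening tag from matching, so only the closing tag is removed
partly_cleaned = strip_markup('<ref target="http://example.org">link</ref>')
print(partly_cleaned)
```

The second string shows the main limitation: the character class only allows word characters, whitespace, quotes, `/` and `=` inside a tag, so tags with URLs, colons, or hyphens in their attributes are left in the text. That was good enough for my letters, but it is worth checking the output for leftover `<` characters.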