Tag Archives: XML

Getting Rid of XML markup with Regular Expressions

After I had found that the crowdsourced XML markup was not very helpful and even made over 10% of my 850 letters fail to parse with the lxml library, I experimented with regular expressions. I find the Python re library fairly easy to use and very handy. When it comes to regular expressions and Python, there are many, many tutorials online. The documentation on the Python website is (as always) a good starting point, because it gives an overview of the module's functions and how to use them. I also found the Google Developers Tutorial a good read. For a longer introduction with case studies, critical remarks on when not to use regular expressions, and notes on performance issues, see Dive into Python.

For my purposes the following code worked quite well:

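import re
# Matches simple tags such as <p>, </hi>, <hi rend="underline"> and comments <!-- ... -->
pat = r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->"
expr = re.compile(pat)
texts = []
for letter in letters:
    # Split on the markup and rejoin the pieces, keeping only the text
    texts.append(''.join(expr.split(letter)))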
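
To see what the pattern does and does not catch, here is a small self-contained sketch. The function name strip_markup and the sample strings are made up for illustration and are not taken from the Letters of 1916 data; re.sub with an empty replacement is used here as an equivalent of the split-and-join above.

```python
import re

# Same pattern as above, written as a raw string.
MARKUP = re.compile(r"<[/\w\d\s\"\'=]+>|<!--[/\w\d\s\"\'=.,-]+-->")

def strip_markup(letter):
    """Remove simple tags and comments; hypothetical helper for illustration."""
    return MARKUP.sub("", letter)

# Invented sample strings, not real transcriptions.
print(strip_markup('<p>Dear <hi rend="underline">Mary</hi>,</p>'))
print(strip_markup('<p>Dear Sir<p>'))  # not well-formed, but still stripped
print(strip_markup('I remain<!-- unclear, check --> yours<lb/>'))
# The hyphen in the attribute value is not in the first character class,
# so this opening tag is left in place.
print(strip_markup('<date when="1916-04-24">Monday</date>'))
```

The last call shows a limitation of the pattern: tags whose attributes contain characters outside the character class (hyphens, dots, colons) are not removed, so the class may need extending depending on the markup in the corpus.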

XML Processing with Python

Because the Letters of 1916 is a crowdsourced project, the transcriptions of the letters contain irregular XML markup and, in some cases, XML that is not well-formed. At first I thought that I might be able to use the XML markup for my analysis, but the inconsistent quality of the encoding makes this impractical (see also my previous post).

Python has a number of libraries for XML processing (a list that also includes libraries that are not part of the standard library is available here). The most popular ones are xml.etree, xml.dom, and xml.sax, which are all part of the Python standard library. I decided to use the lxml library, which has an API similar to that of xml.etree and was therefore easy enough to use. The library is pretty quick thanks to the underlying C libraries libxml2 and libxslt, and it supports XPath and XSLT, although only in version 1.0.

Thanks to the XPath support, getting all the text out of an XML-encoded document is as easy as:

from lxml import etree

for letter in letters:
    root = etree.fromstring(letter)
    # The result is a list of text nodes that can be combined with " ".join()
    text_lst = root.xpath(".//text()")
    text = " ".join(text_lst)

The problem, however, was that a good deal of what was supposed to be XML or plain text was in reality not well-formed XML (I discussed this in another post). To find out how many of the letters would not parse, I made the following changes:

syntaxErr = 0  # initialise once, before the loop, so the count accumulates
for letter in letters:
    try:
        root = etree.fromstring(letter)
        text_lst = root.xpath(".//text()")
    except etree.XMLSyntaxError:
        syntaxErr += 1
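
The same counting idea can be tried without installing lxml. The following is a minimal sketch that swaps in xml.etree.ElementTree from the standard library, so it uses ET.ParseError and itertext() instead of lxml's XMLSyntaxError and XPath. The function name extract_texts and the sample letters are invented for illustration.

```python
import xml.etree.ElementTree as ET

def extract_texts(letters):
    """Return (texts, failed): the text of every letter that parses and
    the number of letters that are not well-formed. Hypothetical helper."""
    texts = []
    failed = 0  # counter lives outside the loop so it accumulates
    for letter in letters:
        try:
            root = ET.fromstring(letter)
        except ET.ParseError:
            failed += 1
            continue
        # itertext() yields every text node in document order
        texts.append(" ".join(root.itertext()))
    return texts, failed

# Invented samples: the second one has an unclosed <p> element.
sample = [
    "<letter><p>Dear Mary,</p><p>All is well.</p></letter>",
    "<letter><p>Dear Sir<p></letter>",
]
texts, failed = extract_texts(sample)
print(texts)
print(failed, "of", len(sample), "letters did not parse")
```

Dividing the failure count by the number of letters gives the share of transcriptions that cannot be processed with an XML parser at all.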

Because of the numerous syntax errors in the transcriptions, it was not possible to remove the XML markup from the letters with an XML parser. In the end I found regular expressions to be the best solution for this task.