Named Entity Recognition (NER) is the task of identifying and tagging entities such as person names, company names, place names, days, etc. in unstructured text. Unstructured text is, for instance, a plain text file, such as the letters I am working on once the XML markup has been removed. As discussed in previous posts, the XML markup added through crowdsourcing was inconsistent and in most cases did not parse anyway.
NER is relevant for my project because it allows me to identify names and, if necessary, to build up a stopword list of names to be stripped in a pre-processing stage. One issue with my letter corpus is that each transcription starts with address information. Furthermore, a personal name like ‘Peter’ provides me with little useful information about a letter’s content.
Another problem is that, at this stage, a large part of the corpus consists of letters to and from Lady Clonbrock of Galway (Augusta Caroline Dillon, wife of Luke Gerald Dillon); for Lady Clonbrock’s correspondence with soldiers in WW1 see this article. Initial tests have already shown that some generated topics are based on names rather than content words, and the high frequency of names (due to address headers etc.) makes interpretation of the topics difficult.
The importance of similar pre-processing for a corpus of 19th-century literary texts was described by Matthew Jockers and David Mimno in ‘Significant Themes in 19th-Century Literature’.
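The name-stripping step described above could be sketched roughly as follows. This is only an illustration, not the code I actually ran: the function names build_name_stopwords and strip_names are made up for this example, the sample sentence is invented, and I am assuming the tagger output is a list of (token, label) pairs with labels such as PERSON, LOCATION and ORGANIZATION, where O marks a non-entity token.

```python
def build_name_stopwords(tagged_tokens,
                         name_labels=("PERSON", "LOCATION", "ORGANIZATION")):
    """Collect the lower-cased tokens that were tagged as names.

    tagged_tokens is assumed to be a list of (token, label) pairs,
    where the label "O" means "not an entity".
    """
    return {token.lower() for token, label in tagged_tokens
            if label in name_labels}


def strip_names(tokens, stopwords):
    """Return the tokens with every stopword (case-insensitive) removed."""
    return [token for token in tokens if token.lower() not in stopwords]


# Hypothetical tagger output for one invented sentence.
tagged = [("Dear", "O"), ("Peter", "PERSON"), (",", "O"), ("the", "O"),
          ("regiment", "O"), ("left", "O"), ("Galway", "LOCATION"),
          ("yesterday", "O")]

name_stopwords = build_name_stopwords(tagged)
tokens = [token for token, _ in tagged]
cleaned = strip_names(tokens, name_stopwords)
print(cleaned)
```

Collecting the names into a set first, rather than dropping tagged tokens directly, means a name that was recognised in one letter is also removed from letters where the tagger missed it. The downside is that a word which is a name in one place and an ordinary word elsewhere gets removed everywhere.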
Like Jockers and Mimno, I am using the Stanford NLP software. It is a Java-based suite that includes different tools for Natural Language Processing. A demo of the NER tagger can be found here.
I found the tool very user-friendly, and there is a lot of documentation online. There are also several interfaces to other programming languages available; I used the NLTK interface. The setup was straightforward. Instructions can be found on the Stanford NLP website, or alternatively on this blog. I just had to download the software and a model file, and point NLTK to my Java Development Kit. This is done in the internals.py file in the NLTK module. On line 72 I simply added the path to def config_java():
def config_java(bin="C:/Program Files/Java/jdk1.8.0_05/bin/java.exe", options=None, verbose=True):
One issue that kept me occupied for a while was a ‘Java command failed!’ error. Eventually I found that the problem was that config_java was pointing to an older version of the JDK (1.7).
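A version mismatch like this is easy to overlook, so it may help to check which Java version a given java executable reports before handing its path to NLTK. The following is a minimal sketch under a few assumptions: the helper names parse_java_version and java_version are my own, and treating (1, 8) as the minimum is based only on my experience above, not on any documented requirement.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Extract (major, minor) from the text printed by `java -version`.

    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)


def java_version(java_bin):
    """Run the given java executable and return its (major, minor) version."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run([java_bin, "-version"],
                            capture_output=True, text=True)
    return parse_java_version(result.stderr)


# Assumed minimum, based only on the 1.7 vs. 1.8 problem described above.
MINIMUM = (1, 8)

sample = 'java version "1.7.0_55"'
print(parse_java_version(sample), parse_java_version(sample) >= MINIMUM)
```

In practice one would call java_version with the same path that is passed to config_java and compare the result against the minimum before starting the tagger.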