In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested to use a stemmer.
A stemmer is a program that reduced a word to its word stem or base form. For instance, in English a stemmer would remove suffix endings such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’, ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. A problem is however that the stemmer can be sometimes too radical and change ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not effect the analysis, but when presenting the results it might be worth to check the correct spelling.
I implemented a stemmer from nltk.stem and saved a list of the original word and stemmed form to a file. This allowed me to spot stemming issues. Following is my stemming function:
st = stem.PorterStemmer()
stem_words = 
for w in wordlst:
PyEnchant is a Python library for spell checking. As part of my text cleaning process I employ PyEnchant to automate the normalisation of words in my the 1916 Letters corpus. The cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and correct words that were right in the first place. Therefore, a human-supervised, semi-automated normalisation process is probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant it is a very useful tool.
In regards to spelling there are several issues that could have negative influence on the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can happen during the transcription process, or the source letters contain wrong spelling and it is not corrected by the transcriber. Furthermore, the letters were written at the beginning of the twentieth century and written by people with very diverse education and from different countries. Naturally, in some cases the spelling will differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.
My spell check function is included into the cleaner module and looks something like this at the moment:
with open(SPELL_CHECK_PWL, "r") as f:
all_pwl = f.read().lower()
d = enchant.DictWithPWL("en_US", temp_pwl_file)
err = 
for w in wordlst:
if not d.check(w):
first_sug = d.suggest(w)
if w != first_sug.lower():
The result will be a file that contains a list of suggested spelling errors and a guess for a solution. The global variable SPELL_CHECK_PWL refers to a personal word list file. I add a word to the PWL every time the spell checker thinks a word wrong, but it is actually correct and I do not want it corrected.
A sample form the result file looks something like this:
Working with the 1916 data I found (what people with experience have always told me) that cleaning of your data is an essential step. It could be even the most important step. Inconsistent, messy, and fault leads to problems and wrong results in the analysis and interpretation stages of your research.
In regards to the 1916 letters wrong spelling, inconsistent markup and comments in the text, inconsistent metadata are all sources for error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke a former Mphil student here in TCD did recently a topic modelling project and talking to her and reading her Mphil thesis was very helpful. Furthermore,I found the O’Reilly Bad Data Handbook an interesting read.