PyEnchant is a Python library for spell checking. As part of my text cleaning process I use PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and "correct" words that were right in the first place. A human-supervised, semi-automated normalisation process is therefore probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.
With regard to spelling, there are several issues that could negatively influence the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can be introduced during transcription, or the source letters may contain misspellings that the transcriber does not correct. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse educational backgrounds and from different countries, so in some cases the spelling will naturally differ. An automated spell checker is a useful tool to ensure some consistency within the collected transcriptions.
My spell check function is included in the cleaner module and currently looks something like this:
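    import os
    import enchant

    def spell_checking(wordlst):
        """Return (word, first suggestion) pairs for words the checker rejects."""
        # Load the personal word list (PWL) and lower-case it.
        with open(SPELL_CHECK_PWL, "r") as f:
            all_pwl = f.read().lower()
        # Assumed step: write the lower-cased list to temp_pwl_file, a scratch
        # path defined elsewhere in the module, so Enchant can read it.
        with open(temp_pwl_file, "w") as f:
            f.write(all_pwl)
        d = enchant.DictWithPWL("en_US", temp_pwl_file)
        err = []
        for w in wordlst:
            if not d.check(w):
                try:
                    first_sug = d.suggest(w)[0]
                    # Skip suggestions that differ only in capitalisation.
                    if w != first_sug.lower():
                        err.append((w, first_sug))
                except IndexError:
                    # Enchant had no suggestion for this word.
                    err.append((w, None))
        os.remove(temp_pwl_file)
        return err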
The result is a file that lists the suspected spelling errors, each with a guess at the correct spelling. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker flags a word that is actually correct and that I do not want corrected.
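Adding a word to the personal word list can itself be scripted. The following is only a minimal sketch: it assumes the PWL is a plain text file with one word per line, and the helper name `add_to_pwl` is hypothetical and not part of my cleaner module. It compares words case-insensitively, since the spell check function lower-cases the list anyway.

```python
import os


def add_to_pwl(pwl_path, word):
    """Append `word` to the word list at `pwl_path` unless it is already listed.

    Returns True if the word was added, False if it was already present.
    Assumption: the word list is a plain text file with one word per line.
    """
    word = word.strip()
    content = ""
    if os.path.exists(pwl_path):
        with open(pwl_path, "r", encoding="utf-8") as f:
            content = f.read()
    # Compare case-insensitively so "Dublin" and "dublin" count as one entry.
    existing = {line.strip().lower() for line in content.splitlines() if line.strip()}
    if word.lower() in existing:
        return False
    with open(pwl_path, "a", encoding="utf-8") as f:
        # Make sure the new word starts on its own line.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(word + "\n")
    return True
```

Calling `add_to_pwl(SPELL_CHECK_PWL, "glendalough")` after reviewing the result file would keep that place name from being flagged on the next run.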
A sample from the result file looks something like this:
    1000.0.txt:
    barrington:Harrington
    oct:cot
    preists:priests
    glendalough:Glendale
    glenlough:unploughed
    irelands:ire lands
    1004.0.txt:
    clonbrook:cloakroom
    1006.0.txt:
    organisation:organization
    belfort:Belfast
    hanly:manly
    chau:char
    organisation:organization
    wallpole:wall pole
    especally:especially
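For illustration, a report in this shape could be produced from the output of the spell check function roughly as follows. This is a sketch under assumptions, not the code I actually use: the helper `format_report` is hypothetical, it expects a mapping from file name to the list of (word, suggestion) pairs found in that file, and it writes one entry per line.

```python
def format_report(results):
    """Render {file name: [(word, suggestion), ...]} as report text.

    Each file name is followed by one "word:suggestion" line per flagged word.
    """
    lines = []
    for file_name in sorted(results):
        errors = results[file_name]
        if not errors:
            continue  # nothing was flagged in this file
        lines.append(file_name + ":")
        for word, suggestion in errors:
            # Assumption: a missing suggestion (None) is written as an empty field.
            lines.append("{}:{}".format(word, suggestion or ""))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Example call with two entries taken from the sample above.
report = format_report({
    "1004.0.txt": [("clonbrook", "cloakroom")],
    "1000.0.txt": [("preists", "priests"), ("oct", "cot")],
})
```

Keeping the file name on its own line makes it easy to go back to the transcription and check a doubtful suggestion in context before accepting it.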