Spell checking with PyEnchant

PyEnchant is a Python library for spell checking. As part of my text cleaning process I use PyEnchant to automate the normalisation of words in the 1916 Letters corpus. Cleaning with PyEnchant or similar tools has to be done carefully, because it is very easy to clean too much and "correct" words that were right in the first place. A human-supervised, semi-automated normalisation process is therefore probably the best solution. Thanks to Emma Clarke for suggesting PyEnchant; it is a very useful tool.

With regard to spelling, several issues could negatively influence the outcome of my analysis. The 1916 letters are being transcribed using a crowdsourcing approach. Spelling errors can be introduced during transcription, or the source letters may contain misspellings that the transcriber does not correct. Furthermore, the letters were written at the beginning of the twentieth century by people with very diverse educational backgrounds and from different countries, so in some cases the spelling will naturally differ. An automated spell checker is a useful tool for ensuring some consistency across the collected transcriptions.

My spell check function is part of the cleaner module and currently looks something like this:


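import os
import tempfile

import enchant


def spell_checking(wordlst):
    # Lower-case the personal word list (PWL), since the words to check are lower-case.
    with open(SPELL_CHECK_PWL, "r") as f:
        all_pwl = f.read().lower()
    # One way to create the temporary PWL file (an assumption; any writable path works).
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(all_pwl)
        temp_pwl_file = tmp.name
    d = enchant.DictWithPWL("en_US", temp_pwl_file)
    err = []
    for w in wordlst:
        if not d.check(w):
            try:
                first_sug = d.suggest(w)[0]
                # Ignore suggestions that differ from the word only in case.
                if w != first_sug.lower():
                    err.append((w, first_sug))
            except IndexError:
                # The checker has no suggestion for this word.
                err.append((w, None))
    os.remove(temp_pwl_file)
    return err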

The result will be a file that contains a list of suspected spelling errors, each with a guess at the correction. The global variable SPELL_CHECK_PWL refers to a personal word list (PWL) file. I add a word to the PWL every time the spell checker flags a word that is actually correct and that I do not want corrected.
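
Writing the returned list to a result file could be sketched roughly as follows. This is only an illustration: the function name write_report, the dictionary of per-file results and the "file name, then word:suggestion lines" layout are my assumptions, not necessarily what the cleaner module does.

```python
import io


def write_report(results, out):
    """Write {filename: [(word, suggestion), ...]} as 'word:suggestion' lines.

    Hypothetical helper: 'results' maps a transcription file name to the
    list returned by spell_checking() for that file.
    """
    for filename, errors in results.items():
        if not errors:
            continue  # skip files without suspected errors
        out.write(filename + ":\n")
        for word, suggestion in errors:
            # spell_checking() stores None when there is no suggestion;
            # leave the right-hand side empty in that case.
            out.write("{}:{}\n".format(word, suggestion if suggestion is not None else ""))
        out.write("\n")  # blank line between files


# Example with an in-memory buffer instead of a real file.
buf = io.StringIO()
write_report({"1004.0.txt": [("clonbrook", "cloakroom")]}, buf)
report_text = buf.getvalue()
```

Passing an open file object instead of the StringIO buffer would write the same text to disk.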

A sample from the result file looks something like this:

1000.0.txt:
barrington:Harrington
oct:cot
preists:priests
glendalough:Glendale
glenlough:unploughed
irelands:ire lands

1004.0.txt:
clonbrook:cloakroom

1006.0.txt:
organisation:organization
belfort:Belfast
hanly:manly
chau:char
organisation:organization
wallpole:wall pole
especally:especially
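
Many of these suggestions are wrong (place names and surnames in particular), which is why the human-supervised step matters. A minimal sketch of that review step, assuming the layout of the sample above and header lines ending in ".txt:", might look like this. The names parse_report and accepted_corrections and the idea of a "rejected" set are hypothetical.

```python
def parse_report(text):
    """Parse report text into {filename: [(word, suggestion), ...]}."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(".txt:"):
            # Assumption: header lines are file names ending in '.txt:'.
            current = line[:-1]
            results[current] = []
        elif current is not None:
            word, _, suggestion = line.partition(":")
            # An empty right-hand side means "no suggestion".
            results[current].append((word, suggestion or None))
    return results


def accepted_corrections(results, rejected):
    """Drop every suggestion whose word a human reviewer has rejected."""
    return {
        filename: [(w, s) for (w, s) in errors if w not in rejected]
        for filename, errors in results.items()
    }


sample = "1000.0.txt:\nbarrington:Harrington\npreists:priests\n\n1004.0.txt:\nclonbrook:cloakroom\n"
parsed = parse_report(sample)
# 'barrington' and 'clonbrook' are proper names, so the reviewer rejects them.
kept = accepted_corrections(parsed, {"barrington", "clonbrook"})
```

Rejected words are good candidates for the personal word list, so that the checker stops flagging them on the next run.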