Stemming words in the transcriptions

In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested was to use a stemmer.

A stemmer is a program that reduces a word to its word stem or base form. For instance, in English a stemmer removes suffixes such as -ed, -ly, -s, and -ing. Hence ‘walking’, ‘walked’, and ‘walks’ are all reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. One problem, however, is that the stemmer can sometimes be too aggressive and produce stems that are not real words: ‘july’ becomes ‘juli’, ‘county’ becomes ‘counti’, and ‘enclose’ becomes ‘enclos’. This does not affect the analysis, but when presenting the results it is worth checking the stems and restoring the correct spelling.
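To make the word-frequency point concrete, here is a minimal sketch of counting by stem while keeping a readable label for presentation. It assumes you already have (word, stem) pairs, one per token. The function name `stem_frequencies` and the hand-written pairs are only illustrative and are not part of my actual pipeline.

```python
from collections import Counter, defaultdict


def stem_frequencies(pairs):
    """Count tokens per stem and pick a readable label for each stem.

    `pairs` is an iterable of (original_word, stem) tuples, one per token.
    """
    counts = Counter()
    originals = defaultdict(Counter)
    for word, stem in pairs:
        counts[stem] += 1
        originals[stem][word] += 1
    # Label each stem with its most frequent original spelling.
    labels = {s: forms.most_common(1)[0][0] for s, forms in originals.items()}
    return counts, labels


# Hand-written pairs based on the examples above (assumed, not real output).
pairs = [
    ("walking", "walk"), ("walked", "walk"), ("walks", "walk"),
    ("walked", "walk"), ("july", "juli"), ("county", "counti"),
]
counts, labels = stem_frequencies(pairs)
for s, n in counts.most_common():
    print(labels[s], n)
```

The counts are grouped under the stem, so all the ‘walk’ forms are added together, but the label shown in the results is an original spelling such as ‘july’ rather than ‘juli’.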

I used the Porter stemmer from nltk.stem and saved a list pairing each original word with its stemmed form to a file. This allowed me to spot stemming issues. The following is my stemming function:


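from nltk import stem


def stemmer(wordlst):
    """Return a list of (original word, stemmed form) pairs."""
    st = stem.PorterStemmer()
    stem_words = []
    for w in wordlst:
        # Keep the original next to its stem so odd stems are easy to spot.
        stem_words.append((w, st.stem(w)))
    return stem_words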
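For saving the pairs to a file, something along these lines would do. This is a sketch rather than the exact code I ran. The tab-separated layout, the header row, the helper name `save_stem_pairs`, and the choice to keep only the pairs the stemmer actually changed are all assumptions made for illustration.

```python
import csv


def save_stem_pairs(pairs, path):
    """Write the unique (word, stem) pairs that differ to a tab-separated file.

    `pairs` is a list of (word, stem) tuples, such as the output of a
    stemming function. Returns the rows that were written.
    """
    # Drop duplicates and unchanged words so the file is quick to scan.
    changed = sorted({(w, s) for w, s in pairs if w != s})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["word", "stem"])
        writer.writerows(changed)
    return changed
```

Calling it with a list of (word, stem) pairs and a file name of your choice gives a short file to skim, where a stem like ‘juli’ or ‘counti’ stands out next to its original word.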
