In one of my previous posts I talked about using PyEnchant to regularise word spelling. Another process that was suggested to use a stemmer.
A stemmer is a program that reduced a word to its word stem or base form. For instance, in English a stemmer would remove suffix endings such as -ed, -ly, -s, -ing, and others. Hence, ‘walking’, ‘walked’, ‘walks’ would all be reduced to ‘walk’. This can be very useful when your analysis depends on word frequency. A problem is however that the stemmer can be sometimes too radical and change ‘july:juli’, ‘county:counti’, or ‘enclose:enclos’. This does not effect the analysis, but when presenting the results it might be worth to check the correct spelling.
I implemented a stemmer from nltk.stem and saved a list of the original word and stemmed form to a file. This allowed me to spot stemming issues. Following is my stemming function:
def stemmer(wordlst): st = stem.PorterStemmer() stem_words =  for w in wordlst: stem_words.append((w, st.stem(w))) return stem_words