tf-idf – Term Frequency Inverse Document Frequency

Term Frequency Inverse Document Frequency, or short tf-idf, is a way to measure how important a term is in context of a document or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

With gensim the tf-idf can be calculated using the gensim.models.tfidfmodel:

from gensim import modelst
doc_bow = [(0, 1), (1, 1)] #bag-of-words, for instance created by document2bow
fidf = models.TfidfModel(corpus)
#Result: [(0, 0.70710678), (1, 0.70710678)]

This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words”, list of tuples of word-id and frequency, is used as corpus and TfidfModel class transforms the values into “TfIdf real-valued weights”.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s