Term Frequency-Inverse Document Frequency, or tf-idf for short, is a way to measure how important a term is to a document within a corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by how frequently the word occurs across the corpus, so words that are common everywhere receive a low weight.
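To make the definition concrete, here is a minimal sketch in plain Python. It assumes one common variant of the formula (raw term count multiplied by the natural logarithm of N / document frequency); other variants exist. The function name tf_idf and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Raw count of `term` in `document` times log(N / document frequency)."""
    tf = Counter(document)[term]
    # Number of documents in the corpus that contain the term at least once
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical toy corpus of already tokenized documents
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

doc = corpus[0]
score_the = tf_idf("the", doc, corpus)  # occurs in every document
score_cat = tf_idf("cat", doc, corpus)  # occurs in two of three documents
score_mat = tf_idf("mat", doc, corpus)  # occurs only in this document

print(score_the, score_cat, score_mat)
```

Although "the" appears twice in the first document, it occurs in every document of the corpus, so its weight drops to zero, while the rarer "mat" gets the highest score.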
With gensim, tf-idf weights can be calculated using the TfidfModel class from gensim.models.tfidfmodel:
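from gensim import models
# corpus: an iterable of bag-of-words vectors, assumed to be defined beforehand
tfidf = models.TfidfModel(corpus)
doc_bow = [(0, 1), (1, 1)]  # bag-of-words, for instance created by Dictionary.doc2bow
print(tfidf[doc_bow])
# Result: [(0, 0.70710678), (1, 0.70710678)]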
This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A corpus in “bag-of-words” form, where each document is a list of (word-id, frequency) tuples, is used to initialize the model, and the TfidfModel class then transforms such a vector into “TfIdf real-valued weights”.
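The value 0.70710678 in the result is 1/sqrt(2), which suggests that the weighted vector is scaled to unit length. The following sketch imitates the transformation in plain Python to show where that number can come from. It assumes a base-2 logarithm for the idf and L2 normalization, which is believed to correspond to gensim's default settings but should be checked against the gensim documentation. The helper names fit_idf and transform and the small corpus are hypothetical and not part of gensim.

```python
import math


def fit_idf(corpus):
    """Compute log2(N / df) for every term id in a bag-of-words corpus.

    Assumes each term id appears at most once per document vector.
    """
    doc_freq = {}
    for bow in corpus:
        for term_id, _count in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    n_docs = len(corpus)
    return {t: math.log2(n_docs / df) for t, df in doc_freq.items()}


def transform(bow, idf):
    """Weight each count by its idf, then scale the vector to unit length."""
    weighted = [(t, count * idf.get(t, 0.0)) for t, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    # Drop zero-weight entries so the result stays sparse
    return [(t, w / norm) for t, w in weighted if w != 0.0]


# Hypothetical corpus: three documents as (word-id, frequency) tuples
corpus = [
    [(0, 1), (1, 1)],
    [(2, 1)],
    [(2, 2), (3, 1)],
]

idf = fit_idf(corpus)
weights = transform([(0, 1), (1, 1)], idf)
print(weights)
```

In this corpus, word ids 0 and 1 each occur in exactly one document, so they receive the same idf. After normalization, two equal weights each become 1/sqrt(2), regardless of the idf value itself.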