Term frequency-inverse document frequency, or tf-idf for short, is a way to measure how important a term is to a document within a corpus. The importance increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
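As a rough illustration, one common textbook formulation of this weighting is given here. It is only one of several variants: libraries differ in the base of the logarithm, in smoothing and in whether the resulting vector is normalised. Here N is assumed to be the number of documents in the corpus, tf(t, d) the number of times term t occurs in document d, and df(t) the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

A term that occurs in every document gets log(N/N) = 0 and therefore no weight, whereas a term that is frequent in one document but rare in the corpus gets a high weight.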
With gensim, tf-idf can be calculated using the TfidfModel class from gensim.models.tfidfmodel:
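from gensim import models
# corpus: an iterable of bag-of-words documents, assumed to have been built beforehand
tfidf = models.TfidfModel(corpus)  # learn the document frequencies from the corpus
doc_bow = [(0, 1), (1, 1)]  # bag-of-words, for instance created by Dictionary.doc2bow
print(tfidf[doc_bow])  # apply the model to a single document
# Result: [(0, 0.70710678), (1, 0.70710678)]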
This example is taken from the gensim tutorial and shows in a few steps how the transformation works. A “bag-of-words” document is a list of tuples of word id and frequency. A collection of such documents is used as the corpus, and the TfidfModel class transforms the frequencies into “TfIdf real-valued weights”.
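To make the numbers less opaque, here is a minimal pure-Python sketch of such a transformation. It assumes a weighting of raw term frequency times log2(N/df), followed by scaling the vector to unit length, which is my understanding of gensim's defaults and not something taken from its source. The toy corpus and the function names document_frequencies and tfidf_vector are invented for this illustration.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count in how many documents each term occurs."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # set(): count each term once per document
    return df


def tfidf_vector(doc, df, num_docs):
    """Return {term: weight} for one document, scaled to unit length."""
    tf = Counter(doc)
    # Terms that never occur in the corpus are skipped (an assumption of this sketch).
    weights = {t: tf[t] * math.log2(num_docs / df[t]) for t in tf if df[t] > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # every term occurs in every document
    return {t: w / norm for t, w in weights.items()}


# Toy corpus, made up for illustration: each document is a list of tokens.
corpus = [
    ["letter", "dublin"],
    ["letter", "war"],
    ["dublin", "war"],
    ["war", "war", "easter"],
]
df = document_frequencies(corpus)
weights = tfidf_vector(["letter", "dublin"], df, len(corpus))
print({t: round(w, 4) for t, w in weights.items()})
# prints {'letter': 0.7071, 'dublin': 0.7071}
```

Both terms occur once in the document and in two of the four documents, so they get the same raw weight. After scaling to unit length each becomes 1/sqrt(2), roughly 0.7071, which is the same pattern as in the result above.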
One part of my internship is an investigation into topic modelling of the 1916 letters. I decided to use Python, because I was already familiar with the language before I started the internship and Python has good libraries for natural language processing and topic modelling. I tested the NLTK and gensim toolkits. NLTK is a well-known toolkit and I use parts of it occasionally. For an introduction I recommend the documentation and the O’Reilly book available via the NLTK website.
The gensim library is a library for ‘topic modelling for humans’, so I hope it is as easy to use and intuitive as it claims to be. It is quickly installed via easy_install or pip, and it is built on NumPy and SciPy, which have to be installed in order to use it.
While researching Natural Language Processing (NLP) and Python I came across a few useful online resources. I want to share them, because some are not so easy to find.
A very good introduction to computer programming in Python is the series of video lectures from the MIT course Introduction to Computer Programming. It moves quickly on to more advanced programming concepts, so for a total beginner it might be worth starting with an interactive introduction, something like Codecademy, or one of the many online tutorials, before watching the MIT course.
For beginners in NLP and Python the O’Reilly NLTK book is a very good introduction that covers both. It is freely available online here.
Macquarie University in Sydney offers a course in Document Processing and the Semantic Web, and a list of the course material is available online:
Resources and Support Materials : COMP348