Category Archives: Data Storage

Better Performance – Text Streaming

One way to get faster performance when processing a big text corpus is to use streaming. Streaming basically means keeping the data stored in a file and accessing it only when necessary, instead of holding all the data in memory.

Recently I looked into the gensim library, a library for topic modelling with Python. It provides easy ways to save and load text corpora, dictionaries etc. In their tutorial they also suggest creating a corpus object that uses a streaming method:

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

This corpus class reads the lines directly from a text file instead of keeping the whole text in memory. A MyCorpus instance is fairly small, because it holds just a reference to ‘mycorpus.txt’, which makes it very memory efficient.
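The same streaming pattern works without gensim. Here is a minimal self-contained sketch (the file name, tokenisation, and the StreamingCorpus name are just for illustration): because __iter__ re-opens the file each time, only one line is ever held in memory, and the corpus can be iterated over repeatedly.

```python
import os
import tempfile

class StreamingCorpus(object):
    """A corpus that streams documents from a file, one per line."""

    def __init__(self, file_name):
        # Only the file name is stored, not the file's contents.
        self.file_name = file_name

    def __iter__(self):
        # Re-open the file on every iteration, so the corpus
        # can be looped over any number of times.
        with open(self.file_name) as f:
            for line in f:
                # one document per line, tokens separated by whitespace
                yield line.lower().split()

# Write a tiny two-document corpus to a temporary file.
tmp = os.path.join(tempfile.mkdtemp(), 'mycorpus.txt')
with open(tmp, 'w') as f:
    f.write('First Document\nSecond Document\n')

corpus = StreamingCorpus(tmp)
print(list(corpus))  # [['first', 'document'], ['second', 'document']]
print(list(corpus))  # same again; iterating does not consume the corpus
```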
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file; instead I pickled a dictionary of instances of my Letter class. Each 1916 Letter is an object that gets pickled and stored, and the TxtCorpus class retrieves them, or data stored in them. In the example below the method get_txt() returns the transcriptions:

class TxtCorpus(object):
    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        for key, item in item_from_pickle(self.file).items():
            # returns the transcriptions stored in the Letter's instance
            yield item.get_txt()
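item_from_pickle is my own helper; a minimal sketch of how it might look, assuming the whole dictionary of Letter objects was written to disk with pickle.dump (the Letter class here is a toy stand-in for the real one):

```python
import os
import pickle
import tempfile

class Letter(object):
    """Toy stand-in for the real Letter class (illustration only)."""

    def __init__(self, txt):
        self.txt = txt

    def get_txt(self):
        # Return the letter's transcription.
        return self.txt

def item_from_pickle(file_name):
    # Load the dictionary of Letter objects back from disk.
    with open(file_name, 'rb') as f:
        return pickle.load(f)

# Round-trip: pickle a dictionary of Letter instances, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'letters.pkl')
with open(path, 'wb') as f:
    pickle.dump({'L1': Letter('Dear John'), 'L2': Letter('Dear Mary')}, f)

letters = item_from_pickle(path)
print(letters['L1'].get_txt())  # Dear John
```

Note that pickle.load reads the whole dictionary back into memory at once, so the main saving with this approach is avoiding the importer run rather than memory.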

Storing the Letter Objects: Python’s Shelve

Recently I looked into ways to store my Python Letter objects to a file after they are created. This has two advantages:

  1. Increased performance, because the 850 objects do not have to be kept in memory.
  2. The importer function takes quite a bit of time – about 20 seconds. If I want to run the whole program several times for testing, it is very annoying to wait 20 seconds for the importer on each run (the reason why it takes so long is discussed in another post). My data won’t change (at least not during testing), and therefore it is handy to load it directly from a file instead of running the importer module.

I found that Python’s pickle and shelve libraries were useful tools to work with. A good tutorial on shelve can be found in O’Reilly’s book Programming Python, or on the Module of the Week blog. The shelve module is great because it allows you to store objects in a dictionary-like way, where the objects can be fetched by keys.
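A short sketch of that dictionary-like usage, again with a toy Letter class standing in for the real one (the shelf path is just for illustration): objects are stored under string keys and can be fetched back by key on a later run, without re-running the slow importer.

```python
import os
import shelve
import tempfile

class Letter(object):
    """Toy stand-in for the real Letter class (illustration only)."""

    def __init__(self, txt):
        self.txt = txt

    def get_txt(self):
        return self.txt

db_path = os.path.join(tempfile.mkdtemp(), 'letters_shelf')

# Store objects under string keys, just like a dictionary.
shelf = shelve.open(db_path)
shelf['L1'] = Letter('Dear John')
shelf['L2'] = Letter('Dear Mary')
shelf.close()

# Later (e.g. on the next test run), fetch them by key.
shelf = shelve.open(db_path)
print(shelf['L1'].get_txt())       # Dear John
print(sorted(shelf.keys()))        # ['L1', 'L2']
shelf.close()
```

Under the hood shelve pickles each object, so anything pickle can handle can go on the shelf.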