One way to improve performance when processing a big text corpus, memory use in particular, is to use streaming methods. Streaming basically means keeping the data in a file and reading it only when it is needed, instead of holding all of it in memory.
Recently I looked into the gensim library, a library for topic modelling with Python, which provides easy ways to save and load text corpora, dictionaries, etc. The gensim tutorial also suggests creating a corpus object that uses a streaming method:
    class MyCorpus(object):
        def __iter__(self):
            for line in open('mycorpus.txt'):
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())
This corpus class reads the lines directly from a text file instead of keeping the whole text in memory. A MyCorpus instance is fairly small, because it holds nothing more than a reference to 'mycorpus.txt'. This is very memory efficient.
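To see the idea without installing gensim, here is a minimal sketch that only uses the standard library. The names StreamingCorpus and write_demo_corpus are my own, and a collections.Counter stands in for gensim's dictionary.doc2bow, so the output is a word count per document rather than gensim's list of (id, count) pairs.

```python
import os
import tempfile
from collections import Counter


class StreamingCorpus:
    """Yield one bag-of-words per line, reading the file lazily."""

    def __init__(self, path):
        # Only the path is stored, never the file contents.
        self.path = path

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus can be
        # walked through as many times as needed.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Assumes one document per line, tokens separated by whitespace.
                yield Counter(line.lower().split())


def write_demo_corpus():
    """Write a tiny three-document corpus to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write("Human machine interface\nGraph of trees\nGraph minors survey\n")
    return handle.name


demo_path = write_demo_corpus()
corpus = StreamingCorpus(demo_path)

# Two full passes over the same corpus object; each pass rereads the file.
first_pass = list(corpus)
second_pass = list(corpus)

os.remove(demo_path)
```

Calling list() here is only for the demonstration and pulls everything into memory again. In real use you would loop over the corpus directly, so that only one document is held at a time.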
I tried to use a similar approach for my TxtCorpus class. However, my corpus does not read from a text file. Instead, I pickled a dictionary of instances of my Letter class. Each 1916 Letter is an object that gets pickled and stored. The TxtCorpus class retrieves the letters, or the data stored in them. In my example below, the method get_txt() returns the transcriptions:
    class TxtCorpus(object):
        def __init__(self, file_name):
            self.file = file_name

        def __iter__(self):
            for key, item in item_from_pickle(self.file).items():
                # returns the transcriptions stored in the Letter's instance
                yield item.get_txt()
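One caveat: if item_from_pickle unpickles the whole dictionary in one go (its implementation is not shown here, so this is an assumption), then all letters sit in memory while the loop runs, and only the TxtCorpus object itself stays small. A possible variation, sketched below, is to pickle each record separately into the same file and load the records back one at a time. The names dump_records and StreamedTxtCorpus are hypothetical, and plain dicts with a "txt" key stand in for Letter instances to keep the example self-contained.

```python
import os
import pickle
import tempfile


def dump_records(records, path):
    """Pickle each record on its own, one after another, into a single file."""
    with open(path, "wb") as handle:
        for record in records:
            pickle.dump(record, handle)


class StreamedTxtCorpus:
    """Yield one transcription at a time from a file of consecutive pickles."""

    def __init__(self, file_name):
        self.file = file_name

    def __iter__(self):
        with open(self.file, "rb") as handle:
            while True:
                try:
                    # Each call reads exactly one pickled record from the file.
                    record = pickle.load(handle)
                except EOFError:
                    # No more records left in the file.
                    break
                # With real Letter instances this would be record.get_txt().
                yield record["txt"]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "letters.pickle")
    dump_records(
        [{"id": 1, "txt": "Dear Mary"}, {"id": 2, "txt": "Dear John"}],
        demo_path,
    )
    texts = list(StreamedTxtCorpus(demo_path))
```

The trade-off is that the letters can no longer be looked up by key without scanning the file, so this only fits the case where the corpus is read from start to end.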