
Another Performance Test

Out of personal interest I made another performance test. This time I targeted a more complex function that was written to import a huge table of transcriptions and metadata about all the letters from an Excel file. The import from Excel is very easy thanks to Python's xlrd module. Each row in the Excel file contains the transcription of one letter page plus the metadata of the letter, which means that if a letter has more than one page its metadata is duplicated. My function loops over the rows in the Excel table and creates Letter objects (my custom class to represent letters). The function merges multiple pages of the same letter into the same Letter object, tokenizes the transcriptions, and cleans the text of XML-like markup and punctuation. The final step is to store all the Letter objects in a file using Python's shelve module, so that the objects do not have to stay in memory.
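To make that concrete, here is a minimal sketch of what such an import loop looks like (the Letter class, the clean_transcription() helper, and the column layout are placeholders for my actual code):

import xlrd

book = xlrd.open_workbook(filename=file_path, encoding_override="utf-8")
sheet = book.sheet_by_index(0)

letters = {}
for row_idx in range(1, sheet.nrows):          # skip the header row
    row = sheet.row_values(row_idx)
    letter_id, page_text = row[0], row[1]      # placeholder column layout
    if letter_id not in letters:               # first page of this letter
        letters[letter_id] = Letter(letter_id, row)
    letters[letter_id].add_page(clean_transcription(page_text))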

My approach for the storage is to save all the Letter objects in a dictionary, with the letter ids as keys, and then push the whole dictionary to the shelve file. This worked quite well and was not too slow (considering the amount of letters and what the function has to do).
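In code, the storage step then boils down to a single assignment into the shelve file (the key name "letters" is just an example):

import shelve

db = shelve.open("letters.shelve")
db["letters"] = letters    # one write for the whole dictionary of Letter objects
db.close()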

[Figure: performance_3 – cProfile output for the dictionary-based version]

The function took roughly 19 to 21 seconds (based on three tests) to run over about 2200 entries in the Excel file, create 850 Letter objects, add the letters to a dictionary, clean the transcriptions and merge them if they belong to the same letter, and finally store everything in a shelve file. This is not too bad considering that most of the time (13.3 seconds) was used by the xlrd.open_workbook(filename=file_path, encoding_override="utf-8") function, which retrieves the data from the Excel file and creates a handy workbook object. The decoding in Python's cp1252.py also took quite long. Furthermore, a fair amount of time was spent creating the Letter objects (1.6 seconds) and cleaning and adding the transcriptions (2.6 seconds). There might be some opportunity for optimization here, so I might come back to revising this function later in the project.
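For reference, the timings above come from cProfile; a run like the following sketch produces such statistics (import_letters is a stand-in name for my actual function):

import cProfile
import pstats

cProfile.run("import_letters(file_path)", "import.prof")
stats = pstats.Stats("import.prof")
stats.sort_stats("cumulative").print_stats(10)    # show the ten most expensive calls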

The shelve file now contains a dictionary with all the Letter objects. Out of curiosity I wrote a second function that performs the same steps as the first one, but instead of creating a dictionary and then storing it, it stores each Letter object in the shelve file as soon as it is created. The shelve file is, after all, dictionary-like. I assumed that it might take a bit longer, and was surprised how long it actually took.

[Figure: performance_4 – cProfile output for the write-as-you-go version]

Nearly 60 seconds! That is three times as long as the other function. After a look at the statistics returned by cProfile I found that most of the time was spent calling the keys() method on the shelve file in order to retrieve the keys (= letter ids) that were already stored. A likely explanation is indeed that keys() returns a list rather than a faster data structure like a set: every call has to build a fresh list of all keys from the underlying database, and the membership test then scans that list linearly, so the check inside the loop gets more expensive as the file grows. Still, 40 extra seconds is very significant. Could it also have something to do with DeadlockWrap?


import shelve

d = shelve.open("letters.shelve")

# Then follows the for loop that gets the data from the Excel file and
# prepares it for storage. This is the interesting part: if the letter id
# is not already used as a key in the shelve file, add the new Letter object.
if letter_id not in d.keys():
    d[letter_id] = l
d.sync()    # the shelve file has to be synced to ensure the object is written to the file

d.close()   # and closed after the for loop

Interestingly, replacing the d.keys() lookup with a set s that keeps track of the keys of all stored letter objects (via s.add(letter_id)) cut the runtime to less than half, even though every object is still written directly to the shelve file:

[Figure: performance_5 – cProfile output for the set-based version]
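The revised loop body, sketched below, only touches the shelve file for writing; the membership test now goes against the in-memory set:

import shelve

d = shelve.open("letters.shelve")
s = set()                      # in-memory record of the keys stored so far

# ... inside the for loop over the Excel rows ...
if letter_id not in s:         # constant-time membership test instead of d.keys()
    d[letter_id] = l
    s.add(letter_id)
d.sync()

d.close()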

It is still not as fast as my first solution, and probably never will be, because of the numerous calls to the shelve file and the constant syncing. It was however interesting to see how a small change like replacing d.keys() with the set s can have such a tremendous effect.