Cleaning a messy corpus

Working with the 1916 data I found (what people with experience have always told me) that cleaning of your data is an essential step. It could be even the most important step. Inconsistent, messy, and fault leads to problems and wrong results in the analysis and interpretation stages of your research.

In regards to the 1916 letters wrong spelling, inconsistent markup and comments in the text, inconsistent metadata are all sources for error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke a former Mphil student here in TCD did recently a topic modelling project and talking to her and reading her Mphil thesis was very helpful. Furthermore,I found the O’Reilly Bad Data Handbook an interesting read.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s