Working with the 1916 data I found (what people with experience have always told me) that cleaning of your data is an essential step. It could be even the most important step. Inconsistent, messy, and fault leads to problems and wrong results in the analysis and interpretation stages of your research.
In regards to the 1916 letters wrong spelling, inconsistent markup and comments in the text, inconsistent metadata are all sources for error. I knew from the start of my internship that cleaning the 1916 data would be one of the challenges. I did a bit of research and found very useful tips. Emma Clarke a former Mphil student here in TCD did recently a topic modelling project and talking to her and reading her Mphil thesis was very helpful. Furthermore,I found the O’Reilly Bad Data Handbook an interesting read.