I cannot call myself a professional, but I have some familiarity with text mining. So far, though, I have used it for linguistics (specifically, comparing translations of Russian and English texts), not for history. That experience showed me that text mining can be a really interesting research tool, and I was curious to see whether it could be useful to my group and tell us anything new about Clemens’ life.
Google Ngram Viewer is one of the text mining tools we are going to use in our project. It is based on Google Books, and it displays a graph showing how often certain words or phrases appeared in the Google Books corpus over time. If more than one word or phrase is entered, Ngram Viewer draws color-coded lines so that the terms can be contrasted and their usage trends compared. I had never used Ngram Viewer before, so I was curious to see how the relationships between topics related to Clemens would be portrayed.
Clemens Tretbar was wounded in the battle of Winchester in 1864, and I wanted to see what kind of diagram I would get if I put words like wound, wounded, hospital, battle, and casualties into Ngram Viewer. As can be seen from the image below, I was interested in a specific time frame (from 1800 to 2000) and searched the American English corpus. The diagram shows a clear spike in the usage of the words battle, wounded, and hospital in 1862-1864, which falls within the years of the Civil War in the United States (1861-1865). These trends did not surprise me; rather, they reassured me that the Civil War was a very brutal and bloody page in American history, and that thousands of people were wounded and spent weeks in field hospitals during these harsh times.
Clearly, text mining tools can be very useful for researchers: they can work with large chunks of text, count words and categorize them based on certain features, cluster features that tend to be associated with each other and show trends in their usage, and even trace words and phrases over time to show how often they were used in different periods. Unfortunately, text mining tools have many drawbacks as well. In my opinion, the most important disadvantage is that computers cannot actually READ the text they are working with. Text is completely meaningless to them; a computer sees no context behind the words and is unable to tell us what the text is about. Also, as Ted Underwood states in his article “Where to start with text mining,” these tools often depend on OCR software that is not always reliable or accurate. Lastly, some text mining tools (Voyant is one of them) are not well adapted to managing large pieces of text and do not allow many modifications. If you want to work with large texts or customize the analysis, you need to be able to program, which is not a skill many people possess.
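To give a sense of what the programming side can look like, here is a minimal Python sketch of the "count words and trace them over time" idea described above. It is only an illustration under my own assumptions: the tiny corpus is invented for the example (it is not real data about Clemens or the Civil War), the function names are hypothetical, and the calculation (a term's share of all words in a given year) is a simplified stand-in, not how Ngram Viewer or Voyant actually work internally.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (year, text) pairs invented purely for illustration.
corpus = [
    (1860, "The town was quiet and the harvest was good."),
    (1863, "After the battle the wounded were carried to the hospital."),
    (1864, "The battle left many wounded, and the hospital was full."),
    (1870, "The hospital in town was rebuilt after the war."),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(corpus, terms):
    """Return {year: {term: share of all words written in that year}}."""
    counts_by_year = defaultdict(Counter)
    for year, text in corpus:
        counts_by_year[year].update(tokenize(text))

    result = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        result[year] = {t: (counts[t] / total if total else 0.0) for t in terms}
    return result

trends = relative_frequency(corpus, ["battle", "wounded", "hospital"])
for year in sorted(trends):
    print(year, trends[year])
```

Note that the script only matches exact word forms: it has no idea that "wound" and "wounded" are related, or what any of the sentences mean, which is exactly the limitation mentioned above.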