To think of a novel or other piece of literature as data requires a leap. Whereas processing numbers mathematically is straightforward, quantifying aspects of written language is more challenging.
One technique is to measure how rare a word is compared with today’s usage. This is possible with a list drawn from Google’s “Trillion Word Corpus”, in which the 10,000 most common words are ranked in order of commonality. By checking each word a historical author used against those 10,000 words, it is possible to assign each one a ‘rareness’ value.
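The lookup described above can be sketched in Python. This is a minimal illustration, not the original method. It assumes the ranked list is available as a Python list ordered from most to least common, with no duplicates. It also assumes that a word's 'rareness' is simply its rank in that list, and that words missing from the list get a rank one past the end. The function names and the five-word `SAMPLE_RANKED` list are hypothetical stand-ins for the real 10,000-word list.

```python
import re


def build_rank_index(ranked_words):
    """Map each word to its 1-based rank; rank 1 is the most common word."""
    return {word.lower(): rank for rank, word in enumerate(ranked_words, start=1)}


def rarity_scores(text, rank_index, unlisted_rank=None):
    """Return (word, rank) pairs for every word in the text.

    Assumption: a word absent from the list is treated as rarer than
    every listed word, so it gets a rank one past the end of the list.
    """
    if unlisted_rank is None:
        unlisted_rank = len(rank_index) + 1
    # Simple tokenizer: lowercase runs of ASCII letters only, so
    # contractions and hyphenated words are split at the punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(token, rank_index.get(token, unlisted_rank)) for token in tokens]


def mean_rarity(text, rank_index):
    """Average rank over all words in the text; higher means rarer vocabulary."""
    scores = rarity_scores(text, rank_index)
    if not scores:
        return 0.0
    return sum(rank for _, rank in scores) / len(scores)


# Hypothetical stand-in for the real ranked list of 10,000 common words.
SAMPLE_RANKED = ["the", "of", "and", "to", "a"]

index = build_rank_index(SAMPLE_RANKED)
scores = rarity_scores("The whale and the sea", index)
```

With the real list loaded in place of `SAMPLE_RANKED`, `mean_rarity` would give one rough figure per text, which could then be compared across authors. Treating every unlisted word as equally rare is a simplification; a finer measure would need a longer frequency list.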