The Business Intelligence Blog

Slicing Business Dicing Intelligence

The Jargon of the Novel, Computed  

Scholars in the growing field of digital humanities can tackle this question by analyzing enormous numbers of texts at once. When books and other written documents are gathered into an electronic corpus, one “subcorpus” can be compared with another: all the digitized fiction, for instance, can be stacked up against other genres of writing, like news reports, academic papers or blog posts.

One such research enterprise is the Corpus of Contemporary American English, or COCA, which brings together 425 million words of text from the past two decades, with equally large samples drawn from fiction, popular magazines, newspapers, academic texts and transcripts of spoken English. The fiction samples cover short stories and plays in literary magazines, along with the first chapters of hundreds of novels from major publishers. The compiler of COCA, Mark Davies at Brigham Young University, has designed a freely available online interface that can respond to queries about how contemporary language is used. Even grammatical questions are fair game, since every word in the corpus has been tagged with a part of speech.
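To make the subcorpus comparison concrete, here is a minimal Python sketch of the underlying idea: count how often a word appears per million words in one body of text versus another. This is not COCA's interface or method. The `per_million` helper and the two tiny word lists are hypothetical stand-ins invented for illustration, and a real comparison would need far larger samples.

```python
from collections import Counter


def per_million(tokens, word):
    """Return occurrences of `word` per million tokens (hypothetical helper)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1_000_000 / len(tokens)


# Toy stand-ins for two subcorpora (made-up sentences, not drawn from COCA).
fiction = "she grimaced and he scowled while she grimaced again".split()
news = "the council voted and the mayor spoke after the vote".split()

# Normalizing by corpus size lets samples of different lengths be compared.
fiction_rate = per_million(fiction, "grimaced")
news_rate = per_million(news, "grimaced")

print(f"fiction: {fiction_rate:.0f} per million")
print(f"news:    {news_rate:.0f} per million")
```

Normalizing to a per-million rate is what lets samples of different sizes be set side by side; with part-of-speech tags attached to each word, the same counting could be restricted to, say, verbs only.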

Written by Guru Kirthigavasan

July 31st, 2011 at 6:18 am
