Get to know (and use!) your English corpora: BNC, GloWbE, COCA, COHA and more!

Documenting yourself during your terminological research is essential for terminology work, especially if you’re dealing with an unknown topic, regardless of your target language. Corpora gather the works of subject-matter experts, and concordancers allow us to look at terms in their context. They also allow you to see how language varies over time. Corpora 2 through 5 presented here were created by Mark Davies, professor of Linguistics at Brigham Young University (BYU), Utah, USA. Read his University profile here.

  1. The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language in four domains: academic writing, imaginative writing, newspaper texts, and spontaneous conversation. The written part (90%) includes newspapers, periodicals, journals, books, letters, memoranda and essays. The spoken part (10%) includes transcriptions of informal conversations, formal meetings, and radio shows.

How to use it?

Type a word or phrase in the search box and press the Go button to see up to 50 random hits from the corpus. You can search for a single word or a phrase, restrict searches by part of speech, search in parts of the corpus only, and much more. Start using BNC now by clicking here.
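To make the idea of "hits in context" concrete, here is a minimal sketch of a keyword-in-context (KWIC) listing in Python. It is not how the BNC interface is implemented; the function name `kwic`, its parameters, and the sample sentence are all hypothetical, and the text is assumed to be already split into lowercase tokens.

```python
import random

def kwic(tokens, node, width=3, max_hits=50, seed=0):
    """Return up to `max_hits` keyword-in-context lines for `node`.

    Illustrative only: `width` is the number of tokens shown on each side,
    and a random sample is taken when there are more hits than `max_hits`.
    """
    hits = [i for i, tok in enumerate(tokens) if tok.lower() == node.lower()]
    if len(hits) > max_hits:
        # Sample at random, then restore document order for readability.
        hits = sorted(random.Random(seed).sample(hits, max_hits))
    lines = []
    for i in hits:
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        lines.append(f"{left} [{tokens[i]}] {right}".strip())
    return lines

# Made-up sample text, not taken from any corpus.
sample = "the corpus is large and a corpus can be searched by word or phrase"
tokens = sample.split()

for line in kwic(tokens, "corpus", width=2):
    print(line)
```

Each printed line shows one occurrence of the search word in square brackets with a few words of context on either side, which is roughly what a concordancer displays.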

The BNC is available in several editions (you need to apply for approval to download them): the BNC Baby edition (4 million words, 1 million from each of the 4 domains), the BNC Sampler (1 million words), the BNC World Edition (the second edition, 2002), and the BNC XML Edition (the full third edition, 2007).

  2. The Corpus of Global Web-based English, GloWbE (pronounced “globe”), was released in 2013 and has 1.9 billion words from 1.8 million web pages from 20 countries, making it nearly 20 times larger than the BNC.

How to use it?

You can search for words, phrases, grammatical constructions, synonyms, customized lists, and collocates (nearby words, which provide insight into meaning and usage). You can compare British and American English or limit the search to one or two countries (e.g., Australia and South Africa). Start using GloWbE now by clicking here.

  3. The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English. It contains more than 450 million words of text, equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2012 and is updated regularly.

How to use it?

You can search for exact words or phrases, wildcards, lemmas, part of speech, or any combinations of these.  You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings). You can limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions. Start using COCA now by clicking here.
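The collocate search described above can be illustrated with a small Python sketch. This is only an approximation under stated assumptions: the function `collocates` and the sample sentence are hypothetical, the text is naively split on whitespace, and no part-of-speech filtering is done (the real interface can restrict collocates to nouns, adjectives, and so on).

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words found within `window` tokens on either side of `node`.

    Illustrative only: a real corpus tool would also use lemmas and
    part-of-speech tags, which this sketch ignores.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts

# Made-up sample text, not taken from COCA.
text = "she felt faint and a faint smile crossed her face before the faint light faded"
tokens = text.lower().split()

counts = collocates(tokens, "faint", window=2)
print(counts.most_common(3))
```

Widening `window` pulls in more distant neighbours; counting several positions on each side approximates the wider span the interface offers.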

  4. The Corpus of Historical American English (COHA) is the largest structured corpus of historical English. Since March 2015, you can download COHA for use on your own computer. The COHA data includes 385 million words of text in 116,000 different texts from the 1810s to the 2000s, in fiction, popular magazines, newspapers, and non-fiction (books).

How to use it?

You can search by words (grieved), phrases (of no little or faint + noun), lemmas (all forms of words, like sing or tall), wildcards (un*ly or r?n*), and more complex searches such as un-X-ed adjectives or a most + ADJ + NOUN. Notice that from the “frequency results” window you can click on a word or phrase to see it in context in the lower window. Start using COHA now by clicking here.
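To see what wildcard patterns such as un*ly and r?n* select, here is a small Python sketch using the standard library's `fnmatch` module, in which `*` stands for any run of characters and `?` for exactly one character. It is assumed here that the corpus interface treats the two symbols the same way; the helper `match_wildcard` and the word list are hypothetical.

```python
from fnmatch import fnmatchcase

def match_wildcard(words, pattern):
    """Return the words matching a shell-style wildcard pattern.

    `*` matches any run of characters (including none),
    `?` matches exactly one character.
    """
    return [w for w in words if fnmatchcase(w, pattern)]

# Made-up word list, not taken from COHA.
words = ["unlikely", "unruly", "until", "ran", "running", "rental", "run", "rn"]

print(match_wildcard(words, "un*ly"))
print(match_wildcard(words, "r?n*"))
```

The first pattern keeps only words that start with un and end in ly; the second keeps words whose first letter is r and third letter is n, whatever follows.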

Other corpora

  5. Davies also provides a list of other corpora that you might find very useful. Check out his page, which includes:
  • The Hansard Corpus (speeches from the British Parliament)
  • Wikipedia Corpus
  • Time Magazine Corpus
  • Strathy Corpus (Canada)
  • Google books corpora
  • Corpora in Spanish and Portuguese
  6. For corpora in English and other languages, check out my updated section TermFinder (at the bottom of the list).



The British National Corpus (BNC)
Wikipedia: British National Corpus
Full-text corpus data


