Create your first corpus and analyze it with AntConc (and related links to explore!)
I think many of us might feel a bit intimidated when we first approach a new tool, but Laurence Anthony (Professor in the Faculty of Science and Engineering at Waseda University, Japan) developed AntConc so skillfully that once you start using it you’ll be hooked for life. It’s so easy to use that it’s almost child’s play, and Professor Anthony created short but detailed videos so you can start using it right away.
I really don’t want to go into much detail because I believe Professor Anthony’s videos are very clear and there are guides to get you started on the right foot, but here is a 7-step guide to get you going.
1. Download AntConc: For Windows, you download an .exe file; for Linux, a tar.gz archive that you need to decompress to find the executable file; for Macintosh, you will also have to decompress a file. Here is the link to download it: http://www.antlab.sci.waseda.ac.jp/software.html
2. Convert your files to the .txt format. You can easily convert Word and PDF files into AntConc-compatible .txt files using AntFileConverter: http://www.laurenceanthony.net/software/antfileconverter/ and, if you already have your data in .txt files, you can change their encoding to UTF-8 using EncodeAnt: http://www.laurenceanthony.net/software/encodeant/
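If you are comfortable with a little scripting, the re-encoding part of this step can also be sketched in a few lines of Python. This is only a minimal illustration, not how EncodeAnt works internally: the function name reencode_to_utf8 and the file names are hypothetical, and the sketch assumes you already know the original encoding of your file (cp1252 is used as a placeholder guess).

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src with the given encoding and save the same text to dst as UTF-8.

    source_encoding is an assumption: replace it with the real encoding
    of your file, otherwise accented characters may come out wrong.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst


# Hypothetical usage, with made-up file names:
# reencode_to_utf8("RECorpus1_original.txt", "RECorpus1.txt")
```

Writing to a new file instead of overwriting the original means you can always go back if the encoding guess turns out to be wrong.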
3. Classify your documents. Let’s say one of your favorite subjects is energy, renewable and nonrenewable. You have solar energy, wind energy and so forth. You can put all of these in one corpus (a single .txt file), but it’s better to divide them into subcorpora (several .txt files): (i) a set for renewable energy named, e.g., “RECorpus1” (for solar energy), “RECorpus2” (for wind energy), and so on; (ii) a set for non-renewable energy named, e.g., “NRECorpus1” (for oil), “NRECorpus2” (for coal), and so on.
This is useful because one task in AntConc allows you to compare your corpus to a reference corpus for each individual topic to analyze word frequencies. See my previous post on English corpora that you can access and use as a reference. You can also use them to start “playing” with AntConc. Also, Mark Davies created the BYU Wikipedia corpus on several subjects: http://corpus.byu.edu/wikipedia.asp
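To give a feel for what comparing a corpus with a reference corpus means, here is a rough Python sketch. It scores words with the log-likelihood statistic, which is one common keyness measure in corpus linguistics; I am not claiming this reproduces AntConc’s own calculation. The function names (tokenize, log_likelihood, keywords) and the very simple tokenizer are my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word.

    a, b: frequency of the word in the target / reference corpus
    c, d: total number of tokens in the target / reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score


def keywords(target_text, reference_text, top=10):
    """Return the words that are unusually frequent in the target text."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    c, d = sum(target.values()), sum(reference.values())
    if c == 0 or d == 0:
        return []
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # Keep only words that are relatively more frequent in the target.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]
```

In practice you would read the text of, say, “RECorpus1” as the target and a general English corpus as the reference; the higher the score, the more strongly the word stands out in your subcorpus.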
4. Download “A simple guide to using AntConc” by Professor Laurence Anthony. I also found this short guide called “A Guide to Using AntConc” by several authors, and Mura Nava in his blog EFLnotes wrote a post called “Building your Own Corpus – First Steps in AntConc.”
5. Watch the videos. Here is the list and their duration (you can see they are short and sweet!). The latest 3.4.0 version has 11 YouTube videos:
—Getting Started (5:58) – explains how to install AntConc.
—Concordance Tool: Basic Features (12:06) and Advanced Features (13:02) – shows search results in a ‘KWIC’ (KeyWord In Context) format.
—Concordance Plot Tool (6:03) – shows search results plotted in a ‘barcode’ format. This allows you to see the position where search results appear in target texts.
—File View Tool (4:15) – shows the text of individual files. This allows you to investigate in more detail the results generated in other tools of AntConc.
—Clusters Tool (9:22) – shows clusters based on the search condition; summarizes the results generated in the Concordance Tool or Concordance Plot Tool.
—N-Grams Tool (5:01) – allows you to find common expressions in a corpus.
—Collocates Tool (7:46) – shows the collocates of a search term. This allows you to investigate non-sequential patterns in language.
—Word List Tool (9:43) – shows all the words in the corpus in an ordered list, e.g., by frequency, so you will probably find function words like “the” and “an” at the top of the list.
—Key List Tool (12:47) – shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus.
—Working with Tagged Data (14:38) – you need an annotated (tagged) corpus to use this function, which basically shows you each word followed, e.g., by its Part-of-Speech (POS) tag, such as “chair_NN”, which indicates that chair is a singular common noun. See the CLAWS tagset here: http://ucrel.lancs.ac.uk/claws1tags.html
As you can see, most videos are under 10 minutes, so you will be able to start using some basic features in AntConc in less than an hour. Start watching on Anthony’s YouTube channel here.
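If you are curious about what some of these tools do under the hood, here is a small Python sketch of three of the ideas from the videos: a KWIC concordance, an n-gram count, and a frequency word list. It is only an illustration under my own assumptions (a very simple tokenizer and the made-up function names tokenize, kwic and ngrams); AntConc’s real implementation has far more options.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def ngrams(tokens, n=2):
    """Count every sequence of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    sample = "Solar energy is clean energy and solar energy is cheap"
    tokens = tokenize(sample)
    # Concordance lines for "energy" with two words of context on each side
    for left, word, right in kwic(tokens, "energy", window=2):
        print(f"{left:>12} | {word} | {right}")
    # Most common two-word sequences
    print(ngrams(tokens, 2).most_common(3))
    # A tiny word list ordered by frequency
    print(Counter(tokens).most_common(3))
```

The word list at the end is the same idea as the Word List Tool: count every token and sort by frequency.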
6. Check out the Lancaster University glossary “Corpus Linguistics: Some Key Terms” (9 pages), which also includes links to other corpus tools (free and paid).
7. Use AntConc’s Google discussion forum to ask questions and find solutions to your problems: https://groups.google.com/forum/#!forum/antconc
If you get really curious and attain a more advanced level, here are some interesting links:
- Lexiconista.com Lemmatization Lists (multilingual): http://www.lexiconista.com/datasets/lemmatization/
- Check out other software by Professor Laurence Anthony: http://www.laurenceanthony.net/software.html
- Tools, concordancers, word and frequency lists, stop lists and much, much more from the University of Wollongong, Australia: http://www.uow.edu.au/~dlee/software.htm
- Google stop words: https://code.google.com/p/stop-words/
- English frequency: http://ucrel.lancs.ac.uk/bncfreq/flists.html
- Word Frequencies in Written and Spoken English: based on the British National Corpus: http://ucrel.lancs.ac.uk/bncfreq/
Sources and Further reading:
- Lancaster University Corpus Linguistics webpage: http://corpora.lancs.ac.uk/clmtp/index.php
- Mura Nava’s EFL blog, particularly his section on Building your own corpus using AntConc and other tools: https://eflnotes.wordpress.com/tag/build-your-own-corpus/ or follow him on Google+: https://plus.google.com/u/0/communities/101266284417587206243 and Twitter: https://twitter.com/muranava
- Follow Professor Laurence Anthony on Twitter: https://twitter.com/antlabjp
- EduTech wiki on AntConc: http://edutechwiki.unige.ch/en/AntConc