MySmarTerms 10: Termhood and Unithood
These two small but powerful terms are widely used in automatic term extraction (ATE), and I wanted to introduce them today as a first contact with the complex world of ATE, which can get highly technical. I wouldn’t dare make up my own definitions and examples, so I think it is better to refer to the experts.
Unithood and termhood refer to the qualities of terms, or as Nakagawa calls them, “two essential aspects of the nature of terms”, and they form part of the tasks carried out during an automated extraction process. As per Heylen and De Hertog, the first task is corpus collection, followed by detection of unithood and termhood, then detection of term variants, and finally evaluation and validation.
Unithood deals with what are called “Multiword Units”, that is, complex terms (mainly noun phrases) that refer to one conceptual unit. Nakagawa explains that “a word has a very solid unithood”, as do “compound words, collocations, and so forth”.
Termhood, Nakagawa points out, refers to the degree to which a linguistic unit is related to domain-specific concepts, and it is usually calculated from term frequency and bias of frequency. The higher the termhood, the better the terminology distinguishes between different domains (Zhang and Wu).
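To make "term frequency and bias of frequency" a bit more concrete, here is a minimal sketch in Python. It is not the formula used by Nakagawa or by Zhang and Wu; it simply compares how often a word occurs in a domain corpus with how often it occurs in a general reference corpus, which is one common way to approximate termhood. The function name `termhood_ratio`, the add-one smoothing, and the two tiny corpora are all assumptions made up for illustration.

```python
from collections import Counter


def termhood_ratio(word, domain_counts, reference_counts, smoothing=1.0):
    """Ratio of smoothed relative frequencies: domain corpus vs. reference corpus.

    Hypothetical helper for illustration only. Values well above 1 suggest the
    word is biased towards the domain; values near 1 suggest a general word.
    """
    domain_total = sum(domain_counts.values())
    reference_total = sum(reference_counts.values())
    # Shared vocabulary size, used so that unseen words never give a zero divisor.
    vocab_size = len(set(domain_counts) | set(reference_counts))
    domain_rf = (domain_counts[word] + smoothing) / (domain_total + smoothing * vocab_size)
    reference_rf = (reference_counts[word] + smoothing) / (reference_total + smoothing * vocab_size)
    return domain_rf / reference_rf


# Toy corpora, invented for this example (not real data).
domain_counts = Counter(
    "the lens corrects the cornea and the lens protects the retina".split()
)
reference_counts = Counter(
    "the cat sat on the mat and the dog sat on the rug".split()
)

lens_score = termhood_ratio("lens", domain_counts, reference_counts)
the_score = termhood_ratio("the", domain_counts, reference_counts)
print(f"lens: {lens_score:.2f}  the: {the_score:.2f}")
```

In this toy setup, "lens" scores clearly higher than "the", because "the" is equally common in both corpora while "lens" only appears in the domain one. Real extractors work with much larger corpora and usually with more refined statistics.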
It seems that the definition of termhood has been challenged in recent years, according to Kara Warburton, who explains that “the classic criterion for termhood in Terminology is that the linguistic expression denotes a concept that is confined to a subject field”. In commercial environments, however, where any linguistic expression that needs to be managed is usually treated as a term, this classical view is not as easily applied.
The formal, more widely known descriptions are given by Kageura and Umino: unithood is “the degree of strength or stability of syntagmatic combinations and collocations” and termhood is “the degree to which a stable lexical unit is related to some domain-specific concepts”.
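The "strength or stability" of a word combination is often approximated with a statistical association measure. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs as one such stand-in for unithood. This is my own illustration and not the measure proposed by Kageura and Umino; the function name `bigram_pmi` and the toy sentence are assumptions invented for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Pointwise mutual information for every adjacent word pair in `tokens`.

    Hypothetical helper for illustration only. A higher PMI means the two
    words occur together more often than their individual frequencies predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy text, invented for this example (not real data).
tokens = (
    "a soft contact lens needs a clean case and "
    "a contact lens needs a soft cloth"
).split()

scores = bigram_pmi(tokens)
print(scores[("contact", "lens")] > scores[("a", "contact")])
```

Here "contact lens" gets a higher score than the accidental pair "a contact". One known weakness of plain PMI is that it overrates pairs that occur only once, so practical systems tend to add a frequency threshold or use a different association measure.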
An easy example for differentiation is provided by Korkontzelos et al.: “in an eye-pathology corpus, ‘soft contact lens’ is a valid term, which has both high termhood and unithood. However, its frequently occurring substring ‘soft contact’ has high unithood and low termhood, since it does not refer to a key domain concept.”
Unithood-based and termhood-based methods generate different information and, according to Korkontzelos et al., it is still unclear which of the two performs better. Therefore, many term extractors use a hybrid approach that combines the strengths of both.
Sources and further reading:
- Heylen, Kris and De Hertog, Dirk. Automatic Term Extraction. In Handbook of Terminology, Volume 1, edited by Hendrik J. Kockaert and Frieda Steurs. John Benjamins Publishing Company.
- Warburton, Kara. Managing terminology in commercial environments. In Handbook of Terminology, Volume 1, edited by Hendrik J. Kockaert and Frieda Steurs. John Benjamins Publishing Company.
- Vu, Thuy, Aw, Ai Ti, and Zhang, Min. Term Extraction through Unithood and Termhood Unification.
- Zhang, Chengzhi and Wu, Dan. Bilingual Terminology Extraction Using Multi-level Termhood.
- Nakagawa, Hiroshi. Experimental evaluation of ranking and selection methods in term extraction. In Recent Advances in Computational Terminology, edited by Didier Bourigault, Christian Jacquemin, and Marie-Claude L’Homme.
- Korkontzelos, Ioannis, Klapaftis, Ioannis P., and Manandhar, Suresh. Reviewing and evaluating automatic term recognition techniques.