Abstract
Automatic term extraction is the first step towards the automatic or semi-automatic updating of an existing domain knowledge base. Most prior research applies word segmentation as a preprocessing step for Chinese term extraction. However, segmentation ambiguity is unavoidable, especially when identifying unknown words in Chinese. In this paper, we discuss the effects and limitations of segmentation on Chinese terminology extraction. A detailed study shows that errors propagated from word segmentation have a great impact on the results of terminology extraction. Based on our analysis and experiments, we show that character-based terminology extraction yields much better results than extraction that uses segmentation as a preprocessing step.
Original language | English |
---|---|
Title of host publication | PACLIC 20 - Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation |
Pages | 101-109 |
Number of pages | 9 |
Publication status | Published - 1 Dec 2006 |
Event | 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20 - Wuhan, China; Duration: 1 Nov 2006 → 3 Nov 2006 |
Conference
Conference | 20th Pacific Asia Conference on Language, Information and Computation, PACLIC 20 |
---|---|
Country/Territory | China |
City | Wuhan |
Period | 1/11/06 → 3/11/06 |
Keywords
- Chinese corpus
- Terminology extraction
- Word segmentation
ASJC Scopus subject areas
- Language and Linguistics
- Computer Science (miscellaneous)