Quantitative Criteria for Computational Chinese Lexicography: A Study Based on a Standard Reference Lexicon for Chinese NLP: Topic Areas: (D) electronic dictionaries, (h) large corpora

Chu Ren Huang, Zhoa Ming Gao, Claude C.C. Shen, Keh Jian Chen

Research output: Chapter in book / Conference proceeding, Conference article published in proceeding or book, Academic research, peer-review

2 Citations (Scopus)

Abstract

The construction of a standard reference lexicon for Chinese NLP involves two fundamental issues in computational linguistics: the definition of a word and the principled delimitation of the lexicon. We argue that such reference lexicons must be judged by their cross-domain portability, expressive adequacy, and reusability. Principles for lexical selection must therefore also be driven by these criteria. This paper reports the approach and results of our construction of a standard reference lexicon for Chinese NLP, which also serves as the empirical basis for a segmentation standard. Our approach uses a mixture of stochastic and heuristic steps. First, a reference corpus is selected and lexical entries are automatically extracted from it based on a statistically significant threshold. Second, the coverage of the automatically extracted lexicon is enhanced by conceptual primes as well as by comparative studies of machine-readable dictionaries (MRDs) from different Chinese-speaking communities. We show the satisfactory coverage of the resulting lexicon by testing it on randomly accessed texts from the web.
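The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `extract_lexicon` and `token_coverage`, the toy data, and the use of a raw frequency cutoff (`min_count`) in place of the paper's statistical threshold are all assumptions made for illustration. The corpus is assumed to be already segmented into words.

```python
from collections import Counter

def extract_lexicon(segmented_corpus, min_count=2):
    """Step 1 (assumed simplification): keep word types whose corpus
    frequency reaches a cutoff. The paper's actual threshold is
    statistical; a raw count stands in for it here."""
    counts = Counter(tok for sent in segmented_corpus for tok in sent)
    return {word for word, n in counts.items() if n >= min_count}

def token_coverage(lexicon, segmented_text):
    """Fraction of running tokens in the test text found in the lexicon."""
    tokens = [tok for sent in segmented_text for tok in sent]
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)

# Toy reference corpus (hypothetical data, pre-segmented).
reference_corpus = [
    ["我們", "喜歡", "詞典"],
    ["我們", "研究", "詞典"],
    ["語料庫", "很", "大"],
]

# Step 1: automatic extraction by frequency.
lexicon = extract_lexicon(reference_corpus, min_count=2)

# Step 2: enlarge the lexicon with hand-selected entries, standing in for
# conceptual primes and entries found by comparing dictionaries.
supplementary_entries = {"研究"}  # hypothetical supplement
enhanced_lexicon = lexicon | supplementary_entries

# Evaluation: coverage on unseen text (hypothetical test sample).
test_text = [["我們", "研究", "語料庫"], ["詞典"]]
coverage_before = token_coverage(lexicon, test_text)
coverage_after = token_coverage(enhanced_lexicon, test_text)
```

Comparing `coverage_before` with `coverage_after` shows how much the heuristic supplement adds beyond the purely frequency-based extraction, which mirrors the kind of coverage test the abstract mentions.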

Original language: English
Title of host publication: Proceedings of Research on Computational Linguistics Conference XI
Publisher: The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Pages: 87-108
Number of pages: 22
Publication status: Published - Aug 1998
Externally published: Yes

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing
