A Statistical Learning Approach to Improving the Accuracy of Chinese Word Segmentation

Chi Hong Leung, Wing Kay Kan

Research output: Journal article publication, Journal article, Academic research, peer-review

7 Citations (Scopus)

Abstract

In Chinese, there is no delimiter separating successive words in a sentence. Chinese word segmentation, the process of identifying word boundaries in text, is an essential step in Chinese language processing. There are different word segmentation algorithms. However, because of the irregularities of syntactic and semantic features in Chinese, it is difficult to obtain a word segmentation accuracy of 100%. To address this problem, a statistical learning approach to improving the accuracy of Chinese word segmentation is proposed. Based on statistical correlations between incorrectly segmented strings and their contexts, a number of rules governing the modification of incorrectly segmented strings into correct ones are constructed. These rules can be applied to the output of an automatic word segmentation algorithm, modifying the segmentation results to make them more accurate. Experimental results have shown that this approach is a practical method to assist in the process of automatic Chinese word segmentation. The rules constructed in the experiment are studied and the limitations of this approach are discussed.
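The abstract does not give the details of how rules are built or applied, so the following is only a minimal illustrative sketch of the general idea (learn context-dependent correction rules from segmentation errors, then apply them to a segmenter's output), not the authors' method. Everything in it is an assumption: the function names (`diff_spans`, `learn_rules`, `apply_rules`), the one-word left and right context, the `min_count` and `min_conf` thresholds, and the toy sentences, which are invented for illustration.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary markers


def diff_spans(auto, gold):
    """Yield (start, end, gold_tokens) for each span of `auto` tokens whose
    boundaries disagree with `gold`. Assumes both segment the same string."""
    i = j = 0
    while i < len(auto) and j < len(gold):
        if auto[i] == gold[j]:
            i += 1
            j += 1
            continue
        si, sj = i, j
        a_len, g_len = len(auto[i]), len(gold[j])
        i += 1
        j += 1
        # Extend the shorter side until both spans cover the same characters.
        while a_len != g_len:
            if a_len < g_len:
                a_len += len(auto[i])
                i += 1
            else:
                g_len += len(gold[j])
                j += 1
        yield si, i, tuple(gold[sj:j])


def context(tokens, start, end):
    """Return (left word, token span, right word) for tokens[start:end]."""
    left = tokens[start - 1] if start > 0 else BOS
    right = tokens[end] if end < len(tokens) else EOS
    return left, tuple(tokens[start:end]), right


def learn_rules(pairs, min_count=2, min_conf=0.8):
    """Keep a correction when it was observed at least `min_count` times and
    fixes at least `min_conf` of all occurrences of its context pattern."""
    fixes = defaultdict(Counter)
    for auto, gold in pairs:
        for start, end, corrected in diff_spans(auto, gold):
            fixes[context(auto, start, end)][corrected] += 1
    # Count every occurrence of each pattern, whether or not it was an error.
    lengths = {len(key[1]) for key in fixes}
    seen = Counter()
    for auto, _ in pairs:
        for n in lengths:
            for start in range(len(auto) - n + 1):
                key = context(auto, start, start + n)
                if key in fixes:
                    seen[key] += 1
    rules = {}
    for key, corrections in fixes.items():
        corrected, count = corrections.most_common(1)[0]
        if count >= min_count and count / seen[key] >= min_conf:
            rules[key] = corrected
    return rules


def apply_rules(tokens, rules):
    """Rewrite `tokens` left to right, trying the longest patterns first."""
    max_n = max((len(key[1]) for key in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 0, -1):
            if i + n <= len(tokens):
                key = context(tokens, i, i + n)
                if key in rules:
                    out.extend(rules[key])
                    i += n
                    break
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy training data (invented): pairs of (segmenter output, reference).
pairs = [
    (["我们", "研究生", "命", "起源"], ["我们", "研究", "生命", "起源"]),
    (["我们", "研究生", "命", "起源", "问题"], ["我们", "研究", "生命", "起源", "问题"]),
    (["他", "是", "研究生"], ["他", "是", "研究生"]),
]
rules = learn_rules(pairs)
fixed = apply_rules(["我们", "研究生", "命", "起源", "理论"], rules)
print(fixed)  # ['我们', '研究', '生命', '起源', '理论']
```

In this sketch a rule fires only when both neighbouring words match, so the correctly segmented 研究生 in the third toy sentence is left alone. A real system would need a much larger corpus and a considered choice of context size and thresholds.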
Original language: English
Pages (from-to): 87-92
Number of pages: 6
Journal: Literary and Linguistic Computing
Volume: 11
Issue number: 2
Publication status: Published - 1 Jan 1996
Externally published: Yes

ASJC Scopus subject areas

  • Information Systems
  • Language and Linguistics
  • Linguistics and Language
