Abstract
In Chinese, there is no delimiter separating successive words in a sentence. Chinese word segmentation, the process of identifying word boundaries in text, is an essential step in Chinese language processing. There are different word segmentation algorithms. However, because of the irregularities of syntactic and semantic features in Chinese, it is difficult to achieve a word segmentation accuracy of 100%. To address this problem, a statistical learning approach to improving the accuracy of Chinese word segmentation is proposed. Based on statistical correlations between incorrectly segmented strings and their contexts, a number of rules governing the modification of incorrectly segmented strings into correct ones are constructed. These rules can be applied to the output of an automatic word segmentation algorithm to modify the segmentation results and make them more accurate. Experimental results show that this approach is a practical method to assist in the process of automatic Chinese word segmentation. The rules constructed in the experiment are studied and the limitations of this approach are discussed.
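The abstract does not state the form of the rules or the statistics used to select them, so the following Python sketch is only one plausible reading of the idea, not the authors' method. It assumes a rule is keyed on an incorrectly segmented token sequence plus one token of context on each side, and is kept when it occurs often enough and is reliable enough in training data. The names `diff_spans`, `learn_rules`, `apply_rules`, the thresholds `min_count` and `min_conf`, and the example sentences are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # assumed sentence-boundary markers

def diff_spans(auto, gold):
    """Yield (i, j, fix): auto[i:j] covers the same characters as the gold tokens `fix`."""
    assert "".join(auto) == "".join(gold)
    i = k = 0
    while i < len(auto):
        if auto[i] == gold[k]:
            i, k = i + 1, k + 1
            continue
        # Grow both sides until they cover the same number of characters.
        j, m = i + 1, k + 1
        a_len, g_len = len(auto[i]), len(gold[k])
        while a_len != g_len:
            if a_len < g_len:
                a_len += len(auto[j]); j += 1
            else:
                g_len += len(gold[m]); m += 1
        yield i, j, tuple(gold[k:m])
        i, k = j, m

def context(tokens, i, j):
    """Pattern key: (left neighbour, token span, right neighbour)."""
    left = tokens[i - 1] if i > 0 else BOS
    right = tokens[j] if j < len(tokens) else EOS
    return left, tuple(tokens[i:j]), right

def learn_rules(pairs, min_count=2, min_conf=0.8):
    """Build correction rules from (automatic, reference) segmentation pairs."""
    fixes = defaultdict(Counter)
    for auto, gold in pairs:
        for i, j, fix in diff_spans(auto, gold):
            fixes[context(auto, i, j)][fix] += 1
    # Count how often each error pattern appears in the automatic output at all,
    # including the places where it was actually correct.
    widths = {len(key[1]) for key in fixes}
    seen = Counter()
    for auto, _ in pairs:
        for w in widths:
            for i in range(len(auto) - w + 1):
                key = context(auto, i, i + w)
                if key in fixes:
                    seen[key] += 1
    rules = {}
    for key, counter in fixes.items():
        fix, n = counter.most_common(1)[0]
        if n >= min_count and n / seen[key] >= min_conf:
            rules[key] = fix
    return rules

def apply_rules(tokens, rules):
    """Rewrite a segmentation left to right, trying longer patterns first."""
    widths = sorted({len(key[1]) for key in rules}, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for w in widths:
            if i + w <= len(tokens):
                fix = rules.get(context(tokens, i, i + w))
                if fix is not None:
                    out.extend(fix)
                    i += w
                    break
        else:  # no rule matched at position i
            out.append(tokens[i])
            i += 1
    return out

# Toy training data (illustrative only, not taken from the paper).
pairs = [
    (["研究生", "命", "起源"], ["研究", "生命", "起源"]),
    (["研究生", "命", "起源", "问题"], ["研究", "生命", "起源", "问题"]),
    (["他", "是", "研究生"], ["他", "是", "研究生"]),
]
rules = learn_rules(pairs)
print(apply_rules(["研究生", "命", "起源", "理论"], rules))
```

Keying each rule on its neighbouring tokens is what lets the sketch leave a correct occurrence of 研究生 untouched while still repairing the mis-split sequence; a real system would need far more training data and a more careful choice of context and thresholds.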
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 87-92 |
Number of pages | 6 |
Journal | Literary and Linguistic Computing |
Volume | 11 |
Issue number | 2 |
Publication status | Published - 1 Jan 1996 |
Externally published | Yes |
ASJC Scopus subject areas
- Information Systems
- Language and Linguistics
- Linguistics and Language