Practical and Robust Chinese Word Segmentation and PoS Tagging

Research output: Chapter in book / Conference proceeding; Chapter in an edited book (as author); Academic research; peer-review

Abstract

The ability to automatically segment and PoS tag any Chinese text at any time with high accuracy and recall is a prerequisite for the online processing of Chinese texts. While this goal is within reach, it has yet to be attained even after more than 30 years of Chinese language processing research. Most recent achievements in Chinese language processing adopt either stochastic or deep learning models that rely heavily on the availability of training data. As such, these state-of-the-art algorithms are not designed for texts that lack sufficient training data, such as texts on novel topics, in new genres, or from linked heterogeneous sources. In this paper, we propose practical and robust Chinese word segmentation and PoS tagging methodologies to address these challenges. The goals are to achieve real-time adaptation to unfamiliar texts and consistently high quality across heterogeneous texts. For segmentation, we propose a semi-supervised approach, which performs online learning with either labeled or unlabeled data. This approach adopts the word boundary decision (WBD) model and can train on only the bigram information of the target article, improving performance in near real time and on heterogeneous texts. For PoS tagging, we introduce the ideas of tagset mapping and active learning. The result is the first realistic Chinese segmentation system that is able to support a wide range of human language technology (HLT) applications, which will have important implications for Chinese language processing.
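To make the word boundary decision idea concrete, the following is a minimal sketch of one way such a decision could be framed: for every pair of adjacent characters (a character bigram), decide whether a word boundary falls between them. It is an illustrative assumption, not the chapter's actual model. The function names train_wbd and segment, the majority-vote rule, and the treatment of unseen bigrams are all hypothetical, and the sketch covers only the labeled-data case, not the semi-supervised online learning described in the abstract.

```python
from collections import defaultdict


def train_wbd(segmented_sentences):
    """Count, for each adjacent character pair, how often a word boundary
    falls between the two characters in pre-segmented training data.

    segmented_sentences: iterable of sentences, each a list of words.
    Returns a dict mapping bigram -> [no_boundary_count, boundary_count].
    """
    counts = defaultdict(lambda: [0, 0])
    for words in segmented_sentences:
        text = "".join(words)
        # Character offsets at which a new word starts (offset 0 excluded).
        boundaries = set()
        pos = 0
        for word in words[:-1]:
            pos += len(word)
            boundaries.add(pos)
        for i in range(1, len(text)):
            bigram = text[i - 1] + text[i]
            counts[bigram][1 if i in boundaries else 0] += 1
    return counts


def segment(text, counts, unseen_is_boundary=True):
    """Segment text by making one boundary decision per character bigram.

    A boundary is placed when the bigram was seen more often with a boundary
    than without. Unseen bigrams fall back to unseen_is_boundary, which is
    an arbitrary choice made for this sketch.
    """
    if not text:
        return []
    words = [text[0]]
    for i in range(1, len(text)):
        no_boundary, boundary = counts.get(text[i - 1] + text[i], (0, 0))
        if no_boundary + boundary == 0:
            split = unseen_is_boundary
        else:
            split = boundary > no_boundary
        if split:
            words.append(text[i])
        else:
            words[-1] += text[i]
    return words


# Toy usage with two hand-segmented sentences (illustrative data only).
training = [["我", "喜欢", "北京"], ["北京", "很", "大"]]
model = train_wbd(training)
print(segment("我喜欢北京很大", model))
```

Because each decision depends only on local bigram counts, those counts can in principle be updated incrementally as new text arrives, which is one reason a boundary-decision formulation suits near-real-time adaptation. How the chapter actually performs that adaptation is not specified in the abstract.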
Original language: English
Title of host publication: Chinese Language Resources
Subtitle of host publication: Data Collection, Linguistic Analysis, Annotation and Language Processing
Editors: Chu-ren Huang, Shu-Kai Hsieh, Peng Jin
Publisher: Springer
Chapter: 4
Pages: 59-78
ISBN (Electronic): 978-3-031-38913-9
ISBN (Print): 978-3-031-38912-2, 978-3-031-38915-3
DOIs
Publication status: Published - 19 Dec 2023

Publication series

Name: Text, Speech and Language Technology
Publisher: Springer
Volume: 49
ISSN (Print): 1386-291X
ISSN (Electronic): 2542-9388

Keywords

  • Part-of-speech tagging
  • Word segmentation
  • Word boundary decision
  • Active learning approach
  • Robustness
