TY - CHAP
T1 - Practical and Robust Chinese Word Segmentation and PoS Tagging
AU - Huang, Chu-Ren
PY - 2023/12/19
Y1 - 2023/12/19
AB - The ability to automatically segment and PoS tag any Chinese text at any time with high accuracy and recall is a prerequisite for the online processing of Chinese texts. While this goal is within reach, it has yet to be attained even after more than 30 years of Chinese language processing research. Most recent advances in Chinese language processing adopt either stochastic or deep learning models that rely heavily on the availability of training data. As such, these state-of-the-art algorithms are not designed for texts that lack sufficient training data, such as texts on novel topics, in new genres, or from linked heterogeneous sources. In this paper, we propose practical and robust Chinese word segmentation and PoS tagging methodologies to address these challenges. The goals are to achieve real-time adaptation to unfamiliar texts and consistently high quality across heterogeneous texts. For segmentation, we propose a semi-supervised approach that performs online learning with either labeled or unlabeled data. This approach adopts the word boundary decision (WBD) model and can train on only the bigram information of the target article, improving performance in near real time and on heterogeneous texts. For PoS tagging, we introduce the ideas of tagset mapping and active learning. The result is the first realistic Chinese segmentation system able to support a wide range of human language technology (HLT) applications, which will have important implications for Chinese language processing.
KW - Part-of-speech tagging
KW - Word segmentation
KW - Word boundary decision
KW - Active learning approach
KW - Robustness
U2 - 10.1007/978-3-031-38913-9_4
DO - 10.1007/978-3-031-38913-9_4
M3 - Chapter in an edited book (as author)
SN - 978-3-031-38912-2
SN - 978-3-031-38915-3
T3 - Text, Speech and Language Technology
SP - 59
EP - 78
BT - Chinese Language Resources
A2 - Huang, Chu-Ren
A2 - Hsieh, Shu-Kai
A2 - Jin, Peng
PB - Springer
ER -