Abstract
Because training data is limited, low-resource word segmentation poses significant challenges for pre-trained language models, which struggle to handle new knowledge beyond their training data. Rather than relying on data augmentation or representation transfer, this paper proposes an efficient approach called Word Boundary Decision (WBD), which redefines the learning goal of word segmentation as segmentation behaviors rather than the segmented units found in the training data. The paper presents experiments across diverse datasets, including social media, medical, patent, Cantonese, and ancient Chinese text. In small-sample tests, WBD enables models to reach the same performance with substantially less training data; for example, on ancient Chinese it needs only 3K words to match the baseline F1 score obtained with 20K words, around 6.67 times less data. In transfer learning experiments, WBD also significantly improves the cross-domain performance of pre-trained language models. For instance, WBD increases F1 by 2.48% and out-of-vocabulary recall (R_OOV) by 2.28% on average for BERT. This paper is an initial attempt to enable models to process new knowledge beyond their training data through task formulation.
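The abstract does not spell out how WBD encodes its labels, so the following is only an illustrative sketch of the general contrast it describes: labelling segmented units (here, conventional per-character BMES tags) versus labelling segmentation behavior (here, assumed to be one binary boundary decision per gap between adjacent characters). The function names and the 0/1 label scheme are hypothetical and are not taken from the paper.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Unit-oriented view: one B/M/E/S tag per character."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def words_to_boundaries(words: List[str]) -> List[int]:
    """Boundary-oriented view (assumed scheme): one label per gap
    between adjacent characters, 1 = split here, 0 = do not split."""
    labels: List[int] = []
    for i, w in enumerate(words):
        labels.extend([0] * (len(w) - 1))  # gaps inside a word
        if i < len(words) - 1:
            labels.append(1)               # gap between two words
    return labels


def boundaries_to_words(text: str, labels: List[int]) -> List[str]:
    """Rebuild the segmentation from per-gap boundary decisions."""
    if len(labels) != max(len(text) - 1, 0):
        raise ValueError("expected exactly one label per character gap")
    words: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(text[start:i + 1])
            start = i + 1
    if text:
        words.append(text[start:])
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "自然", "语言"]
    print(words_to_bmes(segmented))
    print(words_to_boundaries(segmented))
    print(boundaries_to_words("".join(segmented), words_to_boundaries(segmented)))
```

Under this assumed encoding, the model is trained on whether to split at each position rather than on which word a character belongs to, which is one way to read the abstract's "segmentation behaviors rather than segmented units".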
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation |
| Editors | Nathaniel Oco, Shirley N. Dita, Ariane Macalinga Borlongan, Jong-Bok Kim |
| Publisher | Tokyo University of Foreign Studies |
| Pages | 160-169 |
| Publication status | Published - Dec 2024 |
| Event | The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38] - Tokyo University of Foreign Studies, Tokyo, Japan. Duration: 7 Dec 2024 → 9 Dec 2024 |
Conference
| Conference | The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38] |
|---|---|
| Country/Territory | Japan |
| City | Tokyo |
| Period | 7 Dec 2024 → 9 Dec 2024 |