Word Boundary Decision: An Efficient Approach for Low-Resource Word Segmentation

Yu Wang, Chu-Ren Huang

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

Abstract

Because training data are limited, low-resource word segmentation poses significant challenges for pre-trained language models, which struggle to process new knowledge beyond their training data. Instead of focusing on data augmentation or transfer representations, this paper proposes an efficient approach called Word Boundary Decision (WBD), which redefines the learning goal of word segmentation as segmentation behaviors rather than the segmented units found in the training data. The paper presents experiments across diverse datasets, including social media, medical, patent, Cantonese, and ancient Chinese text. In small-sample tests, WBD enables models to achieve the same performance with substantially less training data. For ancient Chinese, for example, it requires only 3K words to match the baseline F1 score obtained with 20K words, roughly 6.67 times less data. In transfer learning experiments, WBD also significantly enhances the cross-domain performance of pre-trained language models. For instance, WBD increases F1 scores by 2.48% and ROOV by 2.28% for BERT on average. This paper is an initial attempt to enable models to process new knowledge beyond their training data through task formulation.
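As a rough illustration of what learning "segmentation behaviors rather than segmented units" could look like, the sketch below converts a segmented sentence into one binary decision per gap between adjacent characters, then rebuilds the words from those decisions. This is only one reading of the abstract and is not necessarily the authors' formulation; the function names `boundary_decisions` and `apply_decisions` and the example sentence are hypothetical.

```python
from typing import List


def boundary_decisions(words: List[str]) -> List[int]:
    """Map a segmented sentence to one 0/1 decision per character gap.

    Assumption for illustration: 1 means "a word boundary falls in this gap",
    0 means "the two neighbouring characters belong to the same word".
    """
    decisions: List[int] = []
    for i, word in enumerate(words):
        # Gaps inside a word carry no boundary.
        decisions.extend([0] * (len(word) - 1))
        # The gap between this word and the next one is a boundary.
        if i < len(words) - 1:
            decisions.append(1)
    return decisions


def apply_decisions(text: str, decisions: List[int]) -> List[str]:
    """Rebuild the segmentation of `text` from its per-gap decisions."""
    if len(decisions) != max(len(text) - 1, 0):
        raise ValueError("expected exactly one decision per character gap")
    words: List[str] = []
    start = 0
    for gap, is_boundary in enumerate(decisions, start=1):
        if is_boundary:
            words.append(text[start:gap])
            start = gap
    if text:
        words.append(text[start:])
    return words


segmented = ["我", "喜欢", "北京"]  # hypothetical example sentence
labels = boundary_decisions(segmented)
restored = apply_decisions("".join(segmented), labels)
```

Under this framing the training targets describe where to cut rather than which word strings were seen, so the label space does not depend on the vocabulary of the training data.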
Original language: English
Title of host publication: Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
Editors: Nathaniel Oco, Shirley N. Dita, Ariane Macalinga Borlongan, Jong-Bok Kim
Publisher: Tokyo University of Foreign Studies
Pages: 160-169
Publication status: Published - Dec 2024
Event: The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38] - Tokyo University of Foreign Studies, Tokyo, Japan
Duration: 7 Dec 2024 to 9 Dec 2024

Conference

Conference: The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38]
Country/Territory: Japan
City: Tokyo
Period: 7/12/24 to 9/12/24
