Rethinking Chinese word segmentation: Tokenization, character classification, or wordbreak identification

Chu Ren Huang, Petr Šimon, Shu Kai Hsieh, Laurent Prévot

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

33 Citations (Scopus)

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept we introduce is that Chinese word segmentation is the classification of a string of character boundaries (CBs) into word boundaries (WBs) and non-word boundaries. In Chinese, CBs are delimited and distributed between adjacent characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.
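The framing in the abstract reduces segmentation to a binary labeling problem over character boundaries. The sketch below illustrates that framing only; the boundary classifier here is a hand-labeled stand-in (the positions in `gold_wbs` are hypothetical), not the distributional model proposed in the paper.

```python
# Illustrative sketch of segmentation as character-boundary (CB)
# classification: a CB sits between each pair of adjacent characters,
# and segmenting a string amounts to labeling each CB as a word
# boundary (WB) or a non-word boundary.

def segment(text, is_wb):
    """Split `text` at every CB index i (the boundary between
    text[i] and text[i+1]) for which `is_wb` returns True."""
    words, start = [], 0
    for i in range(len(text) - 1):  # enumerate the CBs
        if is_wb(text, i):
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])  # final word runs to end of string
    return words

# Toy "classifier": a fixed set of WB positions standing in for a
# model that predicts WBs from distributional properties of CBs.
gold_wbs = {0, 2}  # hypothetical labels for this example only
print(segment("我喜欢你", lambda t, i: i in gold_wbs))
# → ['我', '喜欢', '你']
```

In this formulation, any scoring function over CBs (for instance, one estimated from the distribution of character strings surrounding each boundary, as the paper proposes) can be plugged in as `is_wb` without changing the segmentation routine itself.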

Original language: English
Title of host publication: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions
Editors: Sophia Ananiadou
Publisher: Association for Computational Linguistics (ACL)
Pages: 69-72
Number of pages: 4
Publication status: Published - Jun 2007
Externally published: Yes
Event: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007 - Prague, Czech Republic
Duration: 25 Jun 2007 - 27 Jun 2007

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007
Country/Territory: Czech Republic
City: Prague
Period: 25/06/07 - 27/06/07

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

