Automatic word segmentation for spoken Cantonese

Suk Yee Roxana Fung, Brigitte Bigi

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

5 Citations (Scopus)

Abstract

Though Cantonese is the most influential variety of Chinese other than Mandarin, only a limited number of Cantonese corpora are available for linguistic studies. Among the essential steps of building a corpus, word segmentation is necessary but highly challenging because Cantonese lacks clear word boundaries. This paper reports the construction and evaluation of an open-source automatic word segmenter for Cantonese. The tool is a component of the multilingual SPPAS program and is designed to be used directly by linguists. It is free software distributed under a GPL license. The effectiveness of the tool was evaluated by segmenting samples of a spoken Cantonese corpus both manually and automatically with the tool, then comparing the results. High precision and recall were found in our study. Upon completion, the tool is expected to promote the development of more Cantonese corpora for language-related studies.
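The abstract does not say how precision and recall were computed. As a minimal sketch, assuming a common word-level scoring in which an automatically produced word counts as correct only if both of its boundaries match the manual segmentation, the comparison could look like the following. The function names and the example sentence are hypothetical illustrations and are not taken from the paper, its corpus, or SPPAS.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(reference, hypothesis):
    """Word-level precision and recall of `hypothesis` against `reference`.

    Both arguments are lists of words that segment the same character string.
    """
    if "".join(reference) != "".join(hypothesis):
        raise ValueError("both segmentations must cover the same characters")
    ref, hyp = spans(reference), spans(hypothesis)
    correct = len(ref & hyp)  # words whose start and end both match
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Invented example sentence (assumption, for illustration only).
manual = ["我哋", "今日", "去", "飲茶"]          # manual segmentation
automatic = ["我哋", "今日", "去", "飲", "茶"]   # hypothetical tool output
precision, recall = precision_recall(manual, automatic)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented example, three of the five automatic words match the four manual words, so precision is 3/5 and recall is 3/4. Other scoring schemes, such as boundary-level scoring, would give different figures.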
Original language: English
Title of host publication: 2015 International Conference Oriental COCOSDA - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, O-COCOSDA/CASLRE 2015 - Proceedings
Publisher: IEEE
Pages: 196-201
Number of pages: 6
ISBN (Electronic): 9781467382793
Publication status: Published - 14 Dec 2015
Event: 18th Annual International Conference Oriental COCOSDA, O-COCOSDA 2015 - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, CASLRE 2015 - Shanghai Jiao Tong University, Shanghai, China
Duration: 28 Oct 2015 to 30 Oct 2015

Conference

Conference: 18th Annual International Conference Oriental COCOSDA, O-COCOSDA 2015 - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, CASLRE 2015
Country/Territory: China
City: Shanghai
Period: 28/10/15 to 30/10/15

Keywords

  • automatic
  • Cantonese
  • corpus
  • segmentation
  • software

ASJC Scopus subject areas

  • Linguistics and Language
  • Computer Science Applications
  • Software
  • Language and Linguistics
