Abstract
Although Cantonese is the most influential variety of Chinese other than Mandarin, only a limited number of Cantonese corpora are available for linguistic studies. Among the essential steps in building a corpus, word segmentation is necessary but highly challenging because Cantonese lacks clear word boundaries. This paper reports the construction and evaluation of an open-source automatic word segmenter for Cantonese. The tool is a component of the multilingual SPPAS program designed to be used directly by linguists. It is free software distributed under a GPL license. The tool was evaluated by comparing manual segmentation of samples from a spoken Cantonese corpus with the automatic segmentation produced by the tool. The evaluation found high precision and recall. Upon completion, the tool is expected to promote the development of more Cantonese corpora for language-related studies.
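The abstract does not state how precision and recall were computed. As a minimal sketch, assuming the common span-based definition (a word counts as correct only when both of its boundaries match the manual segmentation), the comparison could look like the following. The function names and the example sentence are hypothetical and are not taken from the paper or from SPPAS.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall(reference, hypothesis):
    """Word-level precision and recall of a hypothesis segmentation.

    Assumption: a hypothesis word is correct only if the reference
    contains a word with exactly the same start and end offsets.
    """
    if "".join(reference) != "".join(hypothesis):
        raise ValueError("both segmentations must cover the same text")
    ref_spans = word_spans(reference)
    hyp_spans = word_spans(hypothesis)
    correct = len(ref_spans & hyp_spans)
    precision = correct / len(hyp_spans) if hyp_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    return precision, recall


# Made-up illustration, not data from the paper.
manual = ["我", "鍾意", "食", "點心"]          # reference (manual) segmentation
automatic = ["我", "鍾意", "食", "點", "心"]   # hypothetical tool output
p, r = precision_recall(manual, automatic)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this illustration, three of the five automatically produced words match the four manual words, so over-segmenting a single word lowers precision more than recall.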
Original language | English |
---|---|
Title of host publication | 2015 International Conference Oriental COCOSDA - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, O-COCOSDA/CASLRE 2015 - Proceedings |
Publisher | IEEE |
Pages | 196-201 |
Number of pages | 6 |
ISBN (Electronic) | 9781467382793 |
Publication status | Published - 14 Dec 2015 |
Event | 18th Annual International Conference Oriental COCOSDA, O-COCOSDA 2015 - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, CASLRE 2015 - Shanghai Jiao Tong University, Shanghai, China. Duration: 28 Oct 2015 → 30 Oct 2015 |
Conference
Conference | 18th Annual International Conference Oriental COCOSDA, O-COCOSDA 2015 - Held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation, CASLRE 2015 |
---|---|
Country/Territory | China |
City | Shanghai |
Period | 28/10/15 → 30/10/15 |
Keywords
- automatic
- Cantonese
- corpus
- segmentation
- software
ASJC Scopus subject areas
- Linguistics and Language
- Computer Science Applications
- Software
- Language and Linguistics