Create a manual Chinese word segmentation dataset using crowdsourcing method

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

1 Citation (Scopus)

Abstract

WordSegCHC 1.0 is a manual Chinese word segmentation dataset built through eight crowdsourcing tasks conducted on the Crowdflower platform. It contains manual word segmentation data for 152 Chinese sentences whose lengths range from 20 to 46 characters, excluding punctuation. Each sentence received 200 segmentation responses in its corresponding crowdsourcing task, of which 123 to 143 were valid, so each sentence was segmented by more than 120 subjects. We also propose an evaluation measure, the manual segmentation error rate (MSER), to assess the dataset; the MSER of the dataset is very low, which indicates reliable data quality. In this work, we applied the crowdsourcing method to the Chinese word segmentation task, and the results confirm again that crowdsourcing is a promising tool for linguistic data collection. The framework for crowdsourced linguistic data collection used here can be reused in similar tasks. To the best of our knowledge, the resulting dataset fills a gap in Chinese language resources, and it has potential applications in research on the word intuition of Chinese speakers and in Chinese language processing.

Original language: English
Title of host publication: Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing, SIGHAN 2015 - co-located with 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL IJCNLP 2015
Editors: Liang-Chih Yu, Zhifang Sui, Yue Zhang, Vincent Ng
Publisher: Association for Computational Linguistics (ACL)
Pages: 7-14
Number of pages: 8
ISBN (Electronic): 9781941643570
Publication status: Published - 2015
Event: 8th SIGHAN Workshop on Chinese Language Processing, SIGHAN 2015 - Beijing, China
Duration: 30 Jul 2015 → 31 Jul 2015

Publication series

Name: Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing, SIGHAN 2015 - co-located with 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL IJCNLP 2015

Conference

Conference: 8th SIGHAN Workshop on Chinese Language Processing, SIGHAN 2015
Country/Territory: China
City: Beijing
Period: 30/07/15 → 31/07/15

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language
