Building a collation element table for a large Chinese character set in YES

Xiaoheng Zhang, Xiaotong Li

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

Abstract

YES is a simplified stroke-based method for sorting Chinese characters. It is free from stroke counting and grouping, and is thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of Xinhua Dictionary (v11) and Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% for the traditional method. With the added secondary and tertiary sorting levels of stroke layout and Unicode value, a one-to-one relationship between characters and collation elements is guaranteed. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.
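
The three sorting levels named in the abstract (stroke order code, stroke layout, Unicode value) can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the `CollationElement` type, the `TABLE` contents, and the `sort_key` function are hypothetical, the primary and secondary values are made-up placeholders rather than real YES codes, and comparing whole words level by level is borrowed from the general style of multi-level collation rather than taken from the paper.

```python
from typing import NamedTuple


class CollationElement(NamedTuple):
    primary: str    # stroke order code (placeholder strings, not real YES codes)
    secondary: int  # stroke layout code (placeholder integers)
    tertiary: int   # Unicode code point, which is unique per character


# Hypothetical miniature table; primary and secondary values are invented.
TABLE = {
    "一": CollationElement("h", 0, ord("一")),
    "工": CollationElement("hsh", 0, ord("工")),
    "土": CollationElement("hsh", 1, ord("土")),
    "士": CollationElement("hsh", 1, ord("士")),
}


def sort_key(text: str):
    """Build a key that compares all primary codes first, then all
    secondary codes, then all code points (an assumed combination rule)."""
    elements = [TABLE[ch] for ch in text]
    return (
        tuple(e.primary for e in elements),
        tuple(e.secondary for e in elements),
        tuple(e.tertiary for e in elements),
    )


words = ["士", "土", "工", "一"]
print(sorted(words, key=sort_key))
```

Because the last level is the code point itself, no two distinct characters can share a full collation element in this sketch, which mirrors the one-to-one guarantee described above.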
Original language: English
Title of host publication: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data - 14th China National Conference, CCL 2015 and 3rd International Symposium, NLP-NABD 2015, Proceedings
Publisher: Springer Verlag
Pages: 3-14
Number of pages: 12
ISBN (Print): 9783319258157
Publication status: Published - 1 Jan 2015
Event: 14th China National Conference on Chinese Computational Linguistics, CCL 2015 and 3rd International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2015 - Guangzhou, China
Duration: 13 Nov 2015 to 14 Nov 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9427
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 14th China National Conference on Chinese Computational Linguistics, CCL 2015 and 3rd International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2015
Country/Territory: China
City: Guangzhou
Period: 13/11/15 to 14/11/15

Keywords

  • Chinese characters
  • Collation
  • Unicode
  • YES

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
