TY - GEN
T1 - Building a collation element table for a large Chinese character set in YES
AU - Zhang, Xiaoheng
AU - Li, Xiaotong
PY - 2015/1/1
Y1 - 2015/1/1
N2 - YES is a simplified stroke-based method for sorting Chinese characters. It is free from stroke counting and grouping, and thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of Xinhua Dictionary (v11) and Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% for the traditional method. With secondary and tertiary sorting levels based on stroke layout and Unicode value, a one-to-one relationship between characters and collation elements is guaranteed. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.
AB - YES is a simplified stroke-based method for sorting Chinese characters. It is free from stroke counting and grouping, and thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of Xinhua Dictionary (v11) and Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% for the traditional method. With secondary and tertiary sorting levels based on stroke layout and Unicode value, a one-to-one relationship between characters and collation elements is guaranteed. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.
KW - Chinese characters
KW - Collation
KW - Unicode
KW - YES
UR - http://www.scopus.com/inward/record.url?scp=84952665532&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-25816-4_1
DO - 10.1007/978-3-319-25816-4_1
M3 - Conference article published in proceeding or book
SN - 9783319258157
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 3
EP - 14
BT - Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data - 14th China National Conference, CCL 2015 and 3rd International Symposium, NLP-NABD 2015, Proceedings
PB - Springer Verlag
T2 - 14th China National Conference on Chinese Computational Linguistics, CCL 2015 and 3rd International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, NLP-NABD 2015
Y2 - 13 November 2015 through 14 November 2015
ER -