Automated extraction of lexicon applied both to Chinese and Japanese corpora

Shujing Ke, Chi Keung Simon Shiu, Benjamin Goertzel, Gino Tu Yu, Xiaodong Shi, Changle Zhou

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

Abstract

A novel statistical approach is described that enables the automated extraction of large word lists from unsegmented corpora without reliance on existing dictionaries. The approach makes two main contributions. First, it is generic and has been applied successfully, and separately, to both Chinese and Japanese. Second, it makes no use of punctuation information, so unlike most existing methods it does not need to pre-process the corpora to remove punctuation or to pre-segment the corpora at punctuation marks. Our experiment extracted 14,087 Chinese words and 15,553 Japanese words. Precision is over 80% for two-character Chinese words, over 90% for one-character Japanese words, and over 70% for two-character Japanese words. We also successfully extracted most single-character words, including common Chinese function characters (those meaning "in", "and", "or", the possessive "'s", and "also"), a Chinese family-name character, Japanese hiragana, and punctuation marks such as ','.
Original language: English
Title of host publication: Proceedings - 2012 International Conference on Advanced Computer Science Applications and Technologies, ACSAT 2012
Pages: 7-12
Number of pages: 6
Publication status: Published - 12 Jun 2013
Event: 2012 International Conference on Advanced Computer Science Applications and Technologies, ACSAT 2012 - Kuala Lumpur, Malaysia
Duration: 26 Nov 2012 to 28 Nov 2012

Conference

Conference: 2012 International Conference on Advanced Computer Science Applications and Technologies, ACSAT 2012
Country: Malaysia
City: Kuala Lumpur
Period: 26/11/12 to 28/11/12

Keywords

  • Combination Degree
  • Punctuation
  • Statistics
  • Word extraction

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science Applications
