A phonetic-based approach to Chinese chat text normalization

Yunqing Xia, Kam Fai Wong, Wenjie Li

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

21 Citations (Scopus)

Abstract

Chatting is a popular communication medium on the Internet, via ICQ, chat rooms, etc. Chat language differs from natural language in its anomalous and dynamic nature, which renders conventional NLP tools inapplicable. The dynamic problem is especially troublesome because a static chat language corpus quickly becomes outdated as a representation of contemporary chat language. To address the dynamic problem, we propose phonetic mapping models that represent mappings between chat terms and standard words via phonetic transcription, i.e. Chinese Pinyin in our case. Unlike character mappings, the phonetic mappings can be constructed from an available standard Chinese corpus. To normalize dynamic chat language terms, we extend the source channel model by incorporating the phonetic mapping models. Experimental results show that this method is effective and stable in normalizing dynamic chat language terms.
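The idea of routing a source channel model through a phonetic transcription can be illustrated with a small sketch. This is not the authors' implementation: the probability tables, the function name `normalize`, and the simplification that each standard word has exactly one Pinyin transcription are all assumptions made for illustration. It scores each candidate standard word C for a chat term T as P(T | M) · P(C), where M is the Pinyin of C, and returns the best-scoring candidate.

```python
import math

# Hypothetical toy tables; none of these numbers come from the paper.
# P(C): prior probability of each candidate standard word.
PRIOR = {"拜拜": 0.6, "爸爸": 0.4}

# Pinyin transcription M of each standard word.
# Assumption: one deterministic transcription per word, so P(M | C) = 1.
PINYIN = {"拜拜": "baibai", "爸爸": "baba"}

# P(T | M): probability of observing chat term T given transcription M.
CHAT_GIVEN_PINYIN = {
    ("88", "baibai"): 0.7,
    ("88", "baba"): 0.2,
}


def normalize(chat_term):
    """Return the standard word maximizing P(T | M) * P(C), or None."""
    best_word = None
    best_score = -math.inf
    for word, prior in PRIOR.items():
        transcription = PINYIN[word]
        likelihood = CHAT_GIVEN_PINYIN.get((chat_term, transcription), 0.0)
        if likelihood == 0.0:
            # No phonetic mapping links this chat term to the candidate.
            continue
        score = math.log(likelihood) + math.log(prior)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


print(normalize("88"))
```

Because the chat term is linked to candidates only through their Pinyin, the word-level statistics in this sketch could in principle be estimated from standard text, which is the property the abstract highlights for phonetic mappings.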
Original language: English
Title of host publication: COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Pages: 993-1000
Number of pages: 8
Volume: 1
Publication status: Published - 1 Dec 2006
Event: 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 2006 - Sydney, NSW, Australia
Duration: 17 Jul 2006 - 21 Jul 2006

Conference

Conference: 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 2006
Country: Australia
City: Sydney, NSW
Period: 17/07/06 - 21/07/06

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
