Contrastive approach towards text source classification based on top-bag-of-word similarity

Chu-ren Huang, Lung Hao Lee

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

15 Citations (Scopus)

Abstract

This paper proposes a method to automatically classify texts from different varieties of the same language. We show that similarity measures are a robust tool for studying comparable corpora of language variations. We take LDC's Chinese Gigaword Corpus, composed of three varieties of Chinese from Mainland China, Singapore, and Taiwan, as the comparable corpora. Top-bag-of-word similarity measures reflect distances among the three varieties of the same language. A contrastive approach based on top-bag-of-word similarity was taken to solve the text source classification problem. Our results show that a contrastive approach, which uses similarity to rule out identity of source and arrives at the actual source by inference, is more robust than direct confirmation of source by similarity. We show that this approach is robust when applied to other texts.
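
As a rough illustration of the idea in the abstract, the sketch below builds a "top bag of words" for each corpus and compares bags by set overlap. Everything specific here is an assumption rather than the authors' procedure: the function names `top_bag`, `similarity`, and `rule_out`, the cutoff `n`, the use of Jaccard overlap as the similarity measure, and the toy token lists standing in for the Gigaword varieties.

```python
from collections import Counter


def top_bag(tokens, n=100):
    """Return the set of the n most frequent tokens (assumed definition)."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def similarity(bag_a, bag_b):
    """Jaccard overlap of two bags; an assumed stand-in for the paper's measure."""
    union = bag_a | bag_b
    return len(bag_a & bag_b) / len(union) if union else 0.0


def rule_out(text_tokens, corpora, n=100):
    """Return the candidate source least similar to the text, plus all scores.

    Hypothetical helper: it mirrors the contrastive idea of excluding a
    source first instead of confirming one directly.
    """
    text_bag = top_bag(text_tokens, n)
    scores = {
        name: similarity(text_bag, top_bag(tokens, n))
        for name, tokens in corpora.items()
    }
    return min(scores, key=scores.get), scores


# Toy token lists invented for illustration; they are not corpus data.
corpora = {
    "A": "the cat sat on the mat".split(),
    "B": "the dog sat on the log".split(),
    "C": "a bird flew over a hill".split(),
}
text = "the cat sat on the log".split()
excluded, scores = rule_out(text, corpora, n=5)
```

With these toy inputs the text shares no top words with "C", so "C" is the source ruled out. Choosing between the remaining candidates would need a further step that the abstract does not spell out.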
Original language: English
Title of host publication: Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22
Pages: 404-410
Number of pages: 7
Publication status: Published - 1 Dec 2008
Event: 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22 - Cebu, Philippines
Duration: 20 Nov 2008 → 22 Nov 2008

Conference

Conference: 22nd Pacific Asia Conference on Language, Information and Computation, PACLIC 22
Country/Territory: Philippines
City: Cebu
Period: 20/11/08 → 22/11/08

Keywords

  • Chinese gigaword
  • Comparable corpus
  • Contrastive approach
  • Text source classification
  • Top-bag-of-word similarity

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)
  • Information Systems
