Uniform and effective tagging of a heterogeneous giga-word corpus

Wei Yun Ma, Chu Ren Huang

Research output: Unpublished conference presentation (presented paper, abstract, poster)Conference presentation (not published in journal/proceeding/book)Academic researchpeer-review

14 Citations (Scopus)

Abstract

Tagging as the most crucial annotation of language resources can still be challenging when the corpus size is big and when the corpus data is not homogeneous. The Chinese Gigaword Corpus is confounded by both challenges. The corpus contains roughly 1.12 billion Chinese characters from two heterogeneous sources: respective news in Taiwan and in Mainland China. In other words, in addition to its size, the data also contains two variants of Chinese that are known to exhibit substantial linguistic differences. We utilize Chinese Sketch Engine as the corpus query tool, by which grammar behaviours of the two heterogeneous resources could be captured and displayed in a unified web interface. In this paper, we report our answer to the two challenges to effectively tag this large-scale corpus. The evaluation result shows our mechanism of tagging maintains high annotation quality.

Original languageEnglish
Pages2182-2185
Number of pages4
Publication statusPublished - 1 Jan 2006
Externally publishedYes
Event5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: 22 May 200628 May 2006

Conference

Conference5th International Conference on Language Resources and Evaluation, LREC 2006
CountryItaly
CityGenoa
Period22/05/0628/05/06

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this