A Comparison of Chinese Document Indexing Strategies and Retrieval Models

Research output: Journal article publication › Journal article › Academic research › peer-review

20 Citations (Scopus)

Abstract

With the advent of the Internet and intranets, substantial interest is being shown in Asian language information retrieval; especially in Chinese, which is a good example of an Asian ideographic language (other examples include Japanese and Korean). Since, in this type of language, spaces do not delimit words, an important issue is which index terms should be extracted from documents. This issue also has wider implications for indexing other languages such as agglutinating languages (e.g., Finnish and Turkish), archaic ideographic languages like Egyptian hieroglyphs, and other types of information such as data stored in genomic databases. Although comparisons of indexing strategies for Chinese documents have been made, almost all of them are based on a single retrieval model. This article compares the performance of various combinations of indexing strategies (i.e., character, word, short-word, bigram, and Pircs indexing) and retrieval models (i.e., vector space, 2-Poisson, logistic regression, and Pircs models). We determine which model (and its parameters) achieves the (near) best retrieval effectiveness without relevance feedback, and compare it with the open evaluations (i.e., TREC and NTCIR) for both long and title queries. In addition, we describe a more extensive investigation of retrieval efficiency. In particular, the storage cost of word indexing is only slightly more than character indexing, and bigram indexing is about double the storage cost of other indexing strategies. The retrieval time typically varies linearly with the number of unique terms in the query, which is supported by correlation values above 90%. The Pircs retrieval system achieves robust and good retrieval performance, but it appears to be the slowest method, whereas vector space models were not very effective in retrieval, but were able to respond quickly. For robust, near-best retrieval effectiveness, without considering storage overhead, the 2-Poisson model using bigram indexing appears to be a good compromise between retrieval effectiveness and efficiency for both long and title queries.
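
As an illustration of two of the indexing strategies named in the abstract, the following minimal Python sketch extracts character terms and overlapping bigram terms from an unsegmented string. The function names `character_terms` and `bigram_terms` and the sample string are assumptions made for this example; the article's own indexing procedures (including its handling of punctuation, word segmentation, and Pircs indexing) may well differ.

```python
def character_terms(text: str) -> list[str]:
    """Character indexing (illustrative): each non-space character is a term."""
    return [ch for ch in text if not ch.isspace()]


def bigram_terms(text: str) -> list[str]:
    """Bigram indexing (illustrative): every pair of adjacent characters is a term.

    Simplification: whitespace is dropped first, so a bigram may span a space.
    """
    chars = character_terms(text)
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sample = "信息检索"  # "information retrieval", written without word delimiters
print(character_terms(sample))  # ['信', '息', '检', '索']
print(bigram_terms(sample))     # ['信息', '息检', '检索']
```

Neither strategy in this sketch needs a dictionary or a word segmenter, which is what distinguishes them from word and short-word indexing.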
Original language: English
Pages (from-to): 225-268
Number of pages: 44
Journal: ACM Transactions on Asian Language Information Processing
Volume: 1
Issue number: 3
Publication status: Published - 1 Sep 2002

Keywords

  • Algorithms
  • Chinese information retrieval
  • comparison
  • Experimentation
  • indexing strategies
  • Languages
  • Performance

ASJC Scopus subject areas

  • Computer Science (all)
