Hybrid Chinese Term Indexing and the 2-Poisson Model

Wing Pong Robert Luk, Kam Fai Wong

Research output: Journal article publicationJournal articleAcademic researchpeer-review

2 Citations (Scopus)

Abstract

Retrieval effectiveness depends on both the retrieval model and how terms are extracted and indexed. For Chinese, Japanese and Korea text, there are no spaces to delimit words. Indexing using hybrid terms (i.e. words and bigrams) was found to be effective and efficient using the 2-Poisson model in NTCIR-III open evaluation workshop. Here, we explore another Okapi weight, BM25, based on the 2-Poisson model and compared their performances with bigram and word indexing strategies. Results show that word indexing is the most efficient in terms of indexing time and storage but hybrid term indexing requires the least amount of retrieval time per query. Without pseudo-relevance feedback (PRF), our BM25 appeared to yield better retrieval effectiveness performance for short queries. With PRF, our implementation of the BM11 weights, which are a simplified version of BM25, with hybrid term indexing remains the most effective combination for retrieval in this study.
Original languageEnglish
Pages (from-to)1745-1752
Number of pages8
JournalIEICE Transactions on Information and Systems
VolumeE86-D
Issue number9
Publication statusPublished - 1 Jan 2003

Keywords

  • 2-Poisson model
  • Chinese information retrieval
  • Evaluation
  • Indexing

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this