Mining pure high-order word associations via information geometry for information retrieval

Yuexian Hou, Xiaozhao Zhao, Dawei Song, Wenjie Li

Research output: Journal article publication, Journal article, Academic research, peer-review

18 Citations (Scopus)

Abstract

The classical bag-of-words models for information retrieval (IR) fail to capture contextual associations between words. In this article, we propose to investigate pure high-order dependence among a number of words forming an inseparable semantic entity, that is, the high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and novel retrieval models. Specifically, two formal definitions of pure dependence are given: unconditional pure dependence (UPD) and conditional pure dependence (CPD). The exact decision on UPD and CPD, however, is NP-hard in general. We hence derive and prove the sufficient criteria that entail UPD and CPD, within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification and text retrieval without and with query expansion.
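
As a rough illustration of dependence that cannot be seen at a lower order, the sketch below builds a toy distribution over three binary variables where Z = X XOR Y. Every pair of variables is statistically independent, yet the three together are not. This is only an assumed textbook example: it does not implement the article's UPD/CPD definitions or its information-geometric criteria, and the names `joint`, `marginal`, and `is_product_of_singles` are hypothetical.

```python
from itertools import product

# Toy joint distribution over three binary variables (hypothetical data):
# X and Y are fair coin flips and Z = X XOR Y, so only 4 of 8 outcomes occur.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}


def marginal(dist, keep):
    """Sum a distribution down to the variable positions listed in `keep`."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def is_product_of_singles(dist, n, tol=1e-12):
    """Return True if `dist` equals the product of its one-variable marginals."""
    singles = [marginal(dist, (i,)) for i in range(n)]
    for outcome in product((0, 1), repeat=n):
        expected = 1.0
        for i, value in enumerate(outcome):
            expected *= singles[i].get((value,), 0.0)
        if abs(dist.get(outcome, 0.0) - expected) > tol:
            return False
    return True


# Check independence for each pair, then for the full triple.
pairs_independent = all(
    is_product_of_singles(marginal(joint, pair), 2)
    for pair in ((0, 1), (0, 2), (1, 2))
)
triple_independent = is_product_of_singles(joint, 3)

print(pairs_independent, triple_independent)
```

Here the pairwise checks all pass while the three-way check fails, which is the sense in which a third-order association can be absent from all second-order statistics.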
Original language: English
Journal: ACM Transactions on Information Systems
Volume: 31
Issue number: 3
Publication status: Published - 1 Jul 2013

Keywords

  • Information geometry
  • Pure high-order dependence
  • Text classification
  • Text retrieval
  • Word association

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting (all)
  • Computer Science Applications
