Abstract
The classical bag-of-words models for information retrieval (IR) fail to capture contextual associations between words. In this article, we propose to investigate pure high-order dependence among a number of words forming an inseparable semantic entity, that is, high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and to novel retrieval models. Specifically, we give two formal definitions of pure dependence: unconditional pure dependence (UPD) and conditional pure dependence (CPD). Deciding UPD and CPD exactly, however, is NP-hard in general. We hence derive and prove sufficient criteria that entail UPD and CPD within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification, text retrieval without query expansion, and text retrieval with query expansion.
Original language | English |
---|---|
Journal | ACM Transactions on Information Systems |
Volume | 31 |
Issue number | 3 |
Publication status | Published - 1 Jul 2013 |
Keywords
- Information geometry
- Pure high-order dependence
- Text classification
- Text retrieval
- Word association
ASJC Scopus subject areas
- Information Systems
- Business, Management and Accounting (all)
- Computer Science Applications