Abstract
The variable bit-block compression (VBC) signature is extended to support document ranking. Two extensions were evaluated experimentally: the weighted VBC (WVBC) scheme and the aggregate VBC (AVBC) scheme. For both, analytical bounds on the additional storage needed for the term frequencies were derived. The upper and lower bounds for WVBC signatures were better than the corresponding bounds for AVBC signatures. In general, these bounds are functions of the word size (in bits) of the term frequencies, so term frequencies were scaled to reduce the word size. Experiments showed that the additional storage cost is closer to the lower bound than to the upper bound for both WVBC and AVBC signatures. In addition, WVBC signatures were better than AVBC signatures in terms of both storage and retrieval speed. Logarithmic scaling was found to be significantly better than linear scaling when the agreement of the document ranking with the unscaled ranking was measured using the Kendall rank-order correlation. If a 75% ranking performance is acceptable, then the additional storage for the term frequencies is only 3.4% of all the indexed documents.
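The comparison of scaling methods described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the functions `linear_scale`, `log_scale`, and `kendall_tau_b`, the specific scaling formulas, the choice of the tau-b variant of the Kendall correlation, and the toy term frequencies are all assumptions made for illustration. Each term frequency is quantized into a small word of `bits` bits, and the ranking induced by the scaled values is compared with the ranking induced by the unscaled values.

```python
import math
from itertools import combinations


def linear_scale(tf, max_tf, bits):
    """Hypothetical linear scaling of tf in [1, max_tf] onto [1, 2**bits - 1]."""
    levels = (1 << bits) - 1
    return max(1, round(tf * levels / max_tf))


def log_scale(tf, max_tf, bits):
    """Hypothetical logarithmic scaling of tf in [1, max_tf] onto [1, 2**bits - 1]."""
    levels = (1 << bits) - 1
    if max_tf <= 1:
        return 1
    return max(1, round(1 + (levels - 1) * math.log(tf) / math.log(max_tf)))


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b variant, which corrects for ties)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (made up): one term frequency per document for a single query term.
tfs = [1, 2, 5, 10, 40, 100]
bits = 2  # assumed word size for the scaled frequencies
max_tf = max(tfs)

linear = [linear_scale(tf, max_tf, bits) for tf in tfs]
logarithmic = [log_scale(tf, max_tf, bits) for tf in tfs]

tau_linear = kendall_tau_b(tfs, linear)
tau_log = kendall_tau_b(tfs, logarithmic)
print(f"linear: {linear} tau={tau_linear:.3f}")
print(f"log:    {logarithmic} tau={tau_log:.3f}")
```

On this skewed toy input, linear scaling collapses most low frequencies into the same level, so it preserves less of the original ordering than logarithmic scaling does. This only illustrates the mechanism and says nothing about the magnitudes reported in the paper.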
Original language | English |
---|---|
Pages (from-to) | 39-51 |
Number of pages | 13 |
Journal | Information Processing and Management |
Volume | 37 |
Issue number | 1 |
Publication status | Published - 1 Jan 2001 |
ASJC Scopus subject areas
- Information Systems
- Media Technology
- Computer Science Applications
- Management Science and Operations Research
- Library and Information Sciences