Extracting k most important groups from data efficiently

Man Lung Yiu, Nikos Mamoulis, Vagelis Hristidis

Research output: Journal article publicationJournal articleAcademic researchpeer-review

3 Citations (Scopus)

Abstract

We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
Original languageEnglish
Pages (from-to)289-310
Number of pages22
JournalData and Knowledge Engineering
Volume66
Issue number2
DOIs
Publication statusPublished - 1 Aug 2008
Externally publishedYes

Keywords

  • Optimization and performance

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this