Efficient discovery of longest-lasting correlation in sequence databases

Yuhong Li, U. Leong Hou, Man Lung Yiu, Zhiguo Gong

Research output: Journal article publicationJournal articleAcademic researchpeer-review

5 Citations (Scopus)

Abstract

The search for similar subsequences is a core module for various analytical tasks in sequence databases. Typically, the similarity computations require users to set a length. However, there is no robust means by which to define the proper length for different application needs. In this study, we examine a new query that is capable of returning the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful to analyses without prior knowledge regarding the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar lengths and offset by intraobject and interobject grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric of arbitrary length subsequences. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graph processing unit-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
Original languageEnglish
Pages (from-to)767-790
Number of pages24
JournalVLDB Journal
Volume25
Issue number6
DOIs
Publication statusPublished - 1 Dec 2016

Keywords

  • Longest-lasting correlated subsequences
  • Similarity search
  • Time series analysis

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture

Cite this