Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities

Kin On Cheng, Paula Wu, Ngai Fong Law, Wan Chi Siu

Research output: Journal article publication > Journal article > Academic research > peer-review

9 Citations (Scopus)

Abstract

Traditionally, intra-sequence similarity is exploited for compressing a single DNA sequence. Recently, remarkable compression of individual DNA sequences from the same population has been achieved by encoding each sequence's differences from a nearly identical reference sequence. Nevertheless, there is a lack of general algorithms that also allow less similar reference sequences. In this work, we extend the use of intra-sequence similarity to inter-sequence similarity: approximate matches of subsequences are found between the DNA sequence and a set of reference sequences. Hence, a set of nearly identical DNA sequences from the same population, or a set of partially similar DNA sequences such as chromosome sequences and DNA sequences of related species, can be compressed together. For practical compressors, the compressed size is usually influenced by the order in which the sequences are compressed. Fast search algorithms for the optimal compression order are therefore developed for multiple-sequence compression. Experimental results on artificial and real datasets demonstrate that the proposed multiple-sequence compression methods with fast compression order search achieve good compression performance under different levels of similarity among the DNA sequences.
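The dependence of compressed size on compression order can be illustrated with a small sketch. This is not the authors' compressor or their search algorithm: it assumes zlib with a preset dictionary as a stand-in for reference-based coding, and a simple greedy heuristic as a stand-in for the order search. The names `compressed_size`, `greedy_order`, and `mutate` are hypothetical, and the toy sequences are synthetic.

```python
import random
import zlib


def compressed_size(seq, reference=b""):
    """Size in bytes of seq after DEFLATE, optionally primed with reference data.

    Assumption: zlib's preset dictionary stands in for reference-based coding.
    Only the last 32 KiB of the reference fit in the DEFLATE window.
    """
    if reference:
        comp = zlib.compressobj(
            9, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL,
            zlib.Z_DEFAULT_STRATEGY, reference[-32768:],
        )
    else:
        comp = zlib.compressobj(9)
    return len(comp.compress(seq) + comp.flush())


def greedy_order(seqs):
    """Pick, at each step, the sequence that is cheapest given those already coded.

    Illustrative heuristic only; it does not guarantee the optimal order.
    """
    remaining = list(range(len(seqs)))
    order, total, context = [], 0, b""
    while remaining:
        best = min(remaining, key=lambda i: compressed_size(seqs[i], context))
        total += compressed_size(seqs[best], context)
        order.append(best)
        remaining.remove(best)
        context += seqs[best]  # earlier sequences become references for later ones
    return order, total


def mutate(seq, n_edits, rng):
    """Return a copy of seq with n_edits random single-base substitutions."""
    out = bytearray(seq)
    for _ in range(n_edits):
        out[rng.randrange(len(out))] = rng.choice(b"ACGT")
    return bytes(out)


# Synthetic toy data: one random sequence and two near-identical variants.
rng = random.Random(0)
base = bytes(rng.choice(b"ACGT") for _ in range(2000))
seqs = [base, mutate(base, 10, rng), mutate(base, 20, rng)]

order, total = greedy_order(seqs)
independent = sum(compressed_size(s) for s in seqs)
print("order:", order, "joint bytes:", total, "independent bytes:", independent)
```

On near-identical inputs like these, coding later sequences against earlier ones should take far fewer bytes than coding each sequence independently. With less similar sequences the gain shrinks and the choice of order matters more.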
Original language: English
Article number: 7047709
Pages (from-to): 1322-1332
Number of pages: 11
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume: 12
Issue number: 6
Publication status: Published - 1 Nov 2015

Keywords

  • Biology and genetics
  • data compaction and compression
  • data dependencies
  • information theory

ASJC Scopus subject areas

  • Biotechnology
  • Genetics
  • Applied Mathematics
