Discovering interesting motif-sets for multi-class protein sequence classification

Patrick C.H. Ma, Chun Chung Chan

Research output: Journal article publication > Journal article > Academic research > peer-review

1 Citation (Scopus)

Abstract

In this article, we propose an effective data mining technique for multi-class protein sequence classification. The technique, which can discover discriminative motif-sets for classification, performs its tasks in two phases. In Phase 1, it makes use of a popular motif discovery algorithm called MEME (Multiple Expectation Maximization for Motif Elicitation) to discover a set of highly conserved motifs in each protein family of training sequences. The highly conserved motif-sets discovered in each family may overlap with each other and may therefore not be unique enough to allow them to be used for classification. Phase 2, therefore, makes use of a pattern discovery approach to discover the interesting motif-sets in each protein family that are useful for classification with a single classifier. Based on these motif-sets, the functional family of each independent testing sequence can then be determined. For experimentation, the proposed technique has been tested with different sets of protein sequences. Experimental results show that it outperforms other existing protein sequence classifiers and can effectively classify proteins into their corresponding functional families. In addition, the motif-sets discovered during the training process have been found to be biologically meaningful.
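The two-phase idea in the abstract can be illustrated with a small sketch. This is not the authors' pattern discovery method, which the abstract does not specify. It assumes Phase 1 (for example, a MEME run followed by motif scanning) has already produced, for each training sequence, the set of motif identifiers found in it. The interestingness measure (a log-ratio of in-family frequency to overall frequency), the `min_weight` threshold, the function names, and the toy motif and family labels are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import log


def candidate_sets(motifs, max_size=2):
    """Return all non-empty motif subsets of up to max_size motifs, as frozensets."""
    ordered = sorted(motifs)
    return {
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(ordered, size)
    }


def train(examples, max_size=2, min_weight=0.5):
    """Keep, per family, the motif-sets that are over-represented in that family.

    examples: list of (family, set_of_motif_ids) pairs (assumed Phase 1 output).
    Returns {family: {motif_set: weight}}.
    """
    family_total = defaultdict(int)
    family_count = defaultdict(lambda: defaultdict(int))
    overall_count = defaultdict(int)
    for family, motifs in examples:
        family_total[family] += 1
        for motif_set in candidate_sets(motifs, max_size):
            family_count[family][motif_set] += 1
            overall_count[motif_set] += 1

    n = len(examples)
    model = {}
    for family, counts in family_count.items():
        model[family] = {}
        for motif_set, count in counts.items():
            # Hypothetical interestingness score: log-ratio of the motif-set's
            # frequency inside the family to its frequency over all sequences.
            weight = log((count / family_total[family]) / (overall_count[motif_set] / n))
            if weight >= min_weight:
                model[family][motif_set] = weight
    return model


def classify(model, motifs, max_size=2):
    """Assign the family whose retained motif-sets give the highest total weight."""
    present = candidate_sets(motifs, max_size)
    scores = {
        family: sum(w for motif_set, w in sets.items() if motif_set in present)
        for family, sets in model.items()
    }
    return max(scores, key=scores.get)


# Toy data: motif "m1" occurs in both families, so on its own it is a weak signal.
examples = [
    ("kinase", {"m1", "m2"}),
    ("kinase", {"m1", "m2", "m3"}),
    ("protease", {"m3", "m4"}),
    ("protease", {"m1", "m4"}),
]
model = train(examples)
print(classify(model, {"m1", "m2"}))
print(classify(model, {"m3", "m4"}))
```

In this toy setting, a motif shared across families falls below the threshold and is dropped, while motif-sets concentrated in one family are kept. This mirrors the abstract's point that highly conserved motifs may overlap between families and need a second filtering phase. A sequence that matches no retained motif-set scores zero for every family, so a real classifier would need an explicit fallback for that case.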
Original language: English
Pages (from-to): 733-743
Number of pages: 11
Journal: Journal of Computational Biology
Volume: 17
Issue number: 5
Publication status: Published - 1 May 2010

Keywords

  • Bioinformatics
  • Data mining
  • Motif discovery
  • Multi-class protein sequence classification
  • Pattern discovery

ASJC Scopus subject areas

  • Modelling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics
