Abstract
In this article, we propose an effective data mining technique for multi-class protein sequence classification. The technique, which can discover discriminative motif-sets for classification, performs its tasks in two phases. In Phase 1, it makes use of a popular motif discovery algorithm called MEME (Multiple Expectation Maximization for Motif Elicitation) to discover a set of highly conserved motifs in each protein family of training sequences. The highly conserved motif-sets discovered in each family may overlap with each other and may therefore not be unique enough to allow them to be used for classification. Phase 2, therefore, makes use of a pattern discovery approach to discover the interesting motif-sets in each protein family that are useful for classification with a single classifier. Based on these motif-sets, the functional family of each independent testing sequence can then be determined. For experimentation, the proposed technique has been tested with different sets of protein sequences. Experimental results show that it outperforms other existing protein sequence classifiers and can effectively classify proteins into their corresponding functional families. In addition, the motif-sets discovered during the training process have been found to be biologically meaningful.
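The idea behind Phase 2 can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the pattern discovery measure, so the score used here (support in one family minus the highest support in any other family) is an assumption, as are the function names, the toy sequences, and the simplification of treating motifs as plain substrings rather than MEME position-specific models.

```python
from collections import defaultdict
from itertools import combinations

def motifs_in(sequence, motifs):
    """Return the names of all motifs whose pattern occurs in the sequence.
    Assumption: motifs are plain substrings, not MEME probability matrices."""
    return frozenset(name for name, pattern in motifs.items() if pattern in sequence)

def discriminative_motif_sets(training, motifs, max_size=2):
    """Score motif-sets per family (hypothetical measure, not from the paper):
    score = support in the family - highest support in any other family."""
    families = defaultdict(list)
    for sequence, family in training:
        families[family].append(motifs_in(sequence, motifs))

    # support[family][motif_set] = fraction of the family's sequences
    # that contain every motif in motif_set
    support = {family: defaultdict(float) for family in families}
    for family, present_sets in families.items():
        for present in present_sets:
            for size in range(1, max_size + 1):
                for combo in combinations(sorted(present), size):
                    support[family][frozenset(combo)] += 1.0 / len(present_sets)

    scores = {}
    for family in families:
        scores[family] = {}
        for motif_set, sup in support[family].items():
            other = max(
                (support[o].get(motif_set, 0.0) for o in families if o != family),
                default=0.0,
            )
            if sup - other > 0:  # keep only sets that favour this family
                scores[family][motif_set] = sup - other
    return scores

def classify(sequence, motifs, scores):
    """Assign the family whose discriminative motif-sets best match the sequence."""
    present = motifs_in(sequence, motifs)
    totals = {
        family: sum(s for motif_set, s in motif_sets.items() if motif_set <= present)
        for family, motif_sets in scores.items()
    }
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None

# Toy data (invented for illustration): motif m1 is conserved in both families.
motifs = {"m1": "ACD", "m2": "WYV", "m3": "KLM"}
training = [
    ("GGACDGGKLM", "kinase"),
    ("ACDTTKLMPP", "kinase"),
    ("ACDWYVGG", "protease"),
    ("TTWYVACD", "protease"),
]
scores = discriminative_motif_sets(training, motifs)
prediction = classify("PPACDQQKLM", motifs, scores)
```

In this toy data, `m1` is highly conserved in both families, so on its own it receives no score for either one. Only sets that include `m3` (or `m2`) separate the families, which mirrors the abstract's point that conserved motifs may overlap across families and need a second filtering step.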
Original language | English |
---|---|
Pages (from-to) | 733-743 |
Number of pages | 11 |
Journal | Journal of Computational Biology |
Volume | 17 |
Issue number | 5 |
Publication status | Published - 1 May 2010 |
Keywords
- Bioinformatics
- Data mining
- Motif discovery
- Multi-class protein sequence classification
- Pattern discovery
ASJC Scopus subject areas
- Modelling and Simulation
- Molecular Biology
- Genetics
- Computational Mathematics
- Computational Theory and Mathematics