Abstract
To classify proteins into functional families based on their primary sequences, algorithms based on k-NN, HMMs, and SVMs are often used. Many of these algorithms require the protein sequences to be properly aligned first. Because the alignment process can be error-prone, classification accuracy may suffer. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which classifies sequences without alignment. UPSEC uses a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and it can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and that the patterns it discovers during the training process can be biologically meaningful.
Original language | English |
---|---|
Pages (from-to) | 431-443 |
Number of pages | 13 |
Journal | Journal of Computational Biology |
Volume | 15 |
Issue number | 4 |
Publication status | Published - 1 May 2008 |
Keywords
- Information theory
- Pattern discovery
- Protein sequence classification
- Residual analysis
- Weight of evidence
ASJC Scopus subject areas
- Modelling and Simulation
- Molecular Biology
- Genetics
- Computational Theory and Mathematics
- Computational Mathematics