UPSEC: An algorithm for classifying unaligned protein sequences into functional families

Patrick C.H. Ma, Chun Chung Chan

Research output: Journal article publication › Review article › Academic research › peer-review

11 Citations (Scopus)

Abstract

To classify proteins into functional families based on their primary sequences, popular algorithms such as the k-NN-, HMM-, and SVM-based algorithms are often used. For many of these algorithms to perform their tasks, protein sequences need to be properly aligned first. Since the alignment process can be error-prone, protein classification may not be performed very accurately. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which can perform its tasks without sequence alignment. UPSEC makes use of a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and the patterns it discovers during the training process can be biologically meaningful.
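
The abstract does not spell out UPSEC's probabilistic measure, so the following is only an illustrative sketch of the general idea of alignment-free, weight-of-evidence classification, not the authors' algorithm. It assumes k-mer presence as the feature, add-one smoothing, and a summed log-ratio score; the function names (`kmers`, `train`, `weight_of_evidence`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=2):
    """Return the set of length-k substrings of a sequence (no alignment needed)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(samples, k=2):
    """Count, in a single pass, how many sequences of each family contain each k-mer.

    samples: iterable of (sequence, family_label) pairs.
    """
    class_totals = Counter()               # family -> number of training sequences
    feature_counts = defaultdict(Counter)  # k-mer -> {family: sequences containing it}
    for seq, label in samples:
        class_totals[label] += 1
        for f in kmers(seq, k):
            feature_counts[f][label] += 1
    return class_totals, feature_counts


def weight_of_evidence(f, label, class_totals, feature_counts):
    """log[P(f | family) / P(f | other families)], with add-one smoothing (an assumption)."""
    n_in = class_totals[label]
    n_out = sum(class_totals.values()) - n_in
    c_in = feature_counts[f][label]
    c_out = sum(feature_counts[f].values()) - c_in
    p_in = (c_in + 1) / (n_in + 2)
    p_out = (c_out + 1) / (n_out + 2)
    return math.log(p_in / p_out)


def classify(seq, class_totals, feature_counts, k=2):
    """Score every family with one shared model and return (best_family, scores)."""
    scores = {}
    for label in class_totals:
        scores[label] = sum(
            weight_of_evidence(f, label, class_totals, feature_counts)
            for f in kmers(seq, k)
            if f in feature_counts  # ignore k-mers never seen in training
        )
    return max(scores, key=scores.get), scores
```

A positive weight means a k-mer is evidence for a family and a negative weight is evidence against it, so both positive and negative training samples contribute to each score. Calling `train` on labelled sequences and then `classify` on a new sequence returns the highest-scoring family together with the per-family scores.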
Original language: English
Pages (from-to): 431-443
Number of pages: 13
Journal: Journal of Computational Biology
Volume: 15
Issue number: 4
Publication status: Published - 1 May 2008

Keywords

  • Information theory
  • Pattern discovery
  • Protein sequence classification
  • Residual analysis
  • Weight of evidence

ASJC Scopus subject areas

  • Modelling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Theory and Mathematics
  • Computational Mathematics
