Abstract
Ambiguities in the taxonomy-dependent assignment of pyrosequencing reads are usually resolved by mapping each read to the lowest common ancestor in a reference taxonomy of all those sequences that match the read. This conservative approach has the drawback of mapping a read to a possibly large clade that may also contain many sequences not matching the read. A more accurate taxonomic assignment of short reads can be made by mapping each read to the node in the reference taxonomy that provides the best precision and recall. We show that, given a suffix array for the sequences in the reference taxonomy, a short read can be mapped to the node of the reference taxonomy with the best combined value of precision and recall in time linear in the size of the taxonomy subtree rooted at the lowest common ancestor of the matching sequences. An accurate taxonomic assignment of short reads can thus be made with about the same efficiency as when mapping each read to the lowest common ancestor of all matching sequences in a reference taxonomy. We demonstrate the effectiveness of our approach on several metagenomic datasets of marine and gut microbiota.
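The node selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the set of matching sequences is taken as given (the suffix-array lookup is omitted), precision and recall are counted over leaf sequences, and the "combined value" is taken to be the F-measure, 2PR / (P + R). The function name `best_f_node`, the `parent` dictionary encoding, and the toy taxonomy are all hypothetical.

```python
from collections import defaultdict


def best_f_node(parent, matched):
    """Return (node, F) maximizing the F-measure below the LCA of `matched`.

    parent  : dict mapping each node to its parent (the root maps to None)
    matched : non-empty set of leaf names that match the read
    Assumption: precision and recall are counted over leaf sequences.
    """
    if not matched:
        raise ValueError("at least one matching sequence is required")

    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    # Visit every node before its descendants, then walk the list backwards
    # so that each node is processed after all of its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    leaves, hits = {}, {}
    for node in reversed(order):
        kids = children[node]
        if not kids:
            leaves[node] = 1
            hits[node] = 1 if node in matched else 0
        else:
            leaves[node] = sum(leaves[c] for c in kids)
            hits[node] = sum(hits[c] for c in kids)

    total = len(matched)

    # The LCA is the deepest node whose subtree still holds every match.
    lca = root
    while True:
        below = [c for c in children[lca] if hits[c] == total]
        if not below:
            break
        lca = below[0]

    # Scan only the subtree rooted at the LCA, skipping match-free branches.
    best, best_f = lca, -1.0
    stack = [lca]
    while stack:
        node = stack.pop()
        if hits[node] == 0:
            continue
        precision = hits[node] / leaves[node]
        recall = hits[node] / total
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best, best_f = node, f
        stack.extend(children[node])
    return best, best_f


# Hypothetical toy taxonomy: the read matches s1, s2, s3 (under A1) and s4 (under A2).
toy_parent = {
    "root": None, "A": "root", "B": "root", "A1": "A", "A2": "A",
    "s1": "A1", "s2": "A1", "s3": "A1",
    "s4": "A2", "s5": "A2", "s6": "A2", "s7": "A2", "s8": "A2",
    "s9": "B", "s10": "B",
}
node, score = best_f_node(toy_parent, {"s1", "s2", "s3", "s4"})
```

In this toy case the lowest common ancestor of the matches is `A`, which covers eight sequences of which only four match (precision 0.5, recall 1). The sketch instead selects `A1`, which covers three sequences, all matching (precision 1, recall 0.75), giving a higher F-measure of 6/7. Each pass visits a node a constant number of times; the final scan is confined to the subtree rooted at the lowest common ancestor, in the spirit of the linear bound stated in the abstract, although this sketch computes leaf counts over the whole tree for simplicity.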
Original language | English |
---|---|
Title of host publication | Pacific Symposium on Biocomputing 2010, PSB 2010 |
Pages | 3-9 |
Number of pages | 7 |
Publication status | Published - 1 Dec 2010 |
Externally published | Yes |
Event | 15th Pacific Symposium on Biocomputing, PSB 2010 - Kamuela, HI, United States; Duration: 4 Jan 2010 → 8 Jan 2010 |
Conference
Conference | 15th Pacific Symposium on Biocomputing, PSB 2010 |
---|---|
Country/Territory | United States |
City | Kamuela, HI |
Period | 4/01/10 → 8/01/10 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Biomedical Engineering
- General Medicine