TY - JOUR
T1 - GAAP: Genome-organization-framework-Assisted Assembly Pipeline for prokaryotic genomes
AU - Yuan, Lina
AU - Yu, Yang
AU - Zhu, Yanmin
AU - Li, Yulai
AU - Li, Changqing
AU - Li, Rujiao
AU - Ma, Qin
AU - Siu, Kit Hang
AU - Yu, Jun
AU - Jiang, Taijiao
AU - Xiao, Jingfa
AU - Kang, Yu
PY - 2017
Y1 - 2017/1/25
N2 - Background: Next-generation sequencing (NGS) technologies have greatly advanced the genomic study of prokaryotes. However, the highly fragmented assemblies that result from short NGS reads remain a limiting factor in gaining insight into genome biology. Reference-assisted tools are promising for genome assembly, but tend to produce false assemblies when the assigned reference has extensive rearrangements. Results: Herein, we present GAAP, a genome assembly pipeline for scaffolding based on the core-gene-defined Genome Organizational Framework (cGOF) described in our previous study. Instead of assigning references, we use cGOFs derived from multiple references as indexes to order and orient the scaffolds and build a skeleton structure. We then use read pairs to extend the scaffolds, a step called local scaffolding, and to distinguish true from chimeric adjacencies in the scaffolds. In performance tests using both empirical and simulated data from 15 genomes of six species with diverse genome sizes and complexity, covering all three categories of cGOFs, GAAP outperforms or achieves results comparable to three other reference-assisted programs: AlignGraph, Ragout, and MeDuSa. Conclusions: GAAP uses both cGOFs and paired-end reads to create assemblies at genomic scale, and performs better than currently available reference-assisted assembly tools in that it recovers more assemblies and makes fewer false placements, especially for species with extensively rearranged genomes. Our method is a promising solution for reconstructing genome sequences from short NGS reads.
AB - Background: Next-generation sequencing (NGS) technologies have greatly advanced the genomic study of prokaryotes. However, the highly fragmented assemblies that result from short NGS reads remain a limiting factor in gaining insight into genome biology. Reference-assisted tools are promising for genome assembly, but tend to produce false assemblies when the assigned reference has extensive rearrangements. Results: Herein, we present GAAP, a genome assembly pipeline for scaffolding based on the core-gene-defined Genome Organizational Framework (cGOF) described in our previous study. Instead of assigning references, we use cGOFs derived from multiple references as indexes to order and orient the scaffolds and build a skeleton structure. We then use read pairs to extend the scaffolds, a step called local scaffolding, and to distinguish true from chimeric adjacencies in the scaffolds. In performance tests using both empirical and simulated data from 15 genomes of six species with diverse genome sizes and complexity, covering all three categories of cGOFs, GAAP outperforms or achieves results comparable to three other reference-assisted programs: AlignGraph, Ragout, and MeDuSa. Conclusions: GAAP uses both cGOFs and paired-end reads to create assemblies at genomic scale, and performs better than currently available reference-assisted assembly tools in that it recovers more assemblies and makes fewer false placements, especially for species with extensively rearranged genomes. Our method is a promising solution for reconstructing genome sequences from short NGS reads.
KW - Core-gene-defined Genome Organizational Framework (cGOF)
KW - Prokaryotic genome
KW - Rearrangement
KW - Scaffolding
UR - http://www.scopus.com/inward/record.url?scp=85010950492&partnerID=8YFLogxK
U2 - 10.1186/s12864-016-3267-0
DO - 10.1186/s12864-016-3267-0
M3 - Journal article
C2 - 28198678
SN - 1471-2164
VL - 18
JO - BMC Genomics
JF - BMC Genomics
M1 - 952
ER -