SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Cited by: 157
Authors
Dohm, Juliane C.
Lottaz, Claudio
Borodina, Tatiana
Himmelbauer, Heinz
Affiliations
[1] Max Planck Inst Mol Genet, D-14195 Berlin, Germany
[2] Univ Regensburg, Inst Funct Genom Computat Diagnost, D-93053 Regensburg, Germany
DOI
10.1101/gr.6435207
CLC classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating gigabase-pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25-40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only a single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1-4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.
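The abstract reports assembly quality as N50 sizes. As a reading aid, here is a minimal Python sketch of how N50 is commonly computed: the contig length at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembled length. The function name `n50` and the example lengths are hypothetical and are not taken from the paper or from the SHARCGS software.

```python
def n50(contig_lengths):
    """Return the N50 of an iterable of contig lengths (0 if it is empty).

    Assumption: the common definition of N50 is used, i.e. the length of
    the contig at which the cumulative length, summed from the longest
    contig downward, first reaches at least half of the total length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths in bp, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))  # total is 54; 10 + 9 + 8 = 27 reaches half, so this prints 8
```

Under this definition, an N50 of 3.7 kbp means that contigs of at least that length account for at least half of all assembled bases.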
Pages: 1697-1706
Page count: 10