A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Cited by: 17
Authors
Chang, Yu-Jung [1]
Chen, Chien-Chih [1,2]
Chen, Chuen-Liang [2]
Ho, Jan-Ming [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10764, Taiwan
Source
BMC GENOMICS | 2012 / Vol. 13
Keywords
Sequencing Error; Coverage Depth; Graph Construction; Position Weight Matrix; MapReduce Framework;
DOI
10.1186/1471-2164-13-S7-S28
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. These sequencers produce longer reads at greater sequencing depth than previous technologies. Two major challenges arise in using high-throughput technology for de novo genome assembly. First, physical memory may be insufficient to hold the data structures of the assembly algorithm, even on systems with high-end multicore processors. Second, the graph-theoretical model used to capture intersection relationships among the reads may contain structural defects that existing assembly algorithms do not manage well. Results: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a given read for sequencing errors and adjusting the edges of the string graph where necessary. CloudBrush was evaluated against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low rate of misassemblies (misjoins and indels of > 5 bp). In addition, we introduced two measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
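The abstract reports N50 as one of its assembly-quality measures. As a rough illustration only, the sketch below computes N50 under its commonly used definition: the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembled bases. The function name `n50` and the example lengths are hypothetical and are not taken from CloudBrush or the GAGE scripts, which may use a variant of this definition (for example, one normalized by genome size).

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Assumption: standard N50, computed against the total assembled
    length rather than a reference genome size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length
        # covers at least half of all assembled bases.
        if 2 * running >= total:
            return length
    return 0  # no contigs supplied


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, for illustration only.
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(example))
```

Sorting once and scanning keeps the computation at O(n log n), which is adequate for typical contig counts.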
Pages: 17