A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Cited by: 17
Authors
Chang, Yu-Jung [1]
Chen, Chien-Chih [1, 2]
Chen, Chuen-Liang [2]
Ho, Jan-Ming [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10764, Taiwan
Source
BMC Genomics | 2012 / Vol. 13
Keywords
Sequencing Error; Coverage Depth; Graph Construction; Position Weight Matrix; MapReduce Framework
DOI
10.1186/1471-2164-13-S7-S28
Chinese Library Classification
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. They produce longer reads at greater sequencing depth than earlier technologies. Using these data for de novo genome assembly poses two major challenges. First, physical memory may be insufficient to hold the data structures of the assembly algorithm, even on high-end multicore machines. Second, the graph-theoretical model used to capture intersection relationships among the reads may contain structural defects that existing assembly algorithms do not manage well.
Results: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a given read for sequencing errors and adjusts the edges of the string graph where necessary. CloudBrush was evaluated against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low rate of misassemblies (misjoins and indels of > 5 bp). In addition, we introduced two measures, precision and recall, to assess how faithfully contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm on simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
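The abstract reports N50 as one of its assembly-quality figures. As a reminder of what that statistic measures, here is a minimal sketch using the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name `n50` and the sample lengths are illustrative assumptions and are not taken from CloudBrush or the GAGE evaluation scripts.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (standard definition).

    Hypothetical helper for illustration only; not part of CloudBrush.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths: total is 290, so half is 145.
    # 80 + 70 = 150 >= 145, hence N50 is 70.
    print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

N50 rewards contiguity only, which is presumably why the abstract pairs it with misassembly counts and with precision and recall: a long contig that joins unrelated regions raises N50 while lowering correctness.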
Pages: 17