A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

Cited by: 17
Authors
Chang, Yu-Jung [1]
Chen, Chien-Chih [1, 2]
Chen, Chuen-Liang [2]
Ho, Jan-Ming [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10764, Taiwan
Source
BMC GENOMICS | 2012 / Vol. 13
Keywords
Sequencing Error; Coverage Depth; Graph Construction; Position Weight Matrix; MapReduce Framework;
DOI
10.1186/1471-2164-13-S7-S28
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. These sequencers produce longer reads at greater sequencing depth than earlier technologies. Using this technology for de novo genome assembly poses two major challenges. First, the available physical memory may be insufficient to hold the data structure of the assembly algorithm, even on machines with high-end multicore processors. Second, the graph-theoretical model used to capture the overlap (intersection) relationships among reads may contain structural defects that existing assembly algorithms do not handle well.
Results: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a given read for sequencing errors and adjusts the edges of the string graph where necessary. We evaluated CloudBrush against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low rate of misassemblies, i.e., misjoins and indels of > 5 bp. In addition, we introduce two measures, precision and recall, to quantify how faithfully contigs align to the target genome. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm on simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush.
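Note on N50: the abstract reports N50 without defining it. The sketch below illustrates the commonly used definition, assumed here: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The function name `n50` is hypothetical and is not taken from CloudBrush or GAGE, and the paper may use a variant (for example, one normalized by the reference genome size).

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumption: N50 is computed against the total assembled length,
    not against a known reference genome size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
    print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why the abstract pairs it with misassembly counts and with precision and recall.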
Pages: 17