GemSIM: general, error-model based simulator of next-generation sequencing data

Cited by: 114
Authors
McElroy, Kerensa E. [1 ,2 ,3 ]
Luciani, Fabio [3 ]
Thomas, Torsten [1 ,2 ]
Affiliations
[1] UNSW, Ctr Marine Bioinnovat, Sydney, NSW 2052, Australia
[2] UNSW, Sch Biotechnol & Biomol Sci, Sydney, NSW 2052, Australia
[3] Univ New S Wales, Sch Med Sci, Inflammat & Infect Res Grp, Sydney, NSW 2052, Australia
Source
BMC GENOMICS | 2012 / Vol. 13
Funding
Medical Research Council (UK); National Health and Medical Research Council (Australia);
Keywords
QUALITY; ACCURACY; FORMAT;
DOI
10.1186/1471-2164-13-74
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.
Results: We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 data than in Illumina data, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs.
Conclusions: Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
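The idea of an empirically derived, sequence-context based error model can be illustrated with a minimal Python sketch. This is not GemSIM's implementation or interface: the function names `build_error_model` and `simulate_read`, the context length `k`, and the input format (pre-aligned reference/read string pairs of equal length) are all assumptions made for illustration, and the sketch covers substitutions only, ignoring indels, quality scores, read position, and fragment lengths.

```python
import random
from collections import defaultdict


def build_error_model(pairs, k=2):
    """Tally observed read bases, conditioned on reference context.

    `pairs` is an iterable of (reference, read) strings that are already
    aligned base-for-base (an assumption; no indels are handled).
    The context is the k preceding reference bases plus the current base.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ref, read in pairs:
        for i in range(k, min(len(ref), len(read))):
            context = ref[i - k:i + 1]
            counts[context][read[i]] += 1
    return {ctx: dict(obs) for ctx, obs in counts.items()}


def simulate_read(ref, model, rng, k=2):
    """Emit a read from `ref`, sampling each base from the empirical counts.

    The first k bases have no full context and are copied unchanged, as are
    bases whose context was never observed when the model was built.
    """
    out = list(ref[:k])
    for i in range(k, len(ref)):
        observed = model.get(ref[i - k:i + 1])
        if not observed:
            out.append(ref[i])
            continue
        bases = list(observed)
        weights = [observed[b] for b in bases]
        out.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(out)


if __name__ == "__main__":
    # Toy training data: one clean read and one with a G->T substitution.
    training = [("ACGTACGT", "ACGTACGT"), ("ACGTACGT", "ACTTACGT")]
    model = build_error_model(training)
    rng = random.Random(0)  # seeded so repeated runs draw the same reads
    print(simulate_read("ACGTACGT", model, rng))
```

Because the substitution probabilities are tabulated per context from one set of alignments, fitting the model to a different run would yield a different error profile, which is the property the abstract relies on when comparing runs.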
Pages: 9