PGen: large-scale genomic variations analysis workflow and browser in SoyKB

Cited by: 20
Authors
Liu, Yang [1 ,2 ]
Khan, Saad M. [1 ,2 ]
Wang, Juexin [2 ,3 ]
Rynge, Mats [4 ]
Zhang, Yuanxun [3 ]
Zeng, Shuai [2 ,3 ]
Chen, Shiyuan [2 ,3 ]
dos Santos, Joao V. Maldonado [5 ]
Valliyodan, Babu [5 ,6 ]
Calyam, Prasad P.
Merchant, Nirav [7 ]
Nguyen, Henry T. [5 ,6 ]
Xu, Dong [1 ,2 ,3 ]
Joshi, Trupti [1 ,2 ,3 ,8 ,9 ]
Affiliations
[1] Univ Missouri, Informat Inst, Columbia, MO 65211 USA
[2] Univ Missouri, Christopher S Bond Life Sci Ctr, Columbia, MO 65211 USA
[3] Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA
[4] Univ Southern Calif, Informat Sci Inst, Los Angeles, CA USA
[5] Univ Missouri, Div Plant Sci, Columbia, MO USA
[6] Natl Ctr Soybean Biotechnol, Columbia, MO USA
[7] Univ Arizona, iPlant Collaborat, Tucson, AZ USA
[8] Univ Missouri, Sch Med, Dept Mol Microbiol & Immunol, Columbia, MO 65212 USA
[9] Univ Missouri, Sch Med, Off Res, Columbia, MO 65211 USA
Source
BMC BIOINFORMATICS | 2016 / Vol. 17
Keywords
DISCOVERY;
DOI
10.1186/s12859-016-1227-y
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: With advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of crop germplasm to detect genome-scale genetic variations and to apply that knowledge towards trait improvement. To facilitate large-scale analysis of genomic variations in NGS resequencing data, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and the Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.
Results: We have developed both a Linux version on GitHub (https://github.com/pegasus-isi/PGen-GenomicVariationsWorkflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 SNPs and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. From the same analysis, 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified. SNPs identified using PGen from additional soybean resequencing projects, bringing the total to 500+ soybean germplasm lines, have also been integrated. These SNPs are being utilized for trait improvement using genotype-to-phenotype prediction approaches developed in-house. To make NGS data easy to browse and access, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB, giving soybean researchers easy access to SNP and downstream analysis results.
Conclusion: The PGen workflow has been optimized for efficient analysis of soybean data through thorough testing and validation. This research serves as an example of best practices for developing genomics data analysis workflows that integrate remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.
Pages: 10
Related papers
50 in total
  • [1] PGen: large-scale genomic variations analysis workflow and browser in SoyKB
    Yang Liu
    Saad M. Khan
    Juexin Wang
    Mats Rynge
    Yuanxun Zhang
    Shuai Zeng
    Shiyuan Chen
    Joao V. Maldonado dos Santos
    Babu Valliyodan
    Prasad P. Calyam
    Nirav Merchant
    Henry T. Nguyen
    Dong Xu
    Trupti Joshi
    BMC Bioinformatics, 17
  • [2] A Workflow for Parallel and Distributed Computing of Large-Scale Genomic Data
    Choi, Hyun-Hwa
    Kim, Byoung-Seob
    Ahn, Shin-Young
    Bae, Seung-Jo
    2013 8TH INTERNATIONAL CONFERENCE FOR INTERNET TECHNOLOGY AND SECURED TRANSACTIONS (ICITST), 2013, : 215 - 218
  • [3] On the analysis of large-scale genomic structures
    Nestor Norio Oiwa
    Carla Goldman
    Cell Biochemistry and Biophysics, 2005, 42 : 145 - 165
  • [4] On the analysis of large-scale genomic structures
    Oiwa, NN
    Goldman, C
    CELL BIOCHEMISTRY AND BIOPHYSICS, 2005, 42 (02) : 145 - 165
  • [5] Large-scale genomic analysis of ovarian carcinomas
    Gorringe, Kylie L.
    Campbell, Ian G.
    MOLECULAR ONCOLOGY, 2009, 3 (02) : 157 - 164
  • [6] Accelerating Large-Scale Genomic Analysis with Spark
    Li, Xueqi
    Tan, Guangming
    Zhang, Chunming
    Li, Xu
    Zhang, Zhonghai
    Sun, Ninghui
    2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2016, : 747 - 751
  • [7] Large-scale genomic analysis of Elizabethkingia anophelis
    Andriyanov, Pavel
    Zhurilov, Pavel
    Menshikova, Alena
    Tutrina, Anastasia
    Yashin, Ivan
    Kashina, Daria
    BMC GENOMICS, 2024, 25 (01):
  • [8] A Large-scale Empirical Analysis of Browser Fingerprints Properties for Web Authentication
    Andriamilanto, Nampoina
    Allard, Tristan
    Le Guelvouit, Gaetan
    Garel, Alexandre
    ACM TRANSACTIONS ON THE WEB, 2022, 16 (01)
  • [9] Large-scale genomic and transcriptomic analysis of mycorrhizal fungi
    Kuo, A.
    Kohler, A.
    Grigoriev, I.
    Martin, F.
    PHYTOPATHOLOGY, 2013, 103 (06) : 75 - 75
  • [10] Kernel methods for large-scale genomic data analysis
    Wang, Xuefeng
    Xing, Eric P.
    Schaid, Daniel J.
    BRIEFINGS IN BIOINFORMATICS, 2015, 16 (02) : 183 - 192