Read Mapping and Transcript Assembly: A Scalable and High-Throughput Workflow for the Processing and Analysis of Ribonucleic Acid Sequencing Data

Cited by: 15
Authors
Peri, Sateesh [1]
Roberts, Sarah [2]
Kreko, Isabella R. [3]
McHan, Lauren B. [3]
Naron, Alexandra [3]
Ram, Archana [3]
Murphy, Rebecca L. [4]
Lyons, Eric [1, 2]
Gregory, Brian D. [5]
Devisetty, Upendra K. [2]
Nelson, Andrew D. L. [6]
Affiliations
[1] Univ Arizona, Genet Grad Interdisciplinary Grp, Tucson, AZ USA
[2] Univ Arizona, CyVerse, Tucson, AZ USA
[3] Univ Arizona, Sch Plant Sci, LIVE For Plants Summer Res Program, Tucson, AZ USA
[4] Centenary Coll Louisiana, Biol Dept, Shreveport, LA USA
[5] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA
[6] Cornell Univ, Boyce Thompson Inst, Ithaca, NY 14850 USA
Funding
National Science Foundation (USA);
Keywords
RNA-seq; transcriptomics; high throughput (-omics) techniques; bioinformatics; workflow; expression analysis; Arabidopsis; CoGe;
DOI
10.3389/fgene.2019.01361
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows, to make them more efficient and interoperable, and to make them capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files) we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to utilize the Discovery Environment for data management, parallelization, and job submission to the OSG, and then to employ the OSG for distributed, high-throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists of any skill level who are interested in rapidly and reproducibly analyzing their large RNA-seq datasets.
Pages: 9