RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4
Authors
Pallotta, Simone [1]
Cascianelli, Silvia [1]
Masseroli, Marco [1]
Affiliations
[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy
Funding
European Research Council
Keywords
Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL;
DOI
10.1186/s12859-022-04648-4
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently located sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by offering a procedural approach to omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Pages: 28
Published in
BMC Bioinformatics, 23