RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4
Authors
Pallotta, Simone [1]
Cascianelli, Silvia [1]
Masseroli, Marco [1]
Affiliations
[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy
Funding
European Research Council
Keywords
Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL
DOI
10.1186/s12859-022-04648-4
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.

Results: We propose RGMQL, an R/Bioconductor package that provides a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different sources, whether local or remote. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by supporting a procedural approach to handling omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.

Conclusions: RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, as a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that illustrate its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Pages: 28