RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4
Authors
Pallotta, Simone [1]
Cascianelli, Silvia [1]
Masseroli, Marco [1]
Affiliations
[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy
Funding
European Research Council;
Keywords
Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL;
DOI
10.1186/s12859-022-04648-4
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Pages: 28
Related papers
50 in total
  • [21] Cloud Computing and Scientific Applications - Big Data, Scalable Analytics, and Beyond Preface
    Pandey, Suraj
    Nepal, Surya
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2013, 29 (07): 1774-1776
  • [22] Exploring the Feasibility of Heterogeneous Computing of Complex Networks for Big Data Analysis
    Garcia-Robledo, Alberto
    Diaz-Perez, Arturo
    Morales-Luna, Guillermo
    2015 12TH INTERNATIONAL CONFERENCE & EXPO ON EMERGING TECHNOLOGIES FOR A SMARTER WORLD (CEWIT), 2015
  • [23] Editorial: Heterogeneous Computing for AI and Big Data in High Energy Physics
    D'Agostino, Daniele
    Cesini, Daniele
    FRONTIERS IN BIG DATA, 2021, 4
  • [24] Scheduling of Big Data Workflows in the Hadoop Framework with Heterogeneous Computing Cluster
    Rahmani, Amir Masoud
    Chamzini, Ehsan Yazdani
    Pourshaban, Mohsen
    Hosseinzadeh, Mehdi
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024
  • [25] A Scalable Graph Analytics Framework for Programming with Big Data in R (pbdR)
    Hasan, S. M. Shamimul
    Schmidt, Drew
    Kannan, Ramakrishnan
    Imam, Neena
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019: 4783-4792
  • [26] GpemDB: A Scalable Database Architecture with the Multi-omics Entity-relationship Model to Integrate Heterogeneous Big-data for Precise Crop Breeding
    Gong, Liang
    Lou, Qiaojun
    Yu, Chenrui
    Chen, Yunyu
    Hong, Jun
    Wu, Wei
    Fan, Shengzhe
    Chen, Liang
    Liu, Chengliang
    FRONTIERS IN BIOSCIENCE-LANDMARK, 2022, 27 (05)
  • [27] An Efficient and Scalable Framework for Processing Remotely Sensed Big Data in Cloud Computing Environments
    Sun, Jin
    Zhang, Yi
    Wu, Zebin
    Zhu, Yaoqin
    Yin, Xianliang
    Ding, Zhongzheng
    Wei, Zhihui
    Plaza, Javier
    Plaza, Antonio
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2019, 57 (07): 4294-4308
  • [28] FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data
    Shang, Guan-Dong
    Xu, Zhou-Geng
    Wan, Mu-Chun
    Wang, Fu-Xiang
    Wang, Jia-Wei
    BMC GENOMICS, 2022, 23 (SUPPL 1)
  • [29] FindIT2: an R/Bioconductor package to identify influential transcription factor and targets based on multi-omics data
    Guan-Dong Shang
    Zhou-Geng Xu
    Mu-Chun Wan
    Fu-Xiang Wang
    Jia-Wei Wang
    BMC Genomics, 23
  • [30] Next Generation Workload Management System For Big Data on Heterogeneous Distributed Computing
    Klimentov, A.
    Buncic, P.
    De, K.
    Jha, S.
    Maeno, T.
    Mount, R.
    Nilsson, P.
    Oleynik, D.
    Panitkin, S.
    Petrosyan, A.
    Porter, R. J.
    Read, K. F.
    Vaniachine, A.
    Wells, J. C.
    Wenaus, T.
    16TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH (ACAT2014), 2015, 608