RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4

Authors:

Pallotta, Simone [1]

Cascianelli, Silvia [1]

Masseroli, Marco [1]

Affiliations:

[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy

Source:

BMC BIOINFORMATICS | 2022 / Vol. 23 / Issue 01

Funding:

European Research Council;

Keywords:

Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL;

DOI:

10.1186/s12859-022-04648-4

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.

Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by offering a procedural approach to handling omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.

Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.

Pages: 28
