RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4

Authors:

Pallotta, Simone [1]

Cascianelli, Silvia [1]

Masseroli, Marco [1]

Affiliation:

[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy

Source:

BMC BIOINFORMATICS | 2022 / Vol. 23 / Issue 01

Funding:

European Research Council;

Keywords:

Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL;

DOI:

10.1186/s12859-022-04648-4

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.

Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.

Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.

Pages: 28
