RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 4
Authors
Pallotta, Simone [1]
Cascianelli, Silvia [1]
Masseroli, Marco [1]
Affiliations
[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy
Funding
European Research Council
Keywords
Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL
DOI
10.1186/s12859-022-04648-4
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to important and still unsolved biomedical questions. Their integration and processing are crucial mainly for tertiary analysis of Next Generation Sequencing data, yet suitable big data strategies still mainly address primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach to dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that illustrate its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while combining and analyzing heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Pages: 28