Data Science for Genomic Data Management: Challenges, Resources, Experiences

Cited by: 0
Authors
Ceri S. [1]
Pinoli P. [1]
Affiliations
[1] Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, Milan
Funding
EU Horizon 2020; European Research Council
Keywords
Data driven genomic computing; Data scarcity; Genomic data science; Genomic datasets;
DOI
10.1007/s42979-019-0005-0
Abstract
We highlight several challenges which are faced by data scientists who use public datasets for solving biological and clinical problems. In spite of the large efforts in building such public datasets, they are dispersed over many sources and heterogeneous for their formats and sequencing/calling techniques, often meeting highly variable quality standards. Moreover, for most research questions, scientists hardly find datasets with enough samples for building and training machine learning models. Data scarcity depends on the complexity of the genomic domain, with its multi-facets, as well as the lack of organic initiatives to provide standardization across communities. In this paper, we discuss our approach to genomic data management, that can strongly improve the problems of data dispersion and format heterogeneity through high-level abstractions for genomics. We briefly present the computational resources that were recently developed by the GeCo project (ERC Advanced Grant); they include GDM, a Genomic Data Model providing interoperability across data formats; GMQL, a genometric query language for answering data science queries over genomic datasets; and an in-house integrated repository providing attribute-based and keyword-based search over normalized metadata from several open data repositories. We describe these resources at work on a specific research question, and we highlight how we managed to produce a model for addressing such specific research question by overcoming the lack of sufficient samples and labelled datasets. © 2019, Springer Nature Singapore Pte Ltd.
Related papers
50 in total
  • [1] BIG DATA CHALLENGES FOR HUMAN RESOURCES MANAGEMENT
    Bara, Adela
    Simonca, Iuliana
    Belciu, Anda
    Nedelcu, Bogdan
    PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON INFORMATICS IN ECONOMY (IE 2015): EDUCATION, RESEARCH & BUSINESS TECHNOLOGIES, 2015, : 364 - 368
  • [2] Challenges and Gaps in Clinical Trial Genomic Data Management
    Asad, Sarah
    Kananen, Kathryn
    Mueller, Kurt R.
    Symmans, W. Fraser
    Wen, Yujia
    Perou, Charles M.
    Blachly, James S.
    Chen, James
    Vincent, Benjamin G.
    Stover, Daniel G.
    JCO CLINICAL CANCER INFORMATICS, 2022, 6 (01):
  • [3] ICSU and the challenges of data and information management for International science
    Fox, Peter
    Harris, Ray
    Data Science Journal, 2013, 12
  • [4] Data-Management for Extreme Science: Experiences in Translational Computer Science Research
    Parashar, Manish
    PROCEEDINGS OF THE 31ST INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2022, 2022, : 3 - 3
  • [5] transPLANT Resources for Triticeae Genomic Data
    Spannagl, Manuel
    Alaux, Michael
    Lange, Matthias
    Bolser, Daniel M.
    Bader, Kai C.
    Letellier, Thomas
    Kimmel, Erik
    Flores, Raphael
    Pommier, Cyril
    Kerhornou, Arnaud
    Walts, Brandon
    Nussbaumer, Thomas
    Grabmuller, Christoph
    Chen, Jinbo
    Colmsee, Christian
    Beier, Sebastian
    Mascher, Martin
    Schmutzer, Thomas
    Arend, Daniel
    Thanki, Anil
    Ramirez-Gonzalez, Ricardo
    Ayling, Martin
    Ayling, Sarah
    Caccamo, Mario
    Mayer, Klaus F. X.
    Scholz, Uwe
    Steinbach, Delphine
    Quesneville, Hadi
    Kersey, Paul J.
    PLANT GENOME, 2016, 9 (01):
  • [6] Xenopus genomic data and browser resources
    Vize, Peter D.
    Zorn, Aaron M.
    DEVELOPMENTAL BIOLOGY, 2017, 426 (02) : 194 - 199
  • [7] Data science on multimedia data: Challenges and applications
    Piccialli, Francesco
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (03) : 3059 - 3059
  • [8] Data science on multimedia data: Challenges and applications
    Multimedia Tools and Applications, 2022, 81 : 3059 - 3059
  • [9] Data and Information for Integrated Water Resources Management (IWRM): Needs and Challenges
    Elfithri, Rahmah
    Mokhtar, Mazlin
    Saad, Nik
    ASIAN JOURNAL OF WATER ENVIRONMENT AND POLLUTION, 2008, 5 (04) : 49 - 57
  • [10] DATA DICTIONARIES - AIDS FOR THE MANAGEMENT OF DATA RESOURCES
    SCHUTT, A
    SCHUTT, D
    WILDGRUBE, E
    ANGEWANDTE INFORMATIK, 1981, (07): : 281 - 285