Data Science for Genomic Data Management: Challenges, Resources, Experiences

Authors
Ceri S. [1]
Pinoli P. [1]
Affiliation
[1] Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, Milan
Funding
EU Horizon 2020; European Research Council
Keywords
Data driven genomic computing; Data scarcity; Genomic data science; Genomic datasets
DOI
10.1007/s42979-019-0005-0
Abstract
We highlight several challenges faced by data scientists who use public datasets to solve biological and clinical problems. Despite the large efforts invested in building such public datasets, they are dispersed over many sources, heterogeneous in their formats and sequencing/calling techniques, and often of highly variable quality. Moreover, for most research questions, scientists can hardly find datasets with enough samples for building and training machine learning models. Data scarcity depends on the complexity of the genomic domain, with its many facets, as well as on the lack of organic initiatives to provide standardization across communities. In this paper, we discuss our approach to genomic data management, which can strongly mitigate the problems of data dispersion and format heterogeneity through high-level abstractions for genomics. We briefly present the computational resources recently developed by the GeCo project (ERC Advanced Grant); they include GDM, a Genomic Data Model providing interoperability across data formats; GMQL, a genometric query language for answering data science queries over genomic datasets; and an in-house integrated repository providing attribute-based and keyword-based search over normalized metadata from several open data repositories. We describe these resources at work on a specific research question, and we highlight how we managed to produce a model for addressing that question by overcoming the lack of sufficient samples and labelled datasets. © 2019, Springer Nature Singapore Pte Ltd.