Synthetic observations from deep generative models and binary omics data with limited sample size

Cited by: 5
Authors
Nussberger, Jens [1, 2]
Boesel, Frederic [1, 2]
Lenz, Stefan [1, 2]
Binder, Harald [1, 2]
Hess, Moritz [1, 2]
Affiliations
[1] Univ Freiburg, Fac Med, Inst Med Biometry & Stat, Freiburg, Germany
[2] Univ Freiburg, Med Ctr, Freiburg, Germany
Keywords
generative models; SNP data; benchmarking; synthetic data; data privacy
DOI
10.1093/bib/bbaa226
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as noise removal, imputation, better understanding of underlying patterns, or even data exchange under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, as is relevant for SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as when considering gene expression conditional on SNPs. Recovery of pairwise odds ratios (ORs) is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 300 variables, with a tendency to over-estimate ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of ORs. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
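The abstract names recovery of pairwise odds ratios as the primary performance criterion. The following is a minimal sketch of how such a comparison could be set up, not the authors' implementation. The function names (`pairwise_log_odds_ratios`, `upper_triangle`), the toy data, the 0.5 pseudocount (Haldane-Anscombe correction) and the RMSE summary are all assumptions made for illustration. The "synthetic" data here come from a deliberately poor stand-in that samples each variable independently from its marginal frequency, so it should fail to recover the association between the first two variables.

```python
import numpy as np


def pairwise_log_odds_ratios(x, pseudocount=0.5):
    """Log odds ratio for every pair of binary columns in x (n_samples, n_vars).

    The pseudocount (an assumption of this sketch) is added to every cell of
    each 2x2 table so that empty cells do not produce infinite values.
    """
    x = np.asarray(x, dtype=float)
    n11 = x.T @ x              # both variables equal 1
    n10 = x.T @ (1 - x)        # row variable 1, column variable 0
    n01 = (1 - x).T @ x        # row variable 0, column variable 1
    n00 = (1 - x).T @ (1 - x)  # both variables equal 0
    return (np.log(n11 + pseudocount) + np.log(n00 + pseudocount)
            - np.log(n10 + pseudocount) - np.log(n01 + pseudocount))


def upper_triangle(m):
    """Return the entries above the diagonal, one value per variable pair."""
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]


rng = np.random.default_rng(0)
n = 500

# Toy "original" data: variable 1 copies variable 0 with 10% flips,
# variable 2 is independent of both.
a = rng.integers(0, 2, size=n)
b = np.where(rng.random(n) < 0.1, 1 - a, a)
c = rng.integers(0, 2, size=n)
original = np.column_stack([a, b, c])

# Stand-in for a generative model (hypothetical): sample every variable
# independently from its marginal frequency, ignoring pairwise structure.
synthetic = (rng.random(original.shape) < original.mean(axis=0)).astype(int)

lor_orig = upper_triangle(pairwise_log_odds_ratios(original))
lor_syn = upper_triangle(pairwise_log_odds_ratios(synthetic))

# One possible summary: root mean squared difference of log odds ratios.
rmse = float(np.sqrt(np.mean((lor_orig - lor_syn) ** 2)))
print("log ORs, original: ", np.round(lor_orig, 2))
print("log ORs, synthetic:", np.round(lor_syn, 2))
print("RMSE:", round(rmse, 2))
```

Replacing `synthetic` with samples drawn from a trained VAE, DBM or GAN would give the kind of comparison the abstract describes; working on the log scale keeps over- and under-estimation of ORs symmetric around zero.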
Pages: 12
Related papers
50 in total
  • [1] Deep Generative Models for Synthetic Data: A Survey
    Eigenschink, Peter
    Reutterer, Thomas
    Vamosi, Stefan
    Vamosi, Ralf
    Sun, Chang
    Kalcher, Klaudius
    IEEE ACCESS, 2023, 11 : 47304 - 47320
  • [2] Deep generative models in single-cell omics
    Rivero-Garcia I.
    Torres M.
    Sánchez-Cabo F.
    Computers in Biology and Medicine, 2024, 176
  • [3] Unsupervised Hybrid Deep Generative Models for Photovoltaic Synthetic Data Generation
    de Jesus, Dan A. Rosa
    Mandal, Paras
    Senjyu, Tomonobu
    Kamalasadan, Sukumar
    2021 IEEE POWER & ENERGY SOCIETY GENERAL MEETING (PESGM), 2021
  • [4] Exploring generative deep learning for omics data using log-linear models
    Hess, Moritz
    Hackenberg, Maren
    Binder, Harald
    BIOINFORMATICS, 2020, 36 (20) : 5045 - 5053
  • [5] Dependency-aware deep generative models for multitasking analysis of spatial omics data
    Tian, Tian
    Zhang, Jie
    Lin, Xiang
    Wei, Zhi
    Hakonarson, Hakon
    NATURE METHODS, 2024, 21 (08) : 1501 - 1513
  • [6] A Critical Assessment of Generative Models for Synthetic Data Augmentation on Limited Pneumonia X-ray Data
    Schaudt, Daniel
    Spaete, Christian
    von Schwerin, Reinhold
    Reichert, Manfred
    von Schwerin, Marianne
    Beer, Meinrad
    Kloth, Christopher
    BIOENGINEERING-BASEL, 2023, 10 (12)
  • [7] Applying Deep Generative Neural Networks to Data Augmentation for Consumer Survey Data with a Small Sample Size
    Watanuki, Shinya
    Edo, Katsue
    Miura, Toshihiko
    APPLIED SCIENCES-BASEL, 2024, 14 (19)
  • [8] Synthetic data generation with deep generative models to enhance predictive tasks in trading strategies
    Carvajal-Patino, Daniel
    Ramos-Pollan, Raul
    RESEARCH IN INTERNATIONAL BUSINESS AND FINANCE, 2022, 62
  • [9] Synthetic single cell RNA sequencing data from small pilot studies using deep generative models
    Treppner, Martin
    Salas-Bastos, Adrian
    Hess, Moritz
    Lenz, Stefan
    Vogel, Tanja
    Binder, Harald
    SCIENTIFIC REPORTS, 2021, 11 (01)