Synthetic observations from deep generative models and binary omics data with limited sample size

Cited by: 5
Authors
Nussberger, Jens [1, 2]
Boesel, Frederic [1, 2]
Lenz, Stefan [1, 2]
Binder, Harald [1, 2]
Hess, Moritz [1, 2]
Affiliations
[1] Univ Freiburg, Fac Med, Inst Med Biometry & Stat, Freiburg, Germany
[2] Univ Freiburg, Med Ctr, Freiburg, Germany
Keywords
generative models; SNP data; benchmarking; synthetic data; data privacy
DOI
10.1093/bib/bbaa226
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Synthetic observations are then obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as noise removal, imputation, better understanding of underlying patterns, or even exchanging data under privacy constraints. Yet it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, as is relevant for SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as modeling gene expression conditional on SNPs. Recovery of pairwise odds ratios (ORs) is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 300 variables, with a tendency to overestimate ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet considerably underestimate ORs. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
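The performance criterion named in the abstract, recovery of pairwise odds ratios, could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names `pairwise_odds_ratios` and `log_or_gap`, the 0.5 pseudocount added to each cell, and the mean absolute log-OR difference used as a summary are all illustrative assumptions.

```python
import numpy as np


def pairwise_odds_ratios(x, pseudocount=0.5):
    """Odds ratio for every pair of columns of a binary (0/1) matrix.

    x has shape (n_samples, n_variables). The pseudocount is an assumed
    smoothing constant that avoids division by zero in sparse 2x2 tables.
    """
    x = np.asarray(x, dtype=float)
    not_x = 1.0 - x
    n11 = x.T @ x          # both variables equal 1
    n10 = x.T @ not_x      # row variable 1, column variable 0
    n01 = not_x.T @ x      # row variable 0, column variable 1
    n00 = not_x.T @ not_x  # both variables equal 0
    c = pseudocount
    return ((n11 + c) * (n00 + c)) / ((n10 + c) * (n01 + c))


def log_or_gap(real, synthetic, pseudocount=0.5):
    """Mean absolute difference of log odds ratios over distinct pairs."""
    real = np.asarray(real)
    synthetic = np.asarray(synthetic)
    upper = np.triu_indices(real.shape[1], k=1)  # each pair once, no diagonal
    log_real = np.log(pairwise_odds_ratios(real, pseudocount))[upper]
    log_syn = np.log(pairwise_odds_ratios(synthetic, pseudocount))[upper]
    return float(np.mean(np.abs(log_real - log_syn)))


# Toy usage: random binary data standing in for real and synthetic SNPs.
rng = np.random.default_rng(0)
real_data = rng.integers(0, 2, size=(100, 5))
synthetic_data = rng.integers(0, 2, size=(100, 5))
print(log_or_gap(real_data, synthetic_data))
```

A value near zero would indicate that the synthetic sample reproduces the pairwise structure of the real data. Comparing the signs of the individual log-OR differences, rather than only their absolute values, would show whether a model tends toward over- or underestimation.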
Pages: 12