Synthetic observations from deep generative models and binary omics data with limited sample size

Cited by: 5
Authors
Nussberger, Jens [1, 2]
Boesel, Frederic [1, 2]
Lenz, Stefan [1, 2]
Binder, Harald [1, 2]
Hess, Moritz [1, 2]
Affiliations
[1] Univ Freiburg, Fac Med, Inst Med Biometry & Stat, Freiburg, Germany
[2] Univ Freiburg, Med Ctr, Freiburg, Germany
Keywords
generative models; SNP data; benchmarking; synthetic data; data privacy
DOI
10.1093/bib/bbaa226
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Synthetic observations are then obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as noise removal, imputation, better understanding of underlying patterns, or even exchanging data under privacy constraints. Yet it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, as is relevant, e.g., for SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as modeling gene expression conditional on SNPs. Recovery of pairwise odds ratios (ORs) is considered as the primary performance criterion. For simulated as well as real SNP data, we observe that DBMs can generally recover structure for up to 300 variables, with a tendency to over-estimate ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of ORs. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
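To make the performance criterion concrete, the following is a minimal sketch of how recovery of pairwise odds ratios could be checked for binary data. It is not the authors' implementation: the function names `pairwise_log_odds_ratios` and `compare_log_ors`, the pseudocount of 0.5 (a Haldane-Anscombe style correction for empty cells), and the use of a through-origin slope as a summary are all assumptions made for illustration.

```python
import numpy as np


def pairwise_log_odds_ratios(x, pseudocount=0.5):
    """Return a (p, p) matrix of log odds ratios for a binary (n, p) array.

    The pseudocount is an illustrative assumption; it keeps the log finite
    when a cell of the 2x2 table is empty, which is common at small n.
    """
    x = np.asarray(x, dtype=float)
    y = 1.0 - x
    # 2x2 contingency counts for every pair of variables (j, k).
    n11 = x.T @ x + pseudocount  # both variables equal 1
    n10 = x.T @ y + pseudocount  # j equals 1, k equals 0
    n01 = y.T @ x + pseudocount  # j equals 0, k equals 1
    n00 = y.T @ y + pseudocount  # both variables equal 0
    log_or = np.log(n11 * n00) - np.log(n10 * n01)
    np.fill_diagonal(log_or, np.nan)  # a variable paired with itself is not meaningful
    return log_or


def compare_log_ors(real, synthetic):
    """Summarize how well synthetic data reproduce the log ORs of real data."""
    lr = pairwise_log_odds_ratios(real)
    ls = pairwise_log_odds_ratios(synthetic)
    iu = np.triu_indices(lr.shape[0], k=1)  # each unordered pair once
    a, b = lr[iu], ls[iu]
    # Slope of a regression through the origin of synthetic on real log ORs.
    slope = float(np.sum(a * b) / np.sum(a * a))
    corr = float(np.corrcoef(a, b)[0, 1])
    return {"slope": slope, "correlation": corr}
```

Under these assumptions, a slope clearly below 1 with a high correlation would correspond to the pattern described for VAEs (direction and relative strength right, ORs under-estimated), while a slope above 1 would indicate over-estimation of the kind described for DBMs that are not carefully tuned.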
Pages: 12
Related papers
50 entries in total
  • [41] Disease variant prediction with deep generative models of evolutionary data
    Jonathan Frazer
    Pascal Notin
    Mafalda Dias
    Aidan Gomez
    Joseph K. Min
    Kelly Brock
    Yarin Gal
    Debora S. Marks
    Nature, 2021, 599 : 91 - 95
  • [42] Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints
    Young, Halley
    Du, Maxwell
    Bastani, Osbert
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [43] Adversarial Attacks Against Deep Generative Models on Data: A Survey
    Sun, Hui
    Zhu, Tianqing
    Zhang, Zhiqiu
    Jin, Dawei
    Xiong, Ping
    Zhou, Wanlei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (04) : 3367 - 3388
  • [44] Oversampling Tabular Data with Deep Generative Models: Is it worth the effort?
    Camino, Ramiro D.
    State, Radu
    Hammerschmidt, Christian A.
    NEURIPS WORKSHOPS, 2020, 2020, 137 : 148 - 157
  • [45] Synthetic Design of Overlapping Genes Using Deep Generative Models of Protein Sequences
    Byeon, Gun Woo
    Goy, Marc Exposit
    Baker, David
    Seelig, Georg
    PROTEIN SCIENCE, 2024, 33 : 199 - 200
  • [46] Generative Models for Synthetic Urban Mobility Data: A Systematic Literature Review
    Kapp, Alexandra
    Hansmeyer, Julia
    Mihaljevic, Helena
    ACM COMPUTING SURVEYS, 2024, 56 (04)
  • [47] Generating Synthetic Tabular Data for DDoS Detection Using Generative Models
    Saka, Samed
    Al-Ataby, Ali
    Selis, Valerio
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 1436 - 1442
  • [48] Continual Learning of Generative Models With Limited Data: From Wasserstein-1 Barycenter to Adaptive Coalescence
    Dedeoglu, Mehmet
    Lin, Sen
    Zhang, Zhaofeng
    Zhang, Junshan
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (09) : 12042 - 12056
  • [49] Invisible Threats in the Data: A Study on Data Poisoning Attacks in Deep Generative Models
    Yang, Ziying
    Zhang, Jie
    Wang, Wei
    Li, Huan
    APPLIED SCIENCES-BASEL, 2024, 14 (19):
  • [50] Handling Ill-Conditioned Omics Data With Deep Probabilistic Models
    Martinez-Garcia, Maria
    Olmos, Pablo
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (09) : 4601 - 4610