Synthetic observations from deep generative models and binary omics data with limited sample size

Cited by: 5
Authors
Nussberger, Jens [1, 2]
Boesel, Frederic [1, 2]
Lenz, Stefan [1, 2]
Binder, Harald [1, 2]
Hess, Moritz [1, 2]
Affiliations
[1] Univ Freiburg, Fac Med, Inst Med Biometry & Stat, Freiburg, Germany
[2] Univ Freiburg, Med Ctr, Freiburg, Germany
Keywords
generative models; SNP data; benchmarking; synthetic data; data privacy
DOI
10.1093/bib/bbaa226
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Synthetic observations are then obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as noise removal, imputation, better understanding of underlying patterns, and even data exchange under privacy constraints. Yet it remains unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, as is relevant for SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, for example modeling gene expression conditional on SNPs. Recovery of pairwise odds ratios (ORs) serves as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs can generally recover structure for up to 300 variables, with a tendency to over-estimate ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet considerably under-estimate ORs. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
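The abstract names recovery of pairwise odds ratios as a primary performance criterion but does not spell out the computation. The following is a minimal sketch of how such a comparison could look, not the authors' implementation. The function names, the 0.5 continuity correction added to each cell of the 2x2 table, and the use of a mean absolute difference of log odds ratios as the summary are all assumptions made for illustration.

```python
from itertools import combinations
from math import log


def pairwise_log_odds_ratios(data, correction=0.5):
    """Return {(i, j): log odds ratio} for every pair of binary columns.

    `data` is a list of equal-length rows containing 0/1 values.
    `correction` is added to each cell of the 2x2 table (an assumption of
    this sketch, not taken from the paper) so that empty cells do not
    cause a division by zero.
    """
    n_vars = len(data[0])
    result = {}
    for i, j in combinations(range(n_vars), 2):
        # Count the four cells of the 2x2 contingency table for columns i, j.
        n11 = n10 = n01 = n00 = 0
        for row in data:
            a, b = row[i], row[j]
            if a and b:
                n11 += 1
            elif a:
                n10 += 1
            elif b:
                n01 += 1
            else:
                n00 += 1
        c = correction
        odds_ratio = ((n11 + c) * (n00 + c)) / ((n10 + c) * (n01 + c))
        result[(i, j)] = log(odds_ratio)
    return result


def mean_abs_log_or_error(real, synthetic):
    """Hypothetical summary: mean |log OR(real) - log OR(synthetic)| over all pairs."""
    real_lor = pairwise_log_odds_ratios(real)
    synth_lor = pairwise_log_odds_ratios(synthetic)
    return sum(abs(real_lor[k] - synth_lor[k]) for k in real_lor) / len(real_lor)
```

Under this sketch, a model that under-estimates ORs (as reported for VAEs) would yield synthetic log odds ratios closer to zero than the real ones, while over-estimation (as reported for untuned DBMs) would push them further from zero.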
Pages: 12