Deep Generative Models for Synthetic Data: A Survey

Cited by: 22
Authors
Eigenschink, Peter [1]
Reutterer, Thomas [1]
Vamosi, Stefan [1]
Vamosi, Ralf [1, 2]
Sun, Chang [3]
Kalcher, Klaudius [4]
Affiliations
[1] Vienna Univ Econ & Business, Dept Mkt, A-1020 Vienna, Austria
[2] Vienna Univ Technol, High Performance Comp, A-1040 Vienna, Austria
[3] Maastricht Univ, Inst Data Sci, NL-6200 MD Maastricht, Netherlands
[4] Mostly AI GmbH, A-1030 Vienna, Austria
Keywords
Data models; Synthetic data; Measurement; Biological system modeling; Analytical models; Training data; Medical services; Artificial intelligence; big data; deep learning; generative models; neural networks; synthetic data; privacy; NATURAL-LANGUAGE GENERATION; PREDICTION
DOI
10.1109/ACCESS.2023.3275134
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. As a result, models for synthesizing text data for natural language processing can no longer be readily compared to models for synthesizing health records. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data. The framework is based on five high-level criteria: representativeness, novelty, realism, diversity, and coherence of a synthetic data-set relative to the original data-set, regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data in natural language, speech, and audio processing tasks, whereas novelty and representativeness are more important for healthcare and mobility data. We also find that representativeness is often measured using statistical metrics, realism using human judgement, and novelty using privacy tests.
Pages: 47304-47320
Number of pages: 17
Related papers
50 in total
  • [21] Disease variant prediction with deep generative models of evolutionary data
    Frazer, Jonathan
    Notin, Pascal
    Dias, Mafalda
    Gomez, Aidan
    Min, Joseph K.
    Brock, Kelly
    Gal, Yarin
    Marks, Debora S.
    NATURE, 2021, 599 (7883) : 91 - +
  • [22] Reconstruction of incomplete wildfire data using deep generative models
    Ivek, Tomislav
    Vlah, Domagoj
    EXTREMES, 2023, 26 (02) : 251 - 271
  • [23] Reconstruction of incomplete wildfire data using deep generative models
    Tomislav Ivek
    Domagoj Vlah
    Extremes, 2023, 26 : 251 - 271
  • [24] Deep Generative Models for Data Synthesis and Augmentation in Machine Learning
    Adavala, Kiran Mayee
    Vhatkar, Sangeeta
    Ruprah, Taranpreet Singh
    Bhatia, Sukhwinder Kaur
    Kumar, Vipin
    Sharma, Dharmendra
    Praveen, B. Shyam
    JOURNAL OF ELECTRICAL SYSTEMS, 2024, 20 (03) : 1242 - 1249
  • [25] Assessing Deep Generative Models on Time Series Network Data
    Naveed, Muhammad Haris
    Hashmi, Umair Sajid
    Tajved, Nayab
    Sultan, Neha
    Imran, Ali
    IEEE ACCESS, 2022, 10 : 64601 - 64617
  • [26] Disease variant prediction with deep generative models of evolutionary data
    Jonathan Frazer
    Pascal Notin
    Mafalda Dias
    Aidan Gomez
    Joseph K. Min
    Kelly Brock
    Yarin Gal
    Debora S. Marks
    Nature, 2021, 599 : 91 - 95
  • [27] Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints
    Young, Halley
    Du, Maxwell
    Bastani, Osbert
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [28] Oversampling Tabular Data with Deep Generative Models: Is it worth the effort?
    Camino, Ramiro D.
    State, Radu
    Hammerschmidt, Christian A.
    NEURIPS WORKSHOPS, 2020, 2020, 137 : 148 - 157
  • [29] Synthetic Design of Overlapping Genes Using Deep Generative Models of Protein Sequences
    Byeon, Gun Woo
    Goy, Marc Exposit
    Baker, David
    Seelig, Georg
    PROTEIN SCIENCE, 2024, 33 : 199 - 200
  • [30] Generative Models for Synthetic Urban Mobility Data: A Systematic Literature Review
    Kapp, Alexandra
    Hansmeyer, Julia
    Mihaljevic, Helena
    ACM COMPUTING SURVEYS, 2024, 56 (04)