The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data

Times cited: 0
Authors
Decruyenaere, Alexander [1]
Dehaene, Heidelinde [1]
Rabaey, Paloma [2]
Polet, Christiaan [1]
Decruyenaere, Johan [1]
Vansteelandt, Stijn [3]
Demeester, Thomas [2]
Affiliations
[1] Ghent Univ Hosp, SYNDARA Res Grp, Ghent, Belgium
[2] Univ Ghent, Imec, Ghent, Belgium
[3] Univ Ghent, Ghent, Belgium
Source
UNCERTAINTY IN ARTIFICIAL INTELLIGENCE | 2024 / Volume 244
DOI
Not available
Abstract
Recent advances in generative models facilitate the creation of synthetic data to be made available for research in privacy-sensitive contexts. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, whereby synthetic data are treated as if they were actually observed. Before publishing synthetic data, it is essential to develop statistical inference tools for such data. By means of a simulation study, we show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. Despite the use of a previously proposed correction factor, this problem persists for deep generative models, in part due to slower convergence of estimators and resulting underestimation of the true standard error. We further demonstrate our findings through a case study.
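The inflation of the type 1 error under naive inference can be illustrated with a small simulation. The sketch below is not the authors' study design: it assumes a toy setting in which the original data are standard normal, the "generative model" is simply a normal distribution fitted to the original sample, and the estimand is the mean. The correction factor sqrt(1 + m/n), with n original and m synthetic records, is likewise an assumption, since the abstract does not state which correction was used. The function name simulate_type1_error and all parameter names are hypothetical.

```python
import numpy as np


def simulate_type1_error(n=200, m=200, n_rep=2000, seed=0):
    """Estimate the type 1 error of a z-test on the mean of synthetic data.

    Assumed toy setup (not from the source): original data ~ N(0, 1), so the
    null hypothesis "mean = 0" is true; the synthetic data are drawn from a
    normal distribution fitted to the original sample.
    """
    rng = np.random.default_rng(seed)
    z_crit = 1.959964  # two-sided 5% critical value of the standard normal
    naive_rejections = 0
    corrected_rejections = 0

    for _ in range(n_rep):
        # Original (private) sample; the true mean is 0.
        original = rng.normal(0.0, 1.0, size=n)

        # "Train" the generator: fit a normal distribution to the original data.
        mu_hat = original.mean()
        sd_hat = original.std(ddof=1)

        # Release one synthetic dataset of size m.
        synthetic = rng.normal(mu_hat, sd_hat, size=m)

        # Naive analysis: treat the synthetic data as if they were observed.
        estimate = synthetic.mean()
        se_naive = synthetic.std(ddof=1) / np.sqrt(m)

        # Assumed correction: inflate the naive standard error by sqrt(1 + m/n)
        # to account for the extra variability from fitting the generator.
        se_corrected = se_naive * np.sqrt(1.0 + m / n)

        naive_rejections += int(abs(estimate / se_naive) > z_crit)
        corrected_rejections += int(abs(estimate / se_corrected) > z_crit)

    return naive_rejections / n_rep, corrected_rejections / n_rep


if __name__ == "__main__":
    naive, corrected = simulate_type1_error()
    print(f"naive type 1 error:     {naive:.3f}")
    print(f"corrected type 1 error: {corrected:.3f}")
```

In this toy setting the synthetic mean is unbiased, yet its variance is the sum of the sampling variance of the original data and that of the synthetic draw, so the naive standard error is too small and the test rejects a true null far more often than the nominal 5%. The correction works here because the fitted parametric model converges at the usual root-n rate; the abstract reports that this is not guaranteed for deep generative models, which this sketch does not attempt to reproduce.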
Pages: 966-996
Number of pages: 31