Generation and evaluation of synthetic patient data

Cited by: 176
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA 94550, USA
[2] Information Management Services, 1455 Research Blvd, Suite 315, Rockville, MD, USA
Funding
National Institutes of Health (USA)
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services administration)]
Abstract
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.
Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.
Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.
Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Pages: 40
Related papers
50 items in total
  • [31] Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
    Morgan Guillaudeux
    Olivia Rousseau
    Julien Petot
    Zineb Bennis
    Charles-Axel Dein
    Thomas Goronflot
    Nicolas Vince
    Sophie Limou
    Matilde Karakachoff
    Matthieu Wargny
    Pierre-Antoine Gourraud
    npj Digital Medicine, 6
  • [32] Synthetic data generation by probabilistic PCA
    Park, Min-Jeong
    KOREAN JOURNAL OF APPLIED STATISTICS, 2023, 36 (04) : 279 - 294
  • [33] Synthetic Data Generation for Statistical Testing
    Soltana, Ghanem
    Sabetzadeh, Mehrdad
    Briand, Lionel C.
    PROCEEDINGS OF THE 2017 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE'17), 2017, : 872 - 882
  • [34] Replicant™ framework for synthetic data generation
    Kenul, Emily
    Black, Margaret
    Massey, Drew
    Havelka, Zachary
    Henkai, Mawia
    Gavin, Kyle
    Shellhorn, Luke
    SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: TOOLS, TECHNIQUES, AND APPLICATIONS II, 2024, 13035
  • [35] Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
    Guillaudeux, Morgan
    Rousseau, Olivia
    Petot, Julien
    Bennis, Zineb
    Dein, Charles-Axel
    Goronflot, Thomas
    Vince, Nicolas
    Limou, Sophie
    Karakachoff, Matilde
    Wargny, Matthieu
    Gourraud, Pierre-Antoine
    NPJ DIGITAL MEDICINE, 2023, 6 (01)
  • [36] GENERATION OF SYNTHETIC MT DATA TRAINS
    VARENTSOV, IM
    SOKOLOVA, EY
    FIZIKA ZEMLI, 1994, (06): : 80 - 88
  • [37] A synthetic fraud data generation methodology
    Lundin, E
    Kvarnström, H
    Jonsson, E
    INFORMATION AND COMMUNICATIONS SECURITY, PROCEEDINGS, 2002, 2513 : 265 - 277
  • [38] Synthetic Social Media Data Generation
    Sagduyu, Yalin E.
    Grushin, Alexander
    Shi, Yi
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2018, 5 (03): : 605 - 620
  • [39] Synthetic Data Generation for the Internet of Things
    Anderson, Jason W.
    Kennedy, K. E.
    Ngo, Linh B.
    Luckow, Andre
    Apon, Amy W.
    2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2014, : 171 - 176
  • [40] Scaling Synthetic Brain Data Generation
    Doan, Mike
    Plis, Sergey
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2025, 29 (02) : 840 - 847