Generation and evaluation of synthetic patient data

Cited by: 176
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore Natl Lab, 7000 East Ave, Livermore, CA 94550 USA
[2] Informat Management Syst, 1455 Res Blvd,Suite 315, Rockville, MD USA
Funding
National Institutes of Health (USA);
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
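Chinese Library Classification number
R19 [Health care organization and services (health services management)];
Discipline classification code
Abstract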
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Pages: 40
Related papers
50 items in total
  • [21] SYNTHETIC PRECIPITATION DATA GENERATION
    ABTEW, W
    MORAS, RG
    CAMPBELL, KL
    COMPUTERS & INDUSTRIAL ENGINEERING, 1990, 19 (1-4) : 582 - 586
  • [22] EVALUATING THE POTENTIAL OF SYNTHETIC PATIENT DATA GENERATION TO ACCELERATE REAL-WORLD EVIDENCE (RWE) GENERATION
    Toernqvist, M.
    Dry, L.
    Pinon, G.
    Movschin, A.
    VALUE IN HEALTH, 2024, 27 (12)
  • [23] Synthetic Video Generation for Evaluation of Sprite Generation
    Chen, Yi
    Ayguen, Ramazan S.
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2010, 1 (02) : 34 - 61
  • [24] Synthetic Patient Perspective Data for the Curation and Evaluation of Rare Disease Patient-Facing Technology
    Nielsen, Emily
    Owen, Tom
    Roach, Matthew
    Dix, Alan
    ARTIFICIAL INTELLIGENCE IN HEALTHCARE, PT II, AIIH 2024, 2024, 14976 : 330 - 343
  • [25] Generation of synthetic CT data using patient specific daily MR image data and image registration
    Kraus, Kim Melanie
    Jaekel, Oliver
    Niebuhr, Nina I.
    Pfaffenberger, Asja
    PHYSICS IN MEDICINE AND BIOLOGY, 2017, 62 (04) : 1358 - 1377
  • [26] Synthetic Data Generation for Data Envelopment Analysis
    Lychev, Andrey V.
    DATA, 2023, 8 (10)
  • [27] Synthetic data generation by probabilistic PCA
    Park, Min-Jeong
    KOREAN JOURNAL OF APPLIED STATISTICS, 2022, 35 (04) : 279 - 294
  • [28] SDG - A system for synthetic data generation
    Azalov, P
    Zlatarova, F
    ITCC 2003: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: COMPUTERS AND COMMUNICATIONS, PROCEEDINGS, 2003 : 69 - 75
  • [29] Synthetic data generation by diffusion models
    Zhu, Jun
    NATIONAL SCIENCE REVIEW, 2024, 11 (08)
  • [30] Synthetic data generation by diffusion models
    Jun Zhu
    National Science Review, 2024, 11 (08) : 19 - 21