Generation and evaluation of synthetic patient data

Cited by: 176
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore Natl Lab, 7000 East Ave, Livermore, CA 94550 USA
[2] Informat Management Syst, 1455 Res Blvd,Suite 315, Rockville, MD USA
Funding
National Institutes of Health (USA)
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
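Chinese Library Classification number
R19 [Health care organization and services (health services administration)]
Discipline classification code
Abstract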
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been slower and more limited than in other application domains. A major reason is the limited availability of patient data to the broader ML research community, in large part because of patient privacy protection concerns. High-quality, realistic synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical, and these characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, comprising over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
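As a rough illustration of what a probabilistic generator for categorical records and a simple quality metric can look like, here is a minimal sketch. It assumes the simplest possible model, in which every attribute is sampled independently from its empirical marginal distribution, and it scores the result with a per-attribute total variation distance. Both choices, the toy records, and all function names are assumptions made for this example; the abstract does not state that the paper uses this exact model or metric.

```python
import random
from collections import Counter


def fit_marginals(records):
    """Estimate one categorical distribution per attribute (illustrative helper)."""
    marginals = {}
    for col in records[0].keys():
        counts = Counter(r[col] for r in records)
        total = sum(counts.values())
        marginals[col] = {value: n / total for value, n in counts.items()}
    return marginals


def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic records, sampling each attribute independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, dist in marginals.items():
            values = list(dist.keys())
            weights = list(dist.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


def total_variation(real_marginals, synth_marginals):
    """Per-attribute total variation distance between two sets of marginals."""
    distances = {}
    for col, real in real_marginals.items():
        synth = synth_marginals.get(col, {})
        support = set(real) | set(synth)
        distances[col] = 0.5 * sum(
            abs(real.get(v, 0.0) - synth.get(v, 0.0)) for v in support
        )
    return distances


# Toy records, invented for this example (not SEER data).
real = [
    {"site": "breast", "stage": "I"},
    {"site": "breast", "stage": "II"},
    {"site": "respiratory", "stage": "III"},
    {"site": "respiratory", "stage": "IV"},
    {"site": "non-solid", "stage": "unstaged"},
    {"site": "breast", "stage": "I"},
]

real_marginals = fit_marginals(real)
synthetic = sample_synthetic(real_marginals, n=1000, seed=42)
distances = total_variation(real_marginals, fit_marginals(synthetic))
print(distances)
```

Because each attribute is sampled on its own, this baseline can match marginals closely while ignoring dependencies between attributes (for example, between cancer site and stage), which is one reason richer generators and additional metrics are worth comparing.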
Pages: 40
Related papers
50 items in total
  • [1] Generation and evaluation of synthetic patient data
    Andre Goncalves
    Priyadip Ray
    Braden Soper
    Jennifer Stevens
    Linda Coyle
    Ana Paula Sales
    BMC Medical Research Methodology, 20
  • [2] Generation and evaluation of medical synthetic data
    Goncalves, Andre R.
    Ray, Priyadip
    Soper, Braden
    Myneni, Madhumita
    Stevens, Jennifer L.
    Coyle, Linda M.
    Sales, Ana Paula
    CANCER RESEARCH, 2019, 79 (13)
  • [3] Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets
    Rodriguez-Almeida, Antonio J.
    Fabelo, Himar
    Ortega, Samuel
    Deniz, Alejandro
    Balea-Fernandez, Francisco J.
    Quevedo, Eduardo
    Soguero-Ruiz, Cristina
    Wagner, Ana M.
    Callico, Gustavo M.
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (06) : 2670 - 2680
  • [4] An Evaluation Framework for Synthetic Data Generation Models
    Livieris, I. E.
    Alimpertis, N.
    Domalis, G.
    Tsakalidis, D.
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, PT III, AIAI 2024, 2024, 713 : 320 - 335
  • [5] Synthetic data in medicine: generation, evaluation and limits
    Benani, Alaedine
    Vibert, Julien
    Demuth, Stanislas
    M S-MEDECINE SCIENCES, 2024, 40 (8-9): : 661 - 664
  • [6] Generation and evaluation of privacy preserving synthetic health data
    Yale, Andrew
    Dash, Saloni
    Dutta, Ritik
    Guyon, Isabelle
    Pavao, Adrien
    Bennett, Kristin P.
    NEUROCOMPUTING, 2020, 416 : 244 - 255
  • [7] Survey on Synthetic Data Generation, Evaluation Methods and GANs
    Figueira, Alvaro
    Vaz, Bruno
    MATHEMATICS, 2022, 10 (15)
  • [8] Empirical Evaluation on Synthetic Data Generation with Generative Adversarial Network
    Lu, Pei-Hsuan
    Wang, Pang-Chieh
    Yu, Chia-Mu
    PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, MINING AND SEMANTICS (WIMS 2019), 2019,
  • [9] Evaluation of synthetic data generation for intelligent climate control in greenhouses
    Morales-Garcia, Juan
    Bueno-Crespo, Andres
    Terroso-Saenz, Fernando
    Arcas-Tunez, Francisco
    Martinez-Espana, Raquel
    Cecilia, Jose M.
    APPLIED INTELLIGENCE, 2023, 53 (21) : 24765 - 24781