Generation and evaluation of synthetic patient data

Cited by: 176
Authors
Goncalves, Andre [1]
Ray, Priyadip [1]
Soper, Braden [1]
Stevens, Jennifer [2]
Coyle, Linda [2]
Sales, Ana Paula [1]
Affiliations
[1] Lawrence Livermore Natl Lab, 7000 East Ave, Livermore, CA 94550 USA
[2] Informat Management Syst, 1455 Res Blvd, Suite 315, Rockville, MD USA
Funding
National Institutes of Health (USA);
Keywords
Synthetic data generation; Cancer patient data; Information disclosure; Generative models; PRIVACY; RISK; TEXT;
DOI
10.1186/s12874-020-00977-1
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Abstract
Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.
Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.
Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.
Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Pages: 40
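As a rough illustration of the kind of workflow the abstract describes (fit a generative model to categorical patient records, sample synthetic records, then score them with a quality metric), the following is a minimal sketch. It assumes the simplest possible probabilistic model, in which every column is sampled independently from its observed frequencies, and it uses total variation distance between marginals as the metric. Both choices, along with the toy records and all function names, are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter

def fit_marginals(records):
    """Count how often each category occurs in each column."""
    columns = list(records[0].keys())
    return {col: Counter(r[col] for r in records) for col in columns}

def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic records, sampling every column independently.

    Independence ignores correlations between columns, so this is only
    a baseline-style sketch, not a realistic generator.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic

def marginal_tv_distance(real, synthetic, col):
    """Total variation distance between real and synthetic marginals of one column."""
    p = Counter(r[col] for r in real)
    q = Counter(r[col] for r in synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories)

# Hypothetical toy records; real registry data would have many more columns.
real = [
    {"site": "breast", "stage": "I"},
    {"site": "breast", "stage": "II"},
    {"site": "respiratory", "stage": "III"},
    {"site": "respiratory", "stage": "IV"},
    {"site": "non-solid", "stage": "II"},
    {"site": "breast", "stage": "I"},
]

synthetic = sample_synthetic(fit_marginals(real), n=1000)
for col in ("site", "stage"):
    print(col, round(marginal_tv_distance(real, synthetic, col), 3))
```

A value near 0 means the synthetic marginal closely matches the real one. Because columns are sampled independently, a metric like this says nothing about whether relationships between columns are preserved, nor about disclosure risk.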