A method for generating synthetic longitudinal health data

Cited by: 13
Authors
Mosquera, Lucy [1, 2]
El Emam, Khaled [1, 2, 3]
Ding, Lei [4]
Sharma, Vishal [5]
Zhang, Xue Hua [1]
El Kababji, Samer [2]
Carvalho, Chris [6]
Hamilton, Brian [7]
Palfrey, Dan [8]
Kong, Linglong [4]
Jiang, Bei [4]
Eurich, Dean T. [5]
Affiliations
[1] Replica Analytics Ltd, Ottawa, ON, Canada
[2] Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Rd, Ottawa, ON K1J 8L1, Canada
[3] University of Ottawa, School of Epidemiology and Public Health, Ottawa, ON, Canada
[4] University of Alberta, Department of Mathematical and Statistical Sciences, Edmonton, AB, Canada
[5] University of Alberta, School of Public Health, Edmonton, AB, Canada
[6] Health Cities, Edmonton, AB, Canada
[7] BW Hamilton Consulting Inc, Edmonton, AB, Canada
[8] Institute of Health Economics, Edmonton, AB, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Synthetic data; Administrative health data; Data privacy; Data sharing; Utility
DOI
10.1186/s12874-023-01869-w
Chinese Library Classification number
R19 [Health care organization and services (health services administration)]
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative is to share synthetic datasets in which the records do not correspond to real individuals but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals in Alberta Health's administrative health database. We assess how similar the synthetic data is to the real data in two ways: with generic utility assessments of the structure and general patterns in the data, and by recreating on the synthetic data a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of the synthetic dataset. The generic utility assessments used Hellinger distance to quantify the difference in distributions between the real and synthetic datasets: event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1: mean absolute difference 0.0896, sd 0.159; order 2: mean Hellinger distance 0.2195, sd 0.2724), and joint distributions (0.352). Random cohorts generated from the real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064. Together these values indicate small differences between the distributions in the real data and the synthetic data. When a realistic analysis was applied to both datasets, Cox regression achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that the synthetic data produces analytic results similar to those from the real data.
The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that the synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
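
The abstract reports Hellinger distances and a confidence interval overlap without defining them. The following is a minimal Python sketch of how such metrics are commonly computed, not the authors' implementation. It assumes the standard discrete Hellinger distance, and for interval overlap it assumes one widely used definition (the intersection length averaged as a fraction of each interval's length, set to 0 when the intervals are disjoint); the paper may define either differently. The function names and the toy event labels are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(values):
    """Turn a list of categorical values into a dict of category -> relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (dicts of category -> probability).

    Ranges from 0 (identical distributions) to 1 (no shared support).
    """
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2 for c in categories
    )
    return math.sqrt(squared / 2.0)


def ci_overlap(real_ci, synth_ci):
    """Confidence interval overlap under the assumed definition described above.

    Each argument is a (lower, upper) tuple. Returns a value in [0, 1].
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    intersection = min(hi_r, hi_s) - max(lo_r, lo_s)
    if intersection <= 0:
        return 0.0  # assumption: disjoint intervals are scored as zero overlap
    return 0.5 * (intersection / (hi_r - lo_r) + intersection / (hi_s - lo_s))


if __name__ == "__main__":
    # Toy event-type sequences (hypothetical labels, not from the study).
    real_events = ["visit"] * 5 + ["lab"] * 3 + ["drug"] * 2
    synth_events = ["visit"] * 4 + ["lab"] * 4 + ["drug"] * 2
    distance = hellinger(
        empirical_distribution(real_events), empirical_distribution(synth_events)
    )
    print(f"Hellinger distance between event-type distributions: {distance:.4f}")

    # Toy hazard-ratio intervals (hypothetical values).
    print(f"CI overlap: {ci_overlap((1.10, 1.60), (1.20, 1.80)):.2f}")
```

Under these assumptions, a Hellinger distance near 0 and an interval overlap near 1 both indicate that the synthetic data closely tracks the real data.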
Pages: 21