Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS

Cited by: 15
Authors
Loong, Bronwyn [1, 4]
Zaslavsky, Alan M. [2 ]
He, Yulei [2 ]
Harrington, David P. [3 ]
Affiliations
[1] Australian Natl Univ, Res Sch Finance Actuarial Studies & Appl Stat, Canberra, ACT 0200, Australia
[2] Harvard Univ, Sch Med, Dept Hlth Care Policy, Boston, MA 02115 USA
[3] Dana Farber Canc Inst, Dept Biostat & Computat Biol, Boston, MA 02215 USA
[4] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
data confidentiality; data utility; disclosure risk; multiple imputation; synthetic data; MULTIPLE-IMPUTATION; LIKELIHOOD; SELECTION; TESTS;
DOI
10.1002/sim.5841
Chinese Library Classification (CLC) number
Q [Biological sciences];
Subject classification codes
07; 0710; 09;
Abstract
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data, and we discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are best suited to preliminary data analyses. These methods address the requirement to share data in clinical research without compromising confidentiality. Copyright (c) 2013 John Wiley & Sons, Ltd.
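The abstract mentions inferential methods for partially synthetic data without spelling them out. As a minimal sketch, assuming the standard combining rules for partially synthetic data are the ones meant (the record itself does not say), an analyst who fits the same model to each of m synthetic datasets could pool the results as follows. The function name `combine_partially_synthetic` and its inputs are hypothetical and chosen for illustration only.

```python
import math
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Pool results from m partially synthetic datasets.

    estimates: point estimates q_i, one per synthetic dataset
    variances: estimated variances u_i of those point estimates
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(estimates)      # pooled point estimate
    b = variance(estimates)      # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)      # average within-dataset variance

    # Assumed rule for partial synthesis: T = u_bar + b / m.
    # This differs from the missing-data rule, which uses (1 + 1/m) * b.
    total_var = u_bar + b / m

    # Degrees of freedom for the t reference distribution.
    if b > 0:
        df = (m - 1) * (1 + u_bar / (b / m)) ** 2
    else:
        df = math.inf
    return q_bar, total_var, df


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    q, t, df = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(q, t, df)
```

An approximate interval would then be the pooled estimate plus or minus a t quantile (with `df` degrees of freedom) times the square root of the total variance.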
Pages: 4139-4161
Page count: 23