Generative adversarial networks for imputing missing data for big data clinical research

Cited by: 42
Authors
Dong, Weinan [1]
Fong, Daniel Yee Tak [2]
Yoon, Jin-sun [3]
Wan, Eric Yuk Fai [1,4]
Bedford, Laura Elizabeth [1]
Tang, Eric Ho Man [1]
Lam, Cindy Lo Kuen [1]
Affiliations
[1] Univ Hong Kong, Fac Med, Dept Family Med & Primary Care, Hong Kong, Peoples R China
[2] Univ Hong Kong, Fac Med, Sch Nursing, Hong Kong, Peoples R China
[3] Univ Calif Los Angeles, Elect & Comp Engn Dept, Los Angeles, CA USA
[4] Univ Hong Kong, Fac Med, Dept Pharmacol & Pharm, Hong Kong, Peoples R China
Keywords
Generative adversarial network; Missing data imputation; Machine learning; Clinical research; Big data; Multiple imputation
DOI
10.1186/s12874-021-01272-3
Chinese Library Classification (CLC) number
R19 [Health care organization and services (health services administration)]
Abstract
Background: Missing data is a pervasive problem in clinical research. Generative adversarial imputation nets (GAIN), a novel machine learning approach to data imputation, have the potential to impute missing data accurately and efficiently but have not yet been evaluated in large empirical clinical datasets.
Objectives: This study aimed to evaluate the accuracy of GAIN in imputing missing values in large real-world clinical datasets with mixed-type variables. The computational efficiency of GAIN was also evaluated. The performance of GAIN was compared with that of two other commonly used methods, MICE and missForest.
Methods: Two real-world clinical datasets were used. The first came from a cohort study on the long-term outcomes of patients with diabetes (50,000 complete cases), and the second from a cohort study on the effectiveness of a risk assessment and management programme for patients with hypertension (10,000 complete cases). Missing data (missing at random) in independent variables were simulated at two missingness rates (20% and 50%). Imputation accuracy was measured by the normalized root mean square error (NRMSE) between imputed and real values for continuous variables and by the proportion of falsely classified entries (PFC) for categorical variables. Computation time per imputation was recorded for each method. Differences in accuracy between imputation methods were compared using ANOVA or a non-parametric test.
Results: Both missForest and GAIN were more accurate than MICE. GAIN showed accuracy similar to that of missForest when the simulated missingness rate was 20%, but was more accurate when the rate was 50%. GAIN was the most accurate method for imputing skewed continuous and imbalanced categorical variables at both missingness rates. GAIN was also much faster (32 min on a PC) than missForest (1300 min) when the sample size was 50,000.
Conclusion: GAIN imputed missing data in large real-world clinical datasets more accurately than MICE and missForest and was more robust to a high missingness rate (50%). Its high computation speed is an added advantage in big clinical data research. GAIN holds potential as an accurate and efficient method for missing data imputation in future big data clinical research.
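The two accuracy measures named in the Methods can be illustrated with a minimal Python sketch. The abstract does not state how NRMSE is normalized; the version below assumes normalization by the variance of the true values over the masked entries, which is the convention used with missForest, so it may differ from the authors' exact computation. The function names `nrmse` and `pfc` and the `mask` argument (True where a value was artificially removed) are hypothetical and chosen for illustration only.

```python
import math


def nrmse(true_vals, imputed_vals, mask):
    """Normalized RMSE over entries that were set to missing.

    Assumption: the squared error is normalized by the population
    variance of the true values at the masked positions.
    """
    pairs = [(t, i) for t, i, m in zip(true_vals, imputed_vals, mask) if m]
    truths = [t for t, _ in pairs]
    mean_t = sum(truths) / len(truths)
    var_t = sum((t - mean_t) ** 2 for t in truths) / len(truths)
    mse = sum((t - i) ** 2 for t, i in pairs) / len(pairs)
    return math.sqrt(mse / var_t)


def pfc(true_labels, imputed_labels, mask):
    """Proportion of masked categorical entries imputed with the wrong label."""
    pairs = [(t, i) for t, i, m in zip(true_labels, imputed_labels, mask) if m]
    return sum(1 for t, i in pairs if t != i) / len(pairs)
```

Both functions score only the positions where values were removed for the simulation, so entries that were never missing do not dilute the error. For both measures, lower values indicate more accurate imputation.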
Pages: 10
Related papers
50 in total
  • [21] Imputing Structured Missing Values in Spatial Data with Clustered Adversarial Matrix Factorization
    Wang, Qi
    Tan, Pang-Ning
    Zhou, Jiayu
    2018 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2018, : 1284 - 1289
  • [22] Missing Data Imputation for Real Time-series Data in a Steel Industry using Generative Adversarial Networks
    Sarda, Kisan
    Yerudkar, Amol
    Del Vecchio, Carmen
    IECON 2021 - 47TH ANNUAL CONFERENCE OF THE IEEE INDUSTRIAL ELECTRONICS SOCIETY, 2021,
  • [23] Conditional Generative Adversarial Networks with Adversarial Attack and Defense for Generative Data Augmentation
    Baek, Francis
    Kim, Daeho
    Park, Somin
    Kim, Hyoungkwan
    Lee, SangHyun
    JOURNAL OF COMPUTING IN CIVIL ENGINEERING, 2022, 36 (03)
  • [24] Generative Adversarial Networks for Bitcoin Data Augmentation
    Zola, Francesco
    Lukas Bruse, Jan
    Etxeberria Barrio, Xabier
    Galar, Mikel
    Orduna Urrutia, Raul
    2020 2ND CONFERENCE ON BLOCKCHAIN RESEARCH & APPLICATIONS FOR INNOVATIVE NETWORKS AND SERVICES (BRAINS), 2020, : 136 - 143
  • [25] Augmenting data with generative adversarial networks: An overview
    Ljubic, Hrvoje
    Martinovic, Goran
    Volaric, Tomislav
    INTELLIGENT DATA ANALYSIS, 2022, 26 (02) : 361 - 378
  • [26] Training Generative Adversarial Networks with Limited Data
    Karras, Tero
    Aittala, Miika
    Hellsten, Janne
    Laine, Samuli
    Lehtinen, Jaakko
    Aila, Timo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [27] Data Synthesis based on Generative Adversarial Networks
    Park, Noseong
    Mohammadi, Mahmoud
    Gorde, Kshitij
    Jajodia, Sushil
    Park, Hongkyu
    Kim, Youngmin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (10) : 1071 - 1083
  • [28] Data Augmentation with Improved Generative Adversarial Networks
    Shi, Hongjiang
    Wang, Lu
    Ding, Guangtai
    Yang, Fenglei
    Li, Xiaoqiang
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 73 - 78
  • [29] Data Augmentation Powered by Generative Adversarial Networks
    Poka, Karoly Bence
    Szemenyei, Marton
    2020 23RD IEEE INTERNATIONAL SYMPOSIUM ON MEASUREMENT AND CONTROL IN ROBOTICS (ISMCR), 2020,
  • [30] GAGIN: generative adversarial guider imputation network for missing data
    Wang, Wei
    Chai, Yimeng
    Li, Yue
    NEURAL COMPUTING AND APPLICATIONS, 2022, 34 : 7597 - 7610