A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets

Cited by: 8
Authors
Bernardini, Michele [1]
Doinychko, Anastasiia [4]
Romeo, Luca [2]
Frontoni, Emanuele [3]
Amini, Massih-Reza [4]
Affiliations
[1] Univ Politecn Marche, Dept Informat Engn DII, Ancona, Italy
[2] Univ Macerata, Dept Econ & Law, Macerata, Italy
[3] Univ Macerata, Dept Polit Sci Commun & Int Relat, Macerata, Italy
[4] Univ Grenoble Alpes, Grenoble Informat Lab, St Martin Dheres, France
Keywords
Data imputation; Generative Adversarial Network; Electronic Health Record; Machine Learning; Predictive medicine; TIME-SERIES; MODEL;
DOI
10.1016/j.compbiomed.2023.107188
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Missing data is a relevant problem in the Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets contain many missing values, which results in a high level of spatiotemporal sparsity in the predictors' matrix. Several state-of-the-art approaches have tried to deal with this problem by proposing data imputation strategies that (i) are often unrelated to the ML model, (ii) are not conceived for EHR data, where laboratory exams are not prescribed uniformly over time and the percentage of missing values is high, and (iii) exploit only univariate and linear information on the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN-based data imputation approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputation strategy on the observable values and on those that are fully annotated. We demonstrated statistically significant improvements of the ccGAN over other state-of-the-art approaches in terms of imputation (a gain of around 19.79% over the best competitor) and predictive performance (a gain of up to 1.60% over the best competitor) on a real multi-center diabetes dataset. We also demonstrated its robustness across different missingness rates (a gain of up to 1.61% over the best competitor under the highest missingness rate) on an additional benchmark EHR dataset.
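The abstract does not describe the ccGAN architecture, so the following is only a minimal sketch of the masking step that GAN-based imputers generally share: observed entries are kept, and only the missing entries are taken from a generator's output. The function names `missingness_rate` and `fill_missing` are hypothetical, and the generator output is assumed to be supplied as a plain array rather than produced by a trained network.

```python
import numpy as np


def missingness_rate(X):
    """Return the fraction of entries in X that are missing (NaN)."""
    return float(np.isnan(X).mean())


def fill_missing(X, generated):
    """Keep observed entries of X and fill NaN entries from `generated`.

    `generated` stands in for the output of some imputation model
    (assumption: same shape as X, no NaN values).
    """
    observed_mask = ~np.isnan(X)
    return np.where(observed_mask, X, generated)


# Toy predictors' matrix: rows are patients, columns are lab exams.
X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
# Placeholder for a model's output; a real generator would be trained.
generated = np.full(X.shape, 9.0)

X_imputed = fill_missing(X, generated)
```

In this toy case one of the four entries is missing, and only that entry is replaced by the placeholder value; the observed measurements are left untouched.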
Pages: 10
Related papers
50 in total
  • [31] Generative adversarial networks for imputing missing data for big data clinical research
    Weinan Dong
    Daniel Yee Tak Fong
    Jin-sun Yoon
    Eric Yuk Fai Wan
    Laura Elizabeth Bedford
    Eric Ho Man Tang
    Cindy Lo Kuen Lam
    BMC Medical Research Methodology, 21
  • [32] Generative adversarial networks for imputing missing data for big data clinical research
    Dong, Weinan
    Fong, Daniel Yee Tak
    Yoon, Jin-sun
    Wan, Eric Yuk Fai
    Bedford, Laura Elizabeth
    Tang, Eric Ho Man
    Lam, Cindy Lo Kuen
    BMC MEDICAL RESEARCH METHODOLOGY, 2021, 21 (01)
  • [33] Missing Data Imputation Method Combining Random Forest and Generative Adversarial Imputation Network
    Ou, Hongsen
    Yao, Yunan
    He, Yi
    SENSORS, 2024, 24 (04)
  • [34] Conditional Generative Adversarial Networks with Adversarial Attack and Defense for Generative Data Augmentation
    Baek, Francis
    Kim, Daeho
    Park, Somin
    Kim, Hyoungkwan
    Lee, SangHyun
    JOURNAL OF COMPUTING IN CIVIL ENGINEERING, 2022, 36 (03)
  • [35] Missing Slice Imputation in Population CMR Imaging via Conditional Generative Adversarial Nets
    Zhang, Le
    Pereanez, Marco
    Bowles, Christopher
    Piechnik, Stefan
    Neubauer, Stefan
    Petersen, Steffen
    Frangi, Alejandro
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2019, PT II, 2019, 11765 : 651 - 659
  • [36] Self-supervised generative adversarial learning with conditional cyclical constraints towards missing traffic data imputation
    Li, Jinlong
    Li, Ruonan
    Xu, Lunhui
    Liu, Jie
    KNOWLEDGE-BASED SYSTEMS, 2024, 284
  • [37] Conditional Wasserstein Generative Adversarial Networks for Rebalancing Iris Image Datasets
    Li, Yung-Hui
    Aslam, Muhammad Saqlain
    Harfiya, Latifa Nabila
    Chang, Ching-Chun
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (09) : 1450 - 1458
  • [38] Missing Data Imputation for Real Time-series Data in a Steel Industry using Generative Adversarial Networks
    Sarda, Kisan
    Yerudkar, Amol
    Del Vecchio, Carmen
    IECON 2021 - 47TH ANNUAL CONFERENCE OF THE IEEE INDUSTRIAL ELECTRONICS SOCIETY, 2021,
  • [39] LiDAR Data Classification Based on Improved Conditional Generative Adversarial Networks
    Wang, Aili
    Xue, Dong
    Wu, Haibin
    Iwahori, Yuji
    IEEE ACCESS, 2020, 8 : 209674 - 209686
  • [40] Demand Side Data Generating Based on Conditional Generative Adversarial Networks
    Lan, Jian
    Guo, Qinglai
    Sun, Hongbin
    CLEANER ENERGY FOR CLEANER CITIES, 2018, 152 : 1188 - 1193