Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data

Cited by: 18
Authors
Getz, Kylie [1]
Hubbard, Rebecca A. [2, 3]
Linn, Kristin A. [2, 4]
Affiliations
[1] Rutgers State Univ, Sch Publ Hlth, Dept Biostat & Epidemiol, Piscataway, NJ USA
[2] Univ Penn, Perelman Sch Med, Dept Biostat, Epidemiol & Informat, Philadelphia, PA USA
[3] Univ Penn, Abramson Canc Ctr, Philadelphia, PA USA
[4] 423 Guardian Dr, Philadelphia, PA 19104 USA
Funding
National Institutes of Health (USA)
Keywords
Chained equations; Denoising autoencoders; Electronic health records; Missing data; Multiple imputation; Plasmode simulation; Random forests; MISSING DATA; RANDOM FOREST; MICE;
DOI
10.1097/EDE.0000000000001578
Chinese Library Classification number
R1 [Preventive Medicine, Hygiene]
Discipline classification code
1004; 120402
Abstract
Background: Missing data are common in studies using electronic health records (EHRs)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex and potentially missing not at random missingness mechanisms. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even in settings of missing not at random missingness. Methods: We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare the performance of multiple imputation using chained equations, random forests, and denoising autoencoders in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and missingness mechanisms (missing completely at random, missing at random, and missing not at random). Results: Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates under missingness completely at random. Under missingness at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias under missingness not at random, with bias increasing in direct proportion to the amount of missing data. Conclusions: We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
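The three missingness mechanisms named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' plasmode simulation: the function name `make_missing`, the logistic form of the missingness model, and the slope of 1.5 are all assumptions chosen only for illustration.

```python
import numpy as np

def make_missing(x, z, mechanism, rate=0.3, seed=0):
    """Return a copy of x with some values set to NaN (hypothetical helper).

    x: variable that receives missing values.
    z: a fully observed covariate.
    mechanism: "MCAR", "MAR", or "MNAR".
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    if mechanism == "MCAR":
        # Missingness depends on nothing in the data.
        driver = np.zeros_like(x)
    elif mechanism == "MAR":
        # Missingness depends only on the observed covariate z.
        driver = (z - z.mean()) / z.std()
    elif mechanism == "MNAR":
        # Missingness depends on the (possibly unobserved) value of x itself.
        driver = (x - x.mean()) / x.std()
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    # Logistic missingness model. The intercept fixes the probability at
    # `rate` when the driver is 0; the slope 1.5 is an arbitrary assumption,
    # so the overall missing fraction equals `rate` exactly only under MCAR.
    logit = np.log(rate / (1 - rate)) + 1.5 * driver
    p_missing = 1.0 / (1.0 + np.exp(-logit))
    out = x.copy()
    out[rng.random(x.shape[0]) < p_missing] = np.nan
    return out

# Toy data: x is correlated with the observed covariate z.
rng = np.random.default_rng(1)
z = rng.normal(size=10_000)
x = 0.5 * z + rng.normal(size=10_000)
for mech in ("MCAR", "MAR", "MNAR"):
    x_miss = make_missing(x, z, mech)
    print(mech, round(np.isnan(x_miss).mean(), 3), round(np.nanmean(x_miss), 3))
```

In this toy setup the observed mean of `x` stays close to the full-data mean under MCAR but shifts under MNAR, because the probability of being missing depends on the value that is lost. An imputation model fitted to the observed values cannot see that dependence, which is in line with the abstract's conclusion that more flexible imputers do not remove bias under missingness not at random.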
Pages: 206-215
Number of pages: 10