Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data

Cited by: 18
Authors
Getz, Kylie [1]
Hubbard, Rebecca A. [2, 3]
Linn, Kristin A. [2, 4]
Affiliations
[1] Rutgers State Univ, Sch Publ Hlth, Dept Biostat & Epidemiol, Piscataway, NJ USA
[2] Univ Penn, Perelman Sch Med, Dept Biostat, Epidemiol & Informat, Philadelphia, PA USA
[3] Univ Penn, Abramson Canc Ctr, Philadelphia, PA USA
[4] 423 Guardian Dr, Philadelphia, PA 19104 USA
Funding
National Institutes of Health (USA)
Keywords
Chained equations; Denoising autoencoders; Electronic health records; Missing data; Multiple imputation; Plasmode simulation; Random forests; MICE
DOI
10.1097/EDE.0000000000001578
Chinese Library Classification number
R1 [Preventive medicine, hygiene]
Subject classification code
1004; 120402
Abstract
Background: Missing data are common in studies using electronic health records (EHRs)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex and potentially missing not at random missingness mechanisms. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even in settings of missing not at random missingness.
Methods: We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare the performance of multiple imputation using chained equations, random forests, and denoising autoencoders in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and missingness mechanisms (missing completely at random, missing at random, and missing not at random).
Results: Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates under missingness completely at random. Under missingness at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias under missingness not at random, with bias increasing in direct proportion to the amount of missing data.
Conclusions: We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
Pages: 206-215
Page count: 10