Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data

Times cited: 18
Authors
Getz, Kylie [1]
Hubbard, Rebecca A. [2, 3]
Linn, Kristin A. [2, 4]
Affiliations
[1] Rutgers State Univ, Sch Publ Hlth, Dept Biostat & Epidemiol, Piscataway, NJ USA
[2] Univ Penn, Perelman Sch Med, Dept Biostat, Epidemiol & Informat, Philadelphia, PA USA
[3] Univ Penn, Abramson Canc Ctr, Philadelphia, PA USA
[4] 423 Guardian Dr, Philadelphia, PA 19104 USA
Funding
US National Institutes of Health (NIH)
Keywords
Chained equations; Denoising autoencoders; Electronic health records; Missing data; Multiple imputation; Plasmode simulation; Random forests; MISSING DATA; RANDOM FOREST; MICE;
DOI
10.1097/EDE.0000000000001578
Chinese Library Classification number
R1 [Preventive medicine, hygiene]
Discipline classification code
1004; 120402
Abstract
Background: Missing data are common in studies using electronic health record (EHR)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex missingness mechanisms that are potentially missing not at random. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even when data are missing not at random.
Methods: We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare the performance of multiple imputation using chained equations, random forests, and denoising autoencoders. Performance was assessed in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and varying missingness mechanisms (missing completely at random, missing at random, and missing not at random).
Results: Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates when data were missing completely at random. When data were missing at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias when data were missing not at random, with bias increasing in direct proportion to the amount of missing data.
Conclusions: We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data, leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by data missing not at random and can produce estimates with spurious precision.
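Illustration (not part of the published abstract): the three missingness mechanisms named in the Methods can be made concrete with a minimal sketch on synthetic Gaussian data. This is an assumption-laden toy, not the plasmode simulation used in the study. All function names, coefficients, and sample sizes are hypothetical, and the imputer is single deterministic regression imputation, a deliberately simpler stand-in for the chained-equation, random forest, and autoencoder imputers the study compares.

```python
import numpy as np


def expit(z):
    """Logistic function, used to turn a linear score into a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def simulate_missingness(n=200_000, seed=0):
    """Draw toy data and one missingness mask per mechanism (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)             # fully observed covariate
    y = 0.8 * x + rng.normal(size=n)   # variable subject to missingness; true mean is 0
    masks = {
        # MCAR: probability of missingness is constant
        "MCAR": rng.random(n) < 0.3,
        # MAR: probability depends only on the observed covariate x
        "MAR": rng.random(n) < expit(-1.0 + 1.5 * x),
        # MNAR: probability depends on the unobserved value y itself
        "MNAR": rng.random(n) < expit(-1.0 + 1.5 * y),
    }
    return x, y, masks


def complete_case_means(y, masks):
    """Mean of y among records where y is observed, per mechanism."""
    return {name: float(y[~miss].mean()) for name, miss in masks.items()}


def regression_imputed_mean(x, y, miss):
    """Fill missing y with a linear prediction from x fitted on observed records."""
    obs = ~miss
    slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
    y_filled = np.where(miss, intercept + slope * x, y)
    return float(y_filled.mean())


if __name__ == "__main__":
    x, y, masks = simulate_missingness()
    print("complete-case means:", complete_case_means(y, masks))
    for name, miss in masks.items():
        print(name, "imputed mean:", regression_imputed_mean(x, y, miss))
```

In this toy setup, conditioning on the observed covariate removes the bias in the estimated mean under missing at random but not under missing not at random, which is in line with the abstract's conclusion that more flexible imputation does not repair bias from data missing not at random.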
Pages: 206-215
Page count: 10
Related papers
50 in total
  • [31] Predicting the Risk of Inpatient Hypoglycemia With Machine Learning Using Electronic Health Records
    Ruan, Yue
    Bellot, Alexis
    Moysova, Zuzana
    Tan, Garry D.
    Lumb, Alistair
    Davies, Jim
    van der Schaar, Mihaela
    Rea, Rustam
    DIABETES CARE, 2020, 43 (07) : 1504 - 1511
  • [32] Using Electronic Health Records and Machine Learning to Predict Incident Psychiatric Hospitalization
    DeFerio, Joseph
    Banerjee, Samprit
    Alexopoulos, George
    Pathak, Jyotishman
    BIOLOGICAL PSYCHIATRY, 2020, 87 (09) : S68 - S69
  • [33] Data imputation on IoT gateways using machine learning
    Franca, Cinthya M.
    Couto, Rodrigo S.
    Velloso, Pedro B.
    2021 19TH MEDITERRANEAN COMMUNICATION AND COMPUTER NETWORKING CONFERENCE (MEDCOMNET), 2021,
  • [34] USING PROPENSITY MATCHING AND IMPUTATION METHODS TO INTEGRATE PATIENT-REPORTED SURVEY DATA WITH ELECTRONIC HEALTH RECORDS IN TYPE 2 DIABETES
    Lee, L. K.
    Liebert, R.
    Gupta, S.
    Flores, N. M.
    Haskell, T.
    VALUE IN HEALTH, 2017, 20 (05) : A322 - A322
  • [35] Interpatient Similarity-based Imputation of Missing Data in Electronic Health Records
    Jazayeri, Ali
    Liang, Ou Stella
    Yang, Christopher C.
    2019 IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI), 2019, : 547 - 549
  • [36] Development and validation of models for detection of postoperative infections using structured electronic health records data and machine learning
    Colborn, Kathryn L.
    Zhuang, Yaxu
    Dyas, Adam R.
    Henderson, William G.
    Madsen, Helen J.
    Bronsert, Michael R.
    Matheny, Michael E.
    Lambert-Kerzner, Anne
    Myers, Quintin W. O.
    Meguid, Robert A.
    SURGERY, 2023, 173 (02) : 464 - 471
  • [37] A machine-learning prediction model to identify risk of firearm injury using electronic health records data
    Zhou, Hui
    Nau, Claudia
    Xie, Fagen
    Contreras, Richard
    Grant, Deborah Ling
    Negriff, Sonya
    Sidell, Margo
    Koebnick, Corinna
    Hechter, Rulin
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (10) : 2173 - 2180
  • [38] DETECTING INCIDENTS OF INJECTION FROM ELECTRONIC MEDICAL RECORDS USING MACHINE LEARNING METHODS
    Okamoto, K.
    Goka, K.
    Hirose, M.
    Yamamoto, T.
    Hiragi, S.
    Yamamoto, G.
    Sugiyama, O.
    Nambu, M.
    Kuroda, T.
    VALUE IN HEALTH, 2018, 21 : S372 - S372
  • [39] Text Classification Model in Chinese Electronic Medical Records Using Machine Learning Methods
    Zhang, Ping
    BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2020, 127 : 123 - 123
  • [40] Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data
    Li, Lin
    Lee, Chuang-Chung
    Zhou, Fang Liz
    Molony, Cliona
    Doder, Zoran
    Zalmover, Evgeny
    Sharma, Kristen
    Juhaeri, Juhaeri
    Wu, Chuntao
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2021, 30 (05) : 610 - 618