Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis

Authors
Tiwaskar S. [1]
Rashid M. [1]
Gokhale P. [1]
Affiliation
[1] Department of Computer Engineering, Faculty of Science and Technology, Vishwakarma University, Pune
Keywords
Diabetes prediction; Machine learning-based imputation techniques; Missing data
DOI
10.1007/s11042-024-19103-0
Abstract
Incomplete data is a prevalent issue in medical datasets, particularly those concerning diabetes. Uncovering valuable patterns through medical data analysis is crucial for early and precise medical predictions, but such analysis depends on data quality and on the proper handling of missing values. Imputation is a robust approach to this challenge. The main goal of this paper is to provide a comprehensive investigation into the effects of Machine Learning (ML) based imputation techniques, specifically K Nearest Neighbor Imputation (KNNI), Multiple Imputation by Chained Equations (MICE), and MissForest. The results of all three techniques are compared with the complete dataset at five missing rates (10% to 50%) and evaluated using four categories of criteria: (1) model performance, (2) imputation error (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R^2)), (3) Pearson correlation analysis, and (4) model selection criteria (Bayesian information criterion (BIC) and Akaike information criterion (AIC)). Model performance comprises the accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (Mcoff) of four ML classifiers: (a) Random Forest (RF), (b) Support Vector Machine (SVM), (c) AdaBoost, and (d) XGBoost (XGB). Across the missing-rate cases, MissForest outperforms KNNI and MICE on accuracy and Mcoff in 80% of cases, precision in 40% of cases, recall in 60% of cases, F1 score, MAE, RMSE, and R^2 in 100% of cases, AIC in 80% of cases, and BIC in 100% of cases. The correlation analysis also confirms that MissForest imputation preserves the associations between variables found in the complete dataset. Overall, our findings suggest that MissForest is a better machine learning-based imputation technique than KNNI and MICE for handling missing data in diabetes research.
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
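
The evaluation loop described in the abstract (mask a complete dataset at several missing rates, impute, then score the imputed cells with MAE, RMSE, and R^2) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic, the missingness is assumed to be completely at random, the five rates are assumed to be spaced in steps of 10%, and `knn_impute` is a simplified hypothetical stand-in for KNNI. MICE and MissForest are not implemented here; either could be dropped in wherever `knn_impute` is called.

```python
import numpy as np


def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its cells set to NaN."""
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def knn_impute(X_missing, k=5):
    """Fill each NaN with the mean of that feature over the k nearest rows.

    Distance is the mean squared difference over features both rows observe.
    (Simplified, hypothetical stand-in for KNNI.)
    """
    X_filled = X_missing.copy()
    observed = ~np.isnan(X_missing)
    col_means = np.nanmean(X_missing, axis=0)  # fallback when no donor exists
    for i in np.where(~observed.all(axis=1))[0]:
        shared = observed & observed[i]  # features observed in both rows
        diff = np.where(shared, X_missing - X_missing[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X_missing), np.inf)
        ok = n_shared > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / n_shared[ok]
        dist[i] = np.inf  # a row is never its own donor
        for j in np.where(~observed[i])[0]:
            donors = np.where(observed[:, j] & np.isfinite(dist))[0]
            if donors.size == 0:
                X_filled[i, j] = col_means[j]
                continue
            nearest = donors[np.argsort(dist[donors])[:k]]
            X_filled[i, j] = X_missing[nearest, j].mean()
    return X_filled


def imputation_errors(X_true, X_imputed, mask):
    """MAE, RMSE and R^2, pooled over the cells that were masked."""
    y, y_hat = X_true[mask], X_imputed[mask]
    resid = y - y_hat
    mae = np.abs(resid).mean()
    rmse = np.sqrt((resid ** 2).mean())
    r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return mae, rmse, r2


rng = np.random.default_rng(0)
# Synthetic stand-in for a complete dataset: 200 rows, 4 correlated features.
latent = rng.normal(size=(200, 1))
X_complete = latent + 0.5 * rng.normal(size=(200, 4))

results = {}
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):  # assumed spacing of the five rates
    X_missing, mask = mask_at_random(X_complete, rate, rng)
    X_imputed = knn_impute(X_missing, k=5)
    results[rate] = imputation_errors(X_complete, X_imputed, mask)
    mae, rmse, r2 = results[rate]
    print(f"missing={rate:.0%}  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Scoring only the masked cells keeps the untouched observed values from diluting the error, which is why the mask is returned alongside the incomplete matrix.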
Pages: 5905-5925
Number of pages: 20