Impact of machine learning-based imputation techniques on medical datasets- a comparative analysis

Cited by: 0

Authors:

Tiwaskar S. [1]

Rashid M. [1]

Gokhale P. [1]

Affiliations:

[1] Department of Computer Engineering, Faculty of Science and Technology, Vishwakarma University, Pune

Source:

Multimedia Tools and Applications | 2025 / Vol. 84 / Issue 09

Keywords:

Diabetes prediction; Machine learning-based imputation techniques; Missing data

DOI:

10.1007/s11042-024-19103-0

Abstract:

Incomplete data is a prevalent issue in medical datasets, particularly those concerning diabetes. Uncovering valuable patterns through medical data analysis is crucial for early and precise medical predictions, so data quality and the proper handling of missing data are critically important. Imputation is a robust approach to this challenge. The main goal of this paper is to provide a comprehensive investigation of the effects of Machine Learning (ML) based imputation techniques, specifically K Nearest Neighbor Imputation (KNNI), Multiple Imputation by Chained Equations (MICE), and MissForest. The results of all three techniques are compared with the complete dataset for five missing rates (10% to 50%) and evaluated using four categories of criteria: (1) model performance, (2) imputation error rate (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R^2) values), (3) Pearson correlation analysis, and (4) model selection basis (Bayesian information criterion (BIC) and Akaike information criterion (AIC) values). Model performance covers the accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (Mcoff) score of four ML classifiers, namely (a) Random Forest (RF), (b) Support Vector Machine (SVM), (c) AdaBoost, and (d) XGBoost (XGB). Across the missing-rate cases, MissForest is better than KNNI and MICE on accuracy and Mcoff in 80% of cases, precision in 40% of cases, recall in 60% of cases, F1 score, MAE, RMSE, and R^2 in 100% of cases, AIC in 80% of cases, and BIC in 100% of cases. The correlation analysis also confirms that MissForest imputation preserves the associations between variables found in the complete dataset. Overall, our findings suggest that MissForest is a better machine learning-based imputation technique for handling missing data in diabetes research.
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
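
The evaluation protocol summarized in the abstract (hide values at a given missing rate, impute them, then score the imputed cells against the complete data with MAE, RMSE, and R^2) can be sketched roughly as follows. This is only an illustrative sketch and not the authors' code: the data are synthetic rather than the diabetes dataset, values are hidden completely at random (an assumption, since the abstract does not state the missingness mechanism), and a simplified pure-Python nearest-neighbor imputer stands in for KNNI. The function names `mask_at_random`, `knn_impute`, and `imputation_errors` are hypothetical.

```python
import math
import random


def mask_at_random(rows, rate, rng):
    """Return (masked copy, hidden cell list); hidden cells are set to None."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = set(rng.sample(cells, round(rate * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(rows)]
    return masked, sorted(hidden)


def knn_impute(masked, k=5):
    """Fill each None with the column mean over the k nearest donor rows."""
    # Column means of observed values serve as a fallback (0.0 if none observed).
    col_means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        col_means.append(sum(observed) / len(observed) if observed else 0.0)

    def distance(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which column j is observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(masked)
                if m != i and other[j] is not None
            )[:k]
            donors = [v for d, v in donors if d != math.inf]
            filled[i][j] = sum(donors) / len(donors) if donors else col_means[j]
    return filled


def imputation_errors(complete, filled, hidden):
    """MAE, RMSE, and R^2 computed over the hidden cells only (pooled)."""
    truth = [complete[i][j] for i, j in hidden]
    guess = [filled[i][j] for i, j in hidden]
    n = len(truth)
    mae = sum(abs(t - g) for t, g in zip(truth, guess)) / n
    ss_res = sum((t - g) ** 2 for t, g in zip(truth, guess))
    rmse = math.sqrt(ss_res / n)
    mean_truth = sum(truth) / n
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return mae, rmse, 1.0 - ss_res / ss_tot


# Synthetic, correlated features (illustrative only).
rng = random.Random(0)
complete = []
for _ in range(60):
    base = rng.gauss(0.0, 1.0)
    complete.append([base + rng.gauss(0.0, 0.3) for _ in range(4)])

# The five missing rates named in the abstract: 10% to 50%.
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked, hidden = mask_at_random(complete, rate, rng)
    filled = knn_impute(masked, k=5)
    mae, rmse, r2 = imputation_errors(complete, filled, hidden)
    print(f"rate={rate:.0%}  MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```

Pooling all hidden cells into a single R^2 is only reasonable here because the synthetic features share one scale; with real clinical variables the errors would more plausibly be computed per column or on standardized values.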

Pages: 5905-5925

Page count: 20

