Impact of machine learning-based imputation techniques on medical datasets: a comparative analysis

Authors
Tiwaskar S. [1]
Rashid M. [1]
Gokhale P. [1]
Affiliations
[1] Department of Computer Engineering, Faculty of Science and Technology, Vishwakarma University, Pune
Keywords
Diabetes prediction; Machine learning-based imputation techniques; Missing data
DOI
10.1007/s11042-024-19103-0
Abstract
In medical datasets, particularly those concerning diabetes, incomplete data is a common problem. Uncovering valuable patterns through medical data analysis is crucial for early and precise medical predictions, but this depends on data quality and on the proper handling of missing data. Imputation is a robust approach to this challenge. The main goal of this paper is to provide a comprehensive investigation of the effects of Machine Learning (ML) based imputation techniques, specifically K Nearest Neighbor Imputation (KNNI), Multiple Imputation by Chained Equations (MICE), and MissForest. The results of all three techniques are compared with the complete dataset at five missing rates (10% to 50%) and evaluated using four categories of criteria: (1) model performance, (2) imputation error (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R^2)), (3) Pearson correlation analysis, and (4) model selection criteria (Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC)). Model performance comprises the accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (Mcoff) of four ML classifiers: (a) Random Forest (RF), (b) Support Vector Machine (SVM), (c) AdaBoost, and (d) XGBoost (XGB). Across the missing rates, MissForest outperforms KNNI and MICE in accuracy and Mcoff in 80% of cases, in precision in 40% of cases, in recall in 60% of cases, in F1 score, MAE, RMSE, and R^2 in 100% of cases, in AIC in 80% of cases, and in BIC in 100% of cases. The correlation analysis also confirms that MissForest imputation preserves the associations between variables found in the complete dataset. Overall, our findings suggest that MissForest is a better machine learning-based imputation technique for handling missing data in diabetes research.
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
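The evaluation protocol described in the abstract (remove values from a complete dataset at a fixed missing rate, impute them, then score the imputed values against the known truth) can be illustrated with a minimal sketch. This is not the authors' implementation: the KNN imputer is a simplified pure-Python variant, the data is a synthetic toy table rather than a diabetes dataset, and the function names `mask_values`, `knn_impute`, and `imputation_errors` are hypothetical names chosen for this example.

```python
import math
import random


def mask_values(rows, rate, seed=0):
    """Copy `rows` and set a fraction `rate` of each column to None.

    Returns the incomplete copy and the list of masked (row, col) positions.
    Masking per column is an assumption made here so that every column
    keeps some observed values.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    per_col = int(round(rate * n_rows))
    out = [list(r) for r in rows]
    masked = []
    for j in range(n_cols):
        for i in rng.sample(range(n_rows), per_col):
            out[i][j] = None
            masked.append((i, j))
    return out, masked


def knn_impute(rows, k=3):
    """Fill each None with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed
    in both rows (a simplified stand-in for KNN imputation).
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [(dist(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


def imputation_errors(truth, imputed, masked):
    """Return (MAE, RMSE, R^2) computed on the masked positions only."""
    y = [truth[i][j] for i, j in masked]
    p = [imputed[i][j] for i, j in masked]
    n = len(y)
    mae = sum(abs(a - b) for a, b in zip(y, p)) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, r2


if __name__ == "__main__":
    # Synthetic toy table (assumption): 20 rows, 3 numeric features.
    complete = [[float(i), 2.0 * i + 1.0, float(i % 5)] for i in range(20)]
    for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
        incomplete, masked = mask_values(complete, rate, seed=42)
        imputed = knn_impute(incomplete, k=3)
        mae, rmse, r2 = imputation_errors(complete, imputed, masked)
        print(f"rate={rate:.0%} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

In a fuller comparison, the same masked positions would be imputed by each technique under study and the error metrics placed side by side, so that differences reflect the imputer rather than the masking.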
Pages: 5905-5925
Page count: 20