Machine Learning Based Missing Data Imputation in Categorical Datasets

Cited by: 1
Authors
Ishaq, Muhammad [1]
Zahir, Sana [1]
Iftikhar, Laila [1]
Bulbul, Mohammad Farhad [2]
Rho, Seungmin [3]
Lee, Mi Young [4]
Affiliations
[1] Univ Agr Peshawar, Inst Comp Sci & Informat Technol, Peshawar 25000, Khyber Pakhtunk, Pakistan
[2] Jashore Univ Sci & Technol, Dept Math, Jashore 7408, Bangladesh
[3] Chung Ang Univ, Dept Ind Secur, Seoul 06974, South Korea
[4] Chung Ang Univ, Dept Res, Seoul 06974, South Korea
Source
IEEE ACCESS | 2024 / Vol. 12
Funding
National Research Foundation, Singapore;
Keywords
Data cleansing; missing data imputation; classification; regression and categorical datasets; MULTIPLE IMPUTATION;
DOI
10.1109/ACCESS.2024.3411817
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This research investigated the use of machine learning algorithms to predict and fill in missing values in categorical datasets. The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework, including models based on SVM and KNN as well as a hybrid classifier that combines models based on SVM, KNN, and MLP. Three diverse datasets (CPU, Hypothyroid, and Breast Cancer) were used to validate these algorithms. Results indicated that these machine learning techniques performed well in predicting and completing missing data, with effectiveness varying by dataset and missing-data pattern. Compared with standalone models, ensemble models built on the ECOC framework significantly improved prediction accuracy and robustness. Despite these encouraging results, deep learning for missing data imputation faces obstacles, including the need for large amounts of labeled data and the risk of overfitting. Future research should evaluate the feasibility and efficacy of deep learning algorithms for missing data imputation.
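The general idea of ECOC-based imputation of a categorical column can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes an exhaustive codebook, a small k-NN binary learner with Hamming distance over categorical features (the paper's models use SVM, KNN, and MLP), and a toy dataset invented for the example. All function names (`hamming`, `knn_binary`, `build_codebook`, `ecoc_impute`) are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_binary(train_X, train_bits, x, k=3):
    """Predict one code bit by majority vote of the k nearest training rows."""
    nearest = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))[:k]
    votes = sum(train_bits[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def build_codebook(classes):
    """Exhaustive ECOC: one column per non-trivial binary split of the classes."""
    n = len(classes)
    columns = [bits for bits in product((0, 1), repeat=n)
               if bits[0] == 1 and 0 in bits]
    return {c: tuple(col[i] for col in columns) for i, c in enumerate(classes)}


def ecoc_impute(rows, target, k=3):
    """Fill None entries in column `target`, training on rows where it is observed.

    Assumes at least one row has an observed value in `target` and that the
    other columns are fully observed.
    """
    feats = [j for j in range(len(rows[0])) if j != target]
    known = [r for r in rows if r[target] is not None]
    classes = sorted({r[target] for r in known})
    codebook = build_codebook(classes)
    train_X = [[r[j] for j in feats] for r in known]
    n_bits = len(codebook[classes[0]])
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue
        x = [r[j] for j in feats]
        # One binary prediction per codebook column.
        predicted = tuple(
            knn_binary(train_X, [codebook[s[target]][b] for s in known], x, k)
            for b in range(n_bits)
        )
        # Decode: pick the class whose codeword is closest to the predicted bits.
        best = min(classes, key=lambda c: hamming(codebook[c], predicted))
        filled.append([best if j == target else v for j, v in enumerate(r)])
    return filled


# Toy data (invented for illustration): the last column has missing values.
rows = [
    ["young", "low", "A"], ["young", "low", "A"], ["young", "high", "A"],
    ["old", "high", "B"], ["old", "high", "B"], ["old", "low", "B"],
    ["mid", "high", "C"], ["mid", "low", "C"], ["mid", "low", "C"],
    ["young", "low", None], ["old", "high", None],
]
filled = ecoc_impute(rows, target=2, k=3)
for row in filled[-2:]:
    print(row)
```

Because each class is represented by a multi-bit codeword, a single wrong binary prediction can still decode to the correct class, which is one plausible reading of the robustness gain the abstract attributes to the ECOC ensembles.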
Pages: 88332 - 88344
Number of pages: 13