Machine Learning Based Missing Data Imputation in Categorical Datasets

Cited by: 1

Authors:

Ishaq, Muhammad [1]

Zahir, Sana [1]

Iftikhar, Laila [1]

Bulbul, Mohammad Farhad [2]

Rho, Seungmin [3]

Lee, Mi Young [4]

Affiliations:

[1] Univ Agr Peshawar, Inst Comp Sci & Informat Technol, Peshawar 25000, Khyber Pakhtunk, Pakistan

[2] Jashore Univ Sci & Technol, Dept Math, Jashore 7408, Bangladesh

[3] Chung Ang Univ, Dept Ind Secur, Seoul 06974, South Korea

[4] Chung Ang Univ, Dept Res, Seoul 06974, South Korea

Source:

IEEE ACCESS | 2024 / Vol. 12

Funding:

National Research Foundation, Singapore;

Keywords:

Data cleansing; missing data imputation; classification; regression and categorical datasets; MULTIPLE IMPUTATION;

DOI:

10.1109/ACCESS.2024.3411817

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

This research investigated the use of machine learning algorithms to predict and fill in missing values in categorical datasets. The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework, including models based on SVM and KNN as well as a hybrid classifier that combines models based on SVM, KNN, and MLP. Three diverse datasets (CPU, Hypothyroid, and Breast Cancer) were employed to validate these algorithms. Results indicated that these machine learning techniques performed well in predicting and completing missing data, with effectiveness varying by dataset and missing data pattern. Compared with single models, ensemble models built on the ECOC framework significantly improved prediction accuracy and robustness. Despite these encouraging results, deep learning for missing data imputation faces obstacles, including the requirement for large amounts of labeled data and the risk of over-fitting. Future research should evaluate the feasibility and efficacy of deep learning algorithms for missing data imputation.
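The abstract describes imputation as classification: the column with gaps becomes the label, the remaining columns become features, and an ECOC ensemble predicts the missing labels. This record does not include the authors' implementation, so the following is only a minimal sketch of the general idea under stated assumptions. It swaps in a pure-Python nearest-neighbour learner with Hamming distance for the SVM, KNN, and MLP base models, uses an exhaustive code book, and assumes all feature columns are complete and at least one label is observed. All names (`impute_column`, `exhaustive_code_book`, `MISSING`) and the toy rows are hypothetical.

```python
from collections import Counter
from itertools import product

MISSING = None  # hypothetical marker for a gap in the target column


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def exhaustive_code_book(classes):
    """Map each class to a codeword; one bit per two-group split of the classes.

    The first class always gets bit 1, so no split is listed twice.
    The number of bits grows as 2**(k-1) - 1, so this suits few classes only.
    """
    k = len(classes)
    columns = [(1,) + rest for rest in product((0, 1), repeat=k - 1) if 0 in rest]
    return {c: tuple(col[i] for col in columns) for i, c in enumerate(classes)}


def knn_bit(train_x, train_bits, x, k):
    """Binary base learner: majority bit among the k nearest training rows."""
    nearest = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], x))[:k]
    return Counter(train_bits[i] for i in nearest).most_common(1)[0][0]


def impute_column(rows, target, k=1):
    """Return a copy of rows with gaps in column `target` filled via ECOC."""
    def features(row):
        return row[:target] + row[target + 1:]

    known = [r for r in rows if r[target] is not MISSING]
    classes = sorted({r[target] for r in known})
    book = exhaustive_code_book(classes)
    train_x = [features(r) for r in known]
    n_bits = len(book[classes[0]])
    # One training-bit vector per code-book column.
    bit_columns = [[book[r[target]][j] for r in known] for j in range(n_bits)]

    filled = []
    for row in rows:
        if row[target] is not MISSING:
            filled.append(row)
            continue
        x = features(row)
        bits = tuple(knn_bit(train_x, col, x, k) for col in bit_columns)
        # Decode: pick the class whose codeword is closest to the predicted bits.
        label = min(classes, key=lambda c: hamming(book[c], bits))
        filled.append(row[:target] + (label,) + row[target + 1:])
    return filled


# Toy data, invented for illustration only (not from the paper's datasets).
rows = [
    ("low", "no", "A"), ("low", "yes", "A"),
    ("mid", "no", "B"), ("mid", "yes", "B"),
    ("high", "no", "C"), ("high", "yes", "C"),
    ("mid", "no", MISSING), ("high", "yes", MISSING),
]
filled = impute_column(rows, target=2)
print(filled[-2:])  # [('mid', 'no', 'B'), ('high', 'yes', 'C')]
```

With `k=1` each predicted bit string matches a codeword exactly, so decoding is trivial here. The error-correcting property matters when the base learners disagree, which is the situation the ensemble setting in the abstract addresses.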


Pages: 88332-88344

Page count: 13

