Revisiting the Problem of Missing Values in High-Dimensional Data and Feature Selection Effect

Cited by: 0

Authors:

Elia, Marina G. [1]

Duan, Wenting [1]

Affiliations:

[1] Univ Lincoln, Dept Comp Sci, Lincoln, England

Source:

ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, PT I, AIAI 2024 | 2024 / Vol. 711

Keywords:

Simulation; Missing Data; Imputation Methods; Feature Selection; Value Imputation

DOI:

10.1007/978-3-031-63211-2_16

Chinese Library Classification:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Missing values are a prevalent challenge in mass spectrometry (MS) data, and many typical analysis approaches (e.g. regression approaches) apply listwise deletion in the presence of missingness, leading to information loss. To address missingness, numerous imputation methods (IMs) have been proposed. Nonetheless, the choice of method is of key importance both in relation to computational cost, especially for high-dimensional data, and in relation to the impact on downstream data analyses. Despite the extensive published literature on distinct IMs tailored to specific missing value scenarios, there is scant research concerning the impact of IMs on feature selection. In this study, four computationally fast IMs (Zero, Mean, Median, and Expectation-Maximization) were considered on synthetically missing MS data for a range of scenarios, including different missingness mechanisms (Missing at Random-MAR, Missing Completely at Random-MCAR, Missing Not at Random-MNAR) and percentages of missingness (10%, 20%, 50%). Least absolute shrinkage and selection operator (LASSO) regression was employed on the imputed data to examine how the choice of IM, under different scenarios of missingness, affected which features were selected and their coefficient estimates. We observed that all IMs considered achieved high levels of accuracy across the different scenarios of missingness. LASSO regression results also showed a certain level of agreement, as evidenced by the features that were commonly selected across different IMs for the same scenario of missingness. It must be noted that the magnitude of the coefficients of the commonly selected features was influenced by the choice of IM. The findings from this simulation study provide valuable insights to analysts and researchers, highlighting that computationally efficient IMs can offer a good level of accuracy for missingness scenarios in high-dimensional data.
Acknowledging potential challenges, this study provides a foundation for further simulations to guide the choice of imputation approach for scenarios of high dimensional data in the presence of missingness.
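
The masking-and-imputation step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the data are synthetic Gaussian values rather than MS data, only the MCAR mechanism and the Zero, Mean, and Median methods are shown (Expectation-Maximization and the LASSO step are omitted), and the helper names `mask_mcar`, `impute`, and `rmse_on_missing` are hypothetical.

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """Set a fraction `rate` of entries to NaN uniformly at random (MCAR)."""
    X_miss = X.copy()
    mask = rng.random(X.shape) < rate
    X_miss[mask] = np.nan
    return X_miss, mask

def impute(X_miss, method):
    """Fill NaNs column-wise with zero, the column mean, or the column median."""
    if method == "zero":
        fill = np.zeros(X_miss.shape[1])
    elif method == "mean":
        fill = np.nanmean(X_miss, axis=0)
    elif method == "median":
        fill = np.nanmedian(X_miss, axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    X_imp = X_miss.copy()
    rows, cols = np.where(np.isnan(X_miss))
    X_imp[rows, cols] = fill[cols]
    return X_imp

def rmse_on_missing(X_true, X_imp, mask):
    """Root-mean-square error restricted to the entries that were masked."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imp[mask]) ** 2)))

# Assumed synthetic data: 200 samples, 50 features, centred away from zero.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(200, 50))

results = {}
for rate in (0.1, 0.2, 0.5):  # missingness percentages named in the abstract
    X_miss, mask = mask_mcar(X, rate, rng)
    for method in ("zero", "mean", "median"):
        X_imp = impute(X_miss, method)
        results[(rate, method)] = rmse_on_missing(X, X_imp, mask)

for key, value in sorted(results.items()):
    print(key, round(value, 3))
```

In a fuller simulation, each imputed matrix would then be passed to a LASSO fit so that the selected features and their coefficients can be compared across imputation methods.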

Pages: 201-213

Number of pages: 13
