Revisiting the Problem of Missing Values in High-Dimensional Data and Feature Selection Effect

Cited by: 0
Authors
Elia, Marina G. [1]
Duan, Wenting [1]
Affiliations
[1] Univ Lincoln, Dept Comp Sci, Lincoln, England
Keywords
Simulation; Missing Data; Imputation Methods; Feature Selection; Value Imputation
DOI
10.1007/978-3-031-63211-2_16
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Missing values are a prevalent challenge in mass spectrometry (MS) data, and many typical analysis approaches (e.g. regression approaches) apply listwise deletion in the presence of missingness, leading to information loss. To address missingness, numerous imputation methods (IMs) have been proposed. Nonetheless, the choice of method is of key importance both for computational cost, especially for high-dimensional data, and for its impact on downstream data analyses. Despite the extensive published literature on distinct IMs tailored to specific missing value scenarios, there is scant research concerning the impact of IMs on feature selection. In this study, four computationally fast IMs (Zero, Mean, Median, and Expectation-Maximization) were applied to synthetically missing MS data for a range of scenarios, including different missingness mechanisms (Missing at Random, MAR; Missing Completely at Random, MCAR; Missing Not at Random, MNAR) and percentages of missingness (10%, 20%, 50%). Least absolute shrinkage and selection operator (LASSO) regression was employed on the imputed data to examine how the choice of IM, under different scenarios of missingness, affected which features were selected and their coefficient estimates. We observed that all IMs considered achieved high accuracy across the different scenarios of missingness. Also, LASSO regression results showed a certain level of agreement, as evidenced by the features that were commonly selected across different IMs for the same scenario of missingness. It must be noted that the magnitude of the coefficients of the commonly selected features was influenced by the choice of IM. The findings from this simulation study provide valuable insights to analysts and researchers, highlighting that computationally efficient IMs can offer a good level of accuracy for missingness scenarios in high-dimensional data.
Acknowledging potential challenges, this study provides a foundation for further simulations to guide the choice of imputation approach for scenarios of high dimensional data in the presence of missingness.
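
The pipeline summarised in the abstract (induce missingness, impute, then compare LASSO selections) can be illustrated with a minimal Python sketch. This is not the authors' code. The data are synthetic Gaussian values rather than MS intensities, the way each mechanism is simulated (`make_mask`), the penalty `alpha`, and the hand-written coordinate-descent solver `lasso_cd` are all assumptions made for illustration, and Expectation-Maximization imputation is omitted for brevity.

```python
import numpy as np

def make_mask(X, rate, mechanism, rng):
    """Boolean mask (True = missing). The mechanisms are illustrative assumptions."""
    n, p = X.shape
    if mechanism == "MCAR":
        # Every cell is equally likely to be missing.
        return rng.random((n, p)) < rate
    if mechanism == "MNAR":
        # Missingness depends on the value itself: low values go missing more often.
        ranks = X.argsort(axis=0).argsort(axis=0) / (n - 1)  # 0 = smallest in column
        return rng.random((n, p)) < np.clip(2 * rate * (1 - ranks), 0, 1)
    if mechanism == "MAR":
        # Missingness depends on column 0, which itself stays fully observed.
        driver = X[:, 0].argsort().argsort() / (n - 1)
        prob = np.clip(2 * rate * driver, 0, 1)[:, None] * np.ones((1, p))
        mask = rng.random((n, p)) < prob
        mask[:, 0] = False
        return mask
    raise ValueError(f"unknown mechanism: {mechanism}")

def impute(X_miss, method):
    """Fill NaN cells column-wise with zero, the observed mean, or the observed median."""
    X = X_miss.copy()
    if method == "zero":
        fill = np.zeros(X.shape[1])
    elif method == "mean":
        fill = np.nanmean(X, axis=0)
    elif method == "median":
        fill = np.nanmedian(X, axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha * ||b||_1 on centred data."""
    n, p = X.shape
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta, resid = np.zeros(p), yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # constant column carries no information
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            resid += Xc[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
n, p = 200, 30
X_full = rng.normal(loc=5.0, scale=1.0, size=(n, p))   # synthetic stand-in data
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]                        # hypothetical true signal
y = X_full @ true_beta + rng.normal(scale=0.5, size=n)

for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = make_mask(X_full, 0.2, mechanism, rng)
    X_miss = np.where(mask, np.nan, X_full)
    for method in ("zero", "mean", "median"):
        X_imp = impute(X_miss, method)
        rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
        selected = np.flatnonzero(lasso_cd(X_imp, y, alpha=0.2))
        print(f"{mechanism:5s} {method:6s} RMSE={rmse:.3f} selected={selected.tolist()}")
```

Comparing the `selected` sets and the corresponding coefficients across imputation methods within one mechanism mirrors, in miniature, the kind of agreement check the abstract describes. The numbers such a toy run prints say nothing about the paper's actual results.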
Pages: 201-213
Page count: 13