Consistent and unbiased variable selection under independent features using Random Forest permutation importance

Cited by: 2
Authors
Ramosaj, Burim [1]
Pauly, Markus [1]
Affiliations
[1] Tech Univ Dortmund, Inst Math Stat & Applicat Ind, Fac Stat, Joseph Von Fraunhofer Str 2-4, D-44227 Dortmund, Germany
Keywords
Random Forest; permutation importance; unbiasedness; consistency; Out-of-Bag samples; statistical learning
DOI
10.3150/22-BEJ1534
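Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract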
Variable selection in sparse regression models is an important task, as applications ranging from biomedical research to econometrics have shown. Especially for higher dimensional regression problems, for which the regression function as the link between response and covariates cannot be directly detected, the selection of informative variables is challenging. Under these circumstances, the Random Forest method is a helpful tool to predict new outcomes while delivering measures for variable selection. One common approach is the use of the permutation importance. Due to its intuitive idea and flexible usage, it is important to explore the circumstances under which the permutation importance based on Random Forest correctly indicates informative covariates. Regarding the latter, we deliver theoretical guarantees for the validity of the permutation importance measure under specific assumptions such as the mutual independence of the features and prove its (asymptotic) unbiasedness, while under slightly stricter assumptions, consistency of the permutation importance measure is established. An extensive simulation study supports our findings.
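The idea of an Out-of-Bag permutation importance described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or simulation setup: the tiny median-split regression trees, the function names (`fit_tree`, `predict`, `oob_permutation_importance`), the tuning values, and the toy data (three independent uniform features, of which only the first is informative) are all assumptions chosen for illustration. For each tree, the importance of feature j is taken as the increase in Out-of-Bag squared error after permuting feature j among that tree's Out-of-Bag rows, averaged over trees.

```python
import random

def fit_tree(X, y, idx, depth, rng, mtry):
    """Grow a small regression tree on rows idx; a leaf is a float, a node a tuple."""
    mean = sum(y[i] for i in idx) / len(idx)
    if depth == 0 or len(idx) < 10:
        return mean
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        # Simplification (assumption): only the node median is tried as a split point.
        vals = sorted(X[i][j] for i in idx)
        thr = vals[len(vals) // 2]
        left = [i for i in idx if X[i][j] < thr]
        right = [i for i in idx if X[i][j] >= thr]
        if not left or not right:
            continue
        sse = 0.0
        for part in (left, right):
            m = sum(y[i] for i in part) / len(part)
            sse += sum((y[i] - m) ** 2 for i in part)
        if best is None or sse < best[0]:
            best = (sse, j, thr, left, right)
    if best is None:
        return mean
    _, j, thr, left, right = best
    return (j, thr,
            fit_tree(X, y, left, depth - 1, rng, mtry),
            fit_tree(X, y, right, depth - 1, rng, mtry))

def predict(tree, x):
    """Follow the splits until a leaf value is reached."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if x[j] < thr else right
    return tree

def oob_permutation_importance(X, y, n_trees=30, depth=4, seed=0):
    """Average increase in OOB squared error after permuting each feature."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # Out-of-Bag rows
        tree = fit_tree(X, y, boot, depth, rng, mtry=p)
        base = sum((y[i] - predict(tree, X[i])) ** 2 for i in oob) / len(oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)                        # permute feature j within OOB
            err = 0.0
            for i, v in zip(oob, shuffled):
                x = list(X[i])
                x[j] = v
                err += (y[i] - predict(tree, x)) ** 2
            importance[j] += (err / len(oob) - base) / n_trees
    return importance

# Toy data (assumption): independent features, sparse model with only x0 informative.
data_rng = random.Random(1)
n = 300
X = [[data_rng.random() for _ in range(3)] for _ in range(n)]
y = [3.0 * row[0] + data_rng.gauss(0.0, 0.1) for row in X]

importance = oob_permutation_importance(X, y)
print([round(v, 3) for v in importance])
```

In this toy setting the first entry should clearly dominate, while the entries for the two uninformative features should stay close to zero, which is the kind of behaviour the unbiasedness result concerns.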
Pages: 2101-2118
Page count: 18
Related papers
50 in total
  • [31] Variable selection using random forests
    Sandri, Marco
    Zuccolotto, Paola
    DATA ANALYSIS, CLASSIFICATION AND THE FORWARD SEARCH, 2006, : 263 - +
  • [32] Design of a Database-Driven Modeling based on Variable Selection using a Random Forest
    Imaji, Hiromu
    Kinoshita, Takuya
    Yamamoto, Toru
    2018 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2018, : 215 - 220
  • [33] Variable selection for estimating individual tree height using genetic algorithm and random forest
    Miranda, Evandro Nunes
    Groenner Barbosa, Bruno Henrique
    Godinho Silva, Sergio Henrique
    Ussi Monti, Cassio Augusto
    Tng, David Yue Phin
    Gomide, Lucas Rezende
    FOREST ECOLOGY AND MANAGEMENT, 2022, 504
  • [34] A Hybrid Random Forest Variable Selection Approach for Omics Data
    Fouodo, Cesaire J. K.
    Koenig, Inke R.
    Szymczak, Silke
    GENETIC EPIDEMIOLOGY, 2022, 46 (07) : 494 - 494
  • [35] Inference after variable selection using restricted permutation methods
    Wang, Rui
    Lagakos, Stephen W.
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2009, 37 (04) : 625 - 644
  • [36] Predictive modeling of Pan Evaporation using Random Forest Algorithm along with Features Selection
    Rakhee
    Singh, Archana
    Mittal, Mamta
    Kumar, Amrender
    PROCEEDINGS OF THE CONFLUENCE 2020: 10TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING, 2020, : 380 - 384
  • [37] Using SVM and Random forest for different features selection in predicting bike rental amount
    Shiao, Yi Chen
    Chung, Wei Hsiang
    Chen, Rung Ching
    2018 9TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE AND TECHNOLOGY (ICAST), 2018, : 246 - 250
  • [38] Exploitation of surrogate variables in random forests for unbiased analysis of mutual impact and importance of features
    Voges, Lucas F.
    Jarren, Lukas C.
    Seifert, Stephan
    BIOINFORMATICS, 2023, 39 (08)
  • [39] Selection of features for fault diagnosis on rotating machines using random forest and wavelet analysis
    Saari, J.
    Lundberg, J.
    Odelius, J.
    Rantatalo, M.
    INSIGHT, 2018, 60 (08) : 434 - 442
  • [40] Bias in random forest variable importance measures: Illustrations, sources and a solution
    Strobl, Carolin
    Boulesteix, Anne-Laure
    Zeileis, Achim
    Hothorn, Torsten
    BMC BIOINFORMATICS, 2007, 8 (1)