nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data

Cited by: 18
Authors
Lewis, Myles J. [1, 2]
Spiliopoulou, Athina [3]
Goldmann, Katriona [1, 4]
Pitzalis, Costantino [1]
McKeigue, Paul [3]
Barnes, Michael R. [2, 4]
Affiliations
[1] Queen Mary Univ London, William Harvey Res Inst, Ctr Expt Med & Rheumatol, Barts & London Sch Med & Dent, London EC1M 6BQ, England
[2] Alan Turing Inst, London NW1 2AJ, England
[3] Univ Edinburgh, Usher Inst, Coll Med & Vet Med, Edinburgh EH16 4UX, Scotland
[4] Queen Mary Univ London, William Harvey Res Inst, Ctr Translat Bioinformat, Barts & London Sch Med & Dent, London EC1M 6BQ, England
Source
BIOINFORMATICS ADVANCES | 2023 / Volume 3 / Issue 01
Keywords
PREDICTION;
DOI
10.1093/bioadv/vbad048
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09
Abstract
Motivation: Although machine learning models are commonly used in medical research, many analyses implement a simple partition into training data and hold-out test data, with cross-validation (CV) for tuning of model hyperparameters. Nested CV with embedded feature selection is especially suited to biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P >> n).
Results: The nestedcv R package implements fully nested k × l-fold CV for lasso and elastic-net regularized linear models via the glmnet package and supports a large array of other machine learning models via the caret framework. Inner CV is used to tune models and outer CV is used to determine model performance without bias. Fast filter functions for feature selection are provided and the package ensures that filters are nested within the outer CV loop to avoid information leakage from performance test sets. Measurement of performance by outer CV is also used to implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model and determine unbiased model accuracy.
Availability and implementation: The R package nestedcv is available from CRAN: https://CRAN.R-project.org/package=nestedcv.
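The nesting described in the abstract can be illustrated with a small, language-agnostic sketch. The Python below is not the nestedcv API and does not use glmnet or caret. It is a minimal illustration under stated assumptions: synthetic data, a mean-difference filter, and a nearest-centroid classifier stand in for the package's filters and models, and every function name (`make_data`, `make_folds`, `filter_features`, `fit_and_score`, `nested_cv`) is hypothetical. The point it shows is that both the feature filter and the tuning of the number of retained features run on outer-training rows only, so outer test folds never influence feature selection.

```python
import random


def avg(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)


def make_data(n=40, p=200, n_informative=5, shift=2.0, seed=0):
    """Synthetic P >> n data: only the first few columns differ between classes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * y[i] if j < n_informative else 0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return X, y


def make_folds(rows, y, k, rng):
    """Stratified split: shuffle each class, then deal rows round-robin into k folds."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for position, row in enumerate(pos + neg):
        folds[position % k].append(row)
    return folds


def filter_features(X, y, rows, n_keep):
    """Keep the n_keep columns with the largest class-mean difference within `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        return abs(avg(X[i][j] for i in pos) - avg(X[i][j] for i in neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def fit_and_score(X, y, train, test, n_keep):
    """Filter and fit on `train` only, then return accuracy on `test`."""
    feats = filter_features(X, y, train, n_keep)
    centroids = {label: [avg(X[i][j] for i in train if y[i] == label) for j in feats]
                 for label in (0, 1)}

    def predict(x):
        # Assign the class whose centroid is nearest in squared Euclidean distance.
        return min(centroids, key=lambda label: sum(
            (x[j] - c) ** 2 for j, c in zip(feats, centroids[label])))

    return avg(1.0 if predict(X[i]) == y[i] else 0.0 for i in test)


def nested_cv(X, y, grid=(2, 5, 10), k_outer=5, k_inner=4, seed=1):
    """Return one held-out accuracy per outer fold."""
    rng = random.Random(seed)
    rows = list(range(len(X)))
    outer_scores = []
    for test in make_folds(rows, y, k_outer, rng):
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        inner_folds = make_folds(train, y, k_inner, rng)

        def inner_score(n_keep):
            # Mean validation accuracy over inner folds; never touches `test`.
            scores = []
            for val in inner_folds:
                val_set = set(val)
                fit_rows = [i for i in train if i not in val_set]
                scores.append(fit_and_score(X, y, fit_rows, val, n_keep))
            return avg(scores)

        best = max(grid, key=inner_score)  # hyperparameter tuned by inner CV
        outer_scores.append(fit_and_score(X, y, train, test, best))
    return outer_scores


if __name__ == "__main__":
    X, y = make_data()
    scores = nested_cv(X, y)
    print("outer-fold accuracies:", scores)
    print("mean accuracy:", avg(scores))
```

Running the filter once on the full data set before splitting would let test rows influence which features are kept. Calling `filter_features` inside `fit_and_score` on training rows only is the design choice that avoids that leakage.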
Pages: 5