Performance and explainability of feature selection-boosted tree-based classifiers for COVID-19 detection

Cited by: 2
Authors
Rufino, Jesus [1]
Ramirez, Juan Marcos [1]
Aguilar, Jose [1, 2, 3]
Baquero, Carlos [4, 5]
Champati, Jaya [1]
Frey, Davide [6]
Lillo, Rosa Elvira [7]
Fernandez-Anta, Antonio [1]
Affiliations
[1] IMDEA Networks Inst, Madrid 28918, Spain
[2] Univ Los Andes, CEMISID, Merida 5101, Venezuela
[3] Univ EAFIT, CIDITIC, Medellin, Colombia
[4] Univ Minho, Braga, Portugal
[5] INESCTEC, Braga, Portugal
[6] INRIA, Rennes, France
[7] Univ Carlos III, Madrid, Spain
Keywords
COVID-19; detection; Explainability analysis; Gradient boosting classifiers; Random forest; Recursive feature elimination; Shapley values;
DOI
10.1016/j.heliyon.2023.e23219
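Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract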
In this paper, we evaluate the performance and analyze the explainability of machine learning models boosted by feature selection in predicting COVID-19-positive cases from self-reported information. In essence, this work describes a methodology to identify COVID-19 infections that considers the large amount of information collected by the University of Maryland Global COVID-19 Trends and Impact Survey (UMD-CTIS). More precisely, this methodology performs a feature selection stage based on the recursive feature elimination (RFE) method to reduce the number of input variables without compromising detection accuracy. A tree-based supervised machine learning model is then optimized with the selected features to detect COVID-19-active cases. In contrast to previous approaches that use a limited set of selected symptoms, the proposed approach builds the detection engine considering a broad range of features including self-reported symptoms, local community information, vaccination acceptance, and isolation measures, among others. To implement the methodology, three different supervised classifiers were used: random forests (RF), light gradient boosting (LGB), and extreme gradient boosting (XGB). Based on data collected from the UMD-CTIS, we evaluated the detection performance of the methodology for four countries (Brazil, Canada, Japan, and South Africa) and two periods (2020 and 2021). The proposed approach was assessed in terms of various quality metrics: F1-score, sensitivity, specificity, precision, receiver operating characteristic (ROC), and area under the ROC curve (AUC). This work also shows the normalized daily incidence curves obtained by the proposed approach for the four countries. Finally, we perform an explainability analysis using Shapley values and feature importance to determine the relevance of each feature and the corresponding contribution for each country and each country/year.
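The threshold-based quality metrics named in the abstract (sensitivity, specificity, precision, F1-score) and the AUC can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions chosen for illustration, and the AUC is computed through its rank interpretation (the probability that a random positive case scores higher than a random negative one).

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = COVID-19 positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, precision and F1-score from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and classifier scores for eight respondents.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = detection_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise AUC above is quadratic in the number of samples, which is fine for a sketch; a sort-based rank computation would be the usual choice for survey-scale data.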
Pages: 21
Related papers
50 in total
  • [41] Genetic Algorithms for Feature Selection in the Classification of COVID-19 Patients
    Aliani, Cosimo
    Rossi, Eva
    Solinski, Mateusz
    Francia, Piergiorgio
    Lanata, Antonio
    Buchner, Teodor
    Bocchi, Leonardo
    BIOENGINEERING-BASEL, 2024, 11 (09)
  • [42] Comparative analysis of feature selection techniques for COVID-19 dataset
    Mohtasham, Farideh
    Pourhoseingholi, MohamadAmin
    Nazari, Seyed Saeed Hashemi
    Kavousi, Kaveh
    Zali, Mohammad Reza
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [43] Audio Feature Ranking for Sound-Based COVID-19 Patient Detection
    Meister, Julia A.
    Nguyen, Khuong An
    Luo, Zhiyuan
    PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022, 2022, 13566 : 146 - 158
  • [44] Accurate detection of COVID-19 using deep features based on X-Ray images and feature selection methods
    Narin, Ali
    COMPUTERS IN BIOLOGY AND MEDICINE, 2021, 137
  • [45] Hybrid optimized feature selection and deep learning based COVID-19 disease prediction
    Joseph, S. John
    Raj, R. Gandhi
    COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING, 2023, 26 (16) : 2070 - 2088
  • [46] Incorporating Feature Selection Methods into Machine Learning-Based Covid-19 Diagnosis
    Danaci, Cagla
    Tuncer, Seda Arslan
    APPLIED COMPUTER SYSTEMS, 2022, 27 (01) : 13 - 18
  • [47] forgeNet: a graph deep neural network model using tree-based ensemble classifiers for feature graph construction
    Kong, Yunchuan
    Yu, Tianwei
    BIOINFORMATICS, 2020, 36 (11) : 3507 - 3515
  • [48] Tree-based data mining for safety assessment of first COVID-19 booster doses in the Vaccine Safety Datalink
    Yih, Katherine
    Daley, Matthew F.
    Duffy, Jonathan
    Fireman, Bruce
    McClure, David
    Nelson, Jennifer
    Qian, Lei
    Smith, Ning
    Vazquez-Benitez, Gabriela
    Weintraub, Eric
    Williams, Joshua T. B.
    Xu, Stanley
    Maro, Judith C.
    VACCINE, 2023, 41 (02) : 460 - 466
  • [49] Performance comparisons of tree-based and cell-based contact detection algorithms
    Han, K.
    Feng, Y. T.
    Owen, D. R. J.
    ENGINEERING COMPUTATIONS, 2007, 24 (1-2) : 165 - 181
  • [50] Intrusion detection based on feature selection and tree Parzen estimation
    Jin Z.
    Wu T.
    Xi Tong Gong Cheng Yu Dian Zi Ji Shu/Systems Engineering and Electronics, 2021, 43 (07) : 1954 - 1960