Performance and explainability of feature selection-boosted tree-based classifiers for COVID-19 detection

Times cited: 2
Authors
Rufino, Jesus [1]
Ramirez, Juan Marcos [1]
Aguilar, Jose [1, 2, 3]
Baquero, Carlos [4, 5]
Champati, Jaya [1]
Frey, Davide [6]
Lillo, Rosa Elvira [7]
Fernandez-Anta, Antonio [1]
Affiliations
[1] IMDEA Networks Inst, Madrid 28918, Spain
[2] Univ Los Andes, CEMISID, Merida 5101, Venezuela
[3] Univ EAFIT, CIDITIC, Medellin, Colombia
[4] Univ Minho, Braga, Portugal
[5] INESCTEC, Braga, Portugal
[6] INRIA, Rennes, France
[7] Univ Carlos III, Madrid, Spain
Keywords
COVID-19; detection; Explainability analysis; Gradient boosting classifiers; Random forest; Recursive feature elimination; Shapley values;
DOI
10.1016/j.heliyon.2023.e23219
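Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code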
07 ; 0710 ; 09 ;
Abstract
In this paper, we evaluate the performance and analyze the explainability of machine learning models boosted by feature selection in predicting COVID-19-positive cases from self-reported information. In essence, this work describes a methodology to identify COVID-19 infections that considers the large amount of information collected by the University of Maryland Global COVID-19 Trends and Impact Survey (UMD-CTIS). More precisely, this methodology performs a feature selection stage based on the recursive feature elimination (RFE) method to reduce the number of input variables without compromising detection accuracy. A tree-based supervised machine learning model is then optimized with the selected features to detect COVID-19-active cases. In contrast to previous approaches that use a limited set of selected symptoms, the proposed approach builds the detection engine considering a broad range of features including self-reported symptoms, local community information, vaccination acceptance, and isolation measures, among others. To implement the methodology, three different supervised classifiers were used: random forests (RF), light gradient boosting (LGB), and extreme gradient boosting (XGB). Based on data collected from the UMD-CTIS, we evaluated the detection performance of the methodology for four countries (Brazil, Canada, Japan, and South Africa) and two periods (2020 and 2021). The proposed approach was assessed in terms of various quality metrics: F1-score, sensitivity, specificity, precision, receiver operating characteristic (ROC), and area under the ROC curve (AUC). This work also shows the normalized daily incidence curves obtained by the proposed approach for the four countries. Finally, we perform an explainability analysis using Shapley values and feature importance to determine the relevance of each feature and the corresponding contribution for each country and each country/year.
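The two-stage idea in the abstract (RFE-based feature selection, then a tree-based classifier scored with sensitivity, specificity, precision, and F1) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: it assumes scikit-learn is available, uses synthetic data in place of UMD-CTIS survey responses, and the names `N_SELECTED` and `safe_div` as well as all hyperparameters are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for survey features (NOT UMD-CTIS data).
# Class 1 plays the role of "COVID-19 positive" and is the minority class.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=2,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stage 1: recursive feature elimination. A random forest ranks the
# features, the weakest one is dropped, and the process repeats until
# N_SELECTED features remain. N_SELECTED is an assumed value.
N_SELECTED = 8
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=N_SELECTED,
    step=1,
)
selector.fit(X_train, y_train)
selected = np.flatnonzero(selector.support_)  # column indices that were kept

# Stage 2: train the detection model on the selected features only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, selected], y_train)
y_pred = clf.predict(X_test[:, selected])

# Quality metrics derived from the binary confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


sensitivity = safe_div(tp, tp + fn)  # true positive rate (recall)
specificity = safe_div(tn, tn + fp)  # true negative rate
precision = safe_div(tp, tp + fp)
f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)

print("selected feature indices:", selected.tolist())
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} F1={f1:.3f}")
```

Swapping the random forest for a LightGBM or XGBoost classifier would follow the same pattern, since RFE only needs an estimator that exposes feature importances. The explainability step based on Shapley values is not shown here.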
Pages: 21
Related papers
50 items in total
  • [31] A new COVID-19 Patients Detection Strategy (CPDS) based on hybrid feature selection and enhanced KNN classifier
    Shaban, Warda M.
    Rabie, Asmaa H.
    Saleh, Ahmed I.
    Abo-Elsoud, M. A.
    KNOWLEDGE-BASED SYSTEMS, 2020, 205 (205)
  • [32] Analyzing the Features Affecting the Performance of Teachers during Covid-19: A Multilevel Feature Selection
    Saeed, Alqahtani
    Habib, Raja
    Zaffar, Maryam
    Quraishi, Khurrum Shehzad
    Altaf, Oriba
    Irfan, Muhammad
    Glowacz, Adam
    Tadeusiewicz, Ryszard
    Huneif, Mohammed Ayed
    Abdulwahab, Alqahtani
    Alduraibi, Sharifa Khalid
    Alshehri, Fahad
    Alduraibi, Alaa Khalid
    Almushayti, Ziyad
    ELECTRONICS, 2021, 10 (14)
  • [33] An automated COVID-19 detection based on fused dynamic exemplar pyramid feature extraction and hybrid feature selection using deep learning
    Ozyurt, Fatih
    Tuncer, Turker
    Subasi, Abdulhamit
    COMPUTERS IN BIOLOGY AND MEDICINE, 2021, 132
  • [35] Anomaly-based error and intrusion detection in tabular data: No DNN outperforms tree-based classifiers
    Zoppi, Tommaso
    Gazzini, Stefano
    Ceccarelli, Andrea
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 160 : 951 - 965
  • [36] Deep Learning Methods to Reveal Important X-ray Features in COVID-19 Detection: Investigation of Explainability and Feature Reproducibility
    Apostolopoulos, Ioannis D.
    Apostolopoulos, Dimitris J.
    Papathanasiou, Nikolaos D.
    REPORTS, 2022, 5 (02)
  • [37] Automatic feature subset selection for decision tree-based ensemble methods in the prediction of bioactivity
    Cao, Dong-Sheng
    Xu, Qing-Song
    Liang, Yi-Zeng
    Chen, Xian
    Li, Hong-Dong
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2010, 103 (02) : 129 - 136
  • [38] An Analysis of Feature Selection Techniques For COVID-19 Detection on Chest X-Ray Data
    Selleti, Andre L. Jeller
    Silla Jr, Carlos N.
    2021 IEEE 21ST INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (IEEE BIBE 2021), 2021
  • [39] An Improved DeepNN with Feature Ranking for Covid-19 Detection
    El-Attar, Noha E.
    Sabbeh, Sahar F.
    Fasihuddin, Heba
    Awad, Wael A.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 71 (02) : 2249 - 2269
  • [40] Enhancing Feature Selection Optimization for COVID-19 Microarray Data
    Krishanthi, Gayani
    Jayetileke, Harshanie
    Wu, Jinran
    Liu, Chanjuan
    Wang, You-Gan
    COVID, 2023, 3 (09) : 1336 - 1355