Performance and explainability of feature selection-boosted tree-based classifiers for COVID-19 detection

Cited by: 2

Authors:

Rufino, Jesus ^{[1]}

Ramirez, Juan Marcos ^{[1]}

Aguilar, Jose ^{[1,2,3]}

Baquero, Carlos ^{[4,5]}

Champati, Jaya ^{[1]}

Frey, Davide ^{[6]}

Lillo, Rosa Elvira ^{[7]}

Fernandez-Anta, Antonio ^{[1]}

Affiliations:

[1] IMDEA Networks Inst, Madrid 28918, Spain

[2] Univ Los Andes, CEMISID, Merida 5101, Venezuela

[3] Univ EAFIT, CIDITIC, Medellin, Colombia

[4] Univ Minho, Braga, Portugal

[5] INESCTEC, Braga, Portugal

[6] INRIA, Rennes, France

[7] Univ Carlos III, Madrid, Spain

Source:

HELIYON | 2024, Vol. 10, No. 01

Keywords:

COVID-19; detection; Explainability analysis; Gradient boosting classifiers; Random forest; Recursive feature elimination; Shapley values;

DOI:

10.1016/j.heliyon.2023.e23219

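Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];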

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

In this paper, we evaluate the performance and analyze the explainability of machine learning models boosted by feature selection in predicting COVID-19-positive cases from self-reported information. In essence, this work describes a methodology to identify COVID-19 infections that considers the large amount of information collected by the University of Maryland Global COVID-19 Trends and Impact Survey (UMD-CTIS). More precisely, this methodology performs a feature selection stage based on the recursive feature elimination (RFE) method to reduce the number of input variables without compromising detection accuracy. A tree-based supervised machine learning model is then optimized with the selected features to detect COVID-19-active cases. In contrast to previous approaches that use a limited set of selected symptoms, the proposed approach builds the detection engine considering a broad range of features including self-reported symptoms, local community information, vaccination acceptance, and isolation measures, among others. To implement the methodology, three different supervised classifiers were used: random forests (RF), light gradient boosting (LGB), and extreme gradient boosting (XGB). Based on data collected from the UMD-CTIS, we evaluated the detection performance of the methodology for four countries (Brazil, Canada, Japan, and South Africa) and two periods (2020 and 2021). The proposed approach was assessed in terms of various quality metrics: F1-score, sensitivity, specificity, precision, receiver operating characteristic (ROC), and area under the ROC curve (AUC). This work also shows the normalized daily incidence curves obtained by the proposed approach for the four countries. Finally, we perform an explainability analysis using Shapley values and feature importance to determine the relevance of each feature and the corresponding contribution for each country and each country/year.
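The pipeline outlined in the abstract (RFE-based feature selection, then a tree-based classifier scored with sensitivity, specificity, precision, F1-score, and AUC) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses scikit-learn, synthetic data from `make_classification` in place of UMD-CTIS responses, a random forest as the sole classifier, and arbitrary settings (30 input features, 10 retained, elimination step of 2). None of these values come from the paper, and the explainability step is not covered.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for survey features (assumption: NOT UMD-CTIS data).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=6, n_redundant=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Recursive feature elimination: repeatedly fit the forest and drop the
# least important features (2 per round) until 10 remain.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,
    step=2,
)
selector.fit(X_train, y_train)
selected = selector.support_  # boolean mask over the 30 input columns

# Refit a tree-based classifier on the selected features only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, selected], y_train)

y_pred = clf.predict(X_test[:, selected])
y_score = clf.predict_proba(X_test[:, selected])[:, 1]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return float(numerator) / float(denominator) if denominator else 0.0


metrics = {
    "sensitivity": ratio(tp, tp + fn),  # true positive rate
    "specificity": ratio(tn, tn + fp),  # true negative rate
    "precision": ratio(tp, tp + fp),
    "f1": float(f1_score(y_test, y_pred)),
    "auc": float(roc_auc_score(y_test, y_score)),
}
print(int(selected.sum()), "features kept")
print(metrics)
```

Swapping the random forest for a gradient boosting classifier would only change the estimator passed to `RFE` and the final fit; the selection and scoring steps stay the same.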


Pages: 21
