Feature selection and validated predictive performance in the domain of Legionella pneumophila: A comparative study

Cited by: 10
Authors
Van Der Ploeg T. [1,2]
Steyerberg E.W. [2]
Affiliations
[1] Department of Science, Medical Center Alkmaar, Inholland University, Alkmaar
[2] Department of Public Health, Erasmus MC-University Medical Center Rotterdam, Rotterdam
Keywords
Support Vector Machine; Feature Selection; Random Forest; Support Vector Machine Model; Least Absolute Shrinkage and Selection Operator
DOI
10.1186/s13104-016-1945-2
Abstract
Background: Genetic comparisons of clinical and environmental Legionella strains form an essential part of outbreak investigations. DNA microarrays often comprise many DNA markers (features). Feature selection and the development of prediction models are particularly challenging in this domain, which has many variables and comparatively few subjects or data points. We aimed to compare modeling strategies for developing prediction models that classify infections as clinical or environmental.
Methods: We applied a bootstrap strategy for preselecting important features to a database containing 222 Legionella pneumophila strains with 448 continuous markers and a dichotomous outcome (clinical or environmental). Feature selection was done with 50 bootstrap samples, resulting in a top 10 of the most important features for each of four modeling techniques: classification and regression trees (CART), random forests (RF), support vector machines (SVM) and least absolute shrinkage and selection operator (LASSO). Validation was done in a second bootstrap resampling loop (200x) to evaluate discriminatory model performance according to the AUC.
Results: The top 5 selected features differed considerably between the modeling techniques, with only one common feature ("LePn.007B8"). The mean validated AUC values of the SVM model and the CART model were 0.859 and 0.873, respectively. The LASSO and the RF models showed higher validated AUC values (0.925 and 0.975, respectively).
Conclusions: In the domain of Legionella pneumophila, which comprises many potential features for classifying infections as clinical or environmental, the RF and LASSO techniques provide good prediction models. The identification of potentially biologically relevant features is highly dependent on the technique used and should hence be interpreted with caution.
© 2016 van der Ploeg and Steyerberg.
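The two-stage bootstrap procedure described in the Methods (feature preselection over 50 bootstrap samples, then validation of the AUC over 200 bootstrap resamples) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the data are synthetic, feature importance is a simple single-feature AUC rather than CART, RF, SVM or LASSO importance, the "model" is a signed sum of the selected features, and validation uses out-of-bag samples. All function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined with a single class; treat as uninformative
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def importance(X, y, j):
    """Stand-in importance (assumption): distance of the single-feature AUC from 0.5."""
    return abs(auc([row[j] for row in X], y) - 0.5)

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def preselect(X, y, n_boot=50, top_k=10, rng=None):
    """Count how often each feature ranks in the top_k across bootstrap samples."""
    rng = rng or random.Random(0)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = bootstrap_indices(n, rng)
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        ranked = sorted(range(p), key=lambda j: importance(Xb, yb, j), reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return sorted(range(p), key=lambda j: counts[j], reverse=True)[:top_k]

def fit_score(X, y, features):
    """Toy model (assumption): sum of selected features, each signed by its training direction."""
    signs = {j: (1 if auc([r[j] for r in X], y) >= 0.5 else -1) for j in features}
    return lambda row: sum(signs[j] * row[j] for j in features)

def validated_auc(X, y, features, n_boot=200, rng=None):
    """Mean out-of-bag AUC over bootstrap resamples (one possible validation scheme)."""
    rng = rng or random.Random(1)
    n = len(X)
    aucs = []
    for _ in range(n_boot):
        idx = bootstrap_indices(n, rng)
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        model = fit_score([X[i] for i in idx], [y[i] for i in idx], features)
        aucs.append(auc([model(X[i]) for i in oob], [y[i] for i in oob]))
    return sum(aucs) / len(aucs)

def make_data(n=80, p=20, informative=(0, 1, 2), seed=42):
    """Synthetic stand-in for the marker data: only `informative` columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0, 1) for _ in range(p)]
        for j in informative:
            row[j] += 1.5 * label
        X.append(row)
        y.append(label)
    return X, y

X, y = make_data()
top = preselect(X, y, n_boot=50, top_k=5)
mean_auc = validated_auc(X, y, top, n_boot=200)
print("selected features:", top)
print("mean validated AUC:", round(mean_auc, 3))
```

In this sketch the features are preselected once on the full data set and only the model fit is repeated inside the validation loop, so the resulting AUC may be somewhat optimistic. Swapping `importance` and `fit_score` for a different technique changes which features are selected, which mirrors the abstract's caution about interpreting selected features.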