Classification Performance of Three Approaches for Combining Data Sampling and Gene Selection on Bioinformatics Data

Citations: 0
Authors
Khoshgoftaar, Taghi M. [1]
Fazelpour, Alireza [1]
Dittman, David J. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Keywords
Data sampling techniques; data sampling order; class imbalance; feature selection; CHEMOTHERAPY; PREDICTOR
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle class imbalance, and the problem of excessive features may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques together to tackle both challenges and build effective classification models. In particular, we ask whether the order of these techniques, and the use of unsampled or sampled datasets for building classification models, make a difference. We conducted an empirical study on a series of seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three different data-sampling approaches: data sampling followed by feature selection, with the model built on the unsampled data and the selected features (DS-FS-UnSam); data sampling followed by feature selection, with the model built on the sampled data and the selected features (DS-FS-Sam); and feature selection followed by data sampling, with the model built on the sampled data and the selected features (FS-DS). We used Random Undersampling (RUS) to achieve minority:majority class ratios of 35:65 and 50:50. The experimental results show statistically significant differences among the three data-sampling approaches only for the 50:50 class ratio, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the DS-FS-UnSam approach for this class ratio. On the other hand, with the 35:65 class ratio, DS-FS-Sam was most frequently the top-performing approach, and although it was not statistically significantly better than the other approaches, we would generally recommend it for the 35:65 class ratio (although specific choices of learner and ranker may vary). Overall, the optimal approach depends on the choice of class ratio.
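The three approaches differ only in the order of the two steps and in which rows reach the learner, which a short sketch may make easier to see. The code below is one reading of the abstract, not the authors' implementation: the function names, the binary 0/1 labels, and the toy ranker (absolute difference of class means) are all assumptions made for illustration, and the ranker stands in for the four rankers used in the study, which the abstract does not name.

```python
import random


def random_undersample(labels, minority_share, rng, minority=1):
    """Random Undersampling (RUS): keep every minority row and a random
    subset of majority rows so the minority forms `minority_share` of the
    kept rows (0.35 for a 35:65 ratio, 0.5 for 50:50)."""
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    n_major = round(len(min_idx) * (1 - minority_share) / minority_share)
    n_major = min(n_major, len(maj_idx))
    return sorted(min_idx + rng.sample(maj_idx, n_major))


def rank_features(rows, labels, k, minority=1):
    """Toy ranker (an assumption, not one of the study's rankers):
    score each feature by the absolute difference of its class means."""
    def score(j):
        pos = [r[j] for r, y in zip(rows, labels) if y == minority]
        neg = [r[j] for r, y in zip(rows, labels) if y != minority]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(rows[0])), key=score, reverse=True)[:k]


def build_training_set(rows, labels, approach, k, minority_share, seed=0):
    """Return (X, y, selected_features) for one of the three approaches."""
    if approach not in ("DS-FS-UnSam", "DS-FS-Sam", "FS-DS"):
        raise ValueError(f"unknown approach: {approach}")
    rng = random.Random(seed)
    all_rows = list(range(len(rows)))
    if approach == "FS-DS":
        # Rank genes on the full data, then undersample the rows.
        feats = rank_features(rows, labels, k)
        keep = random_undersample(labels, minority_share, rng)
    else:
        # Undersample first, then rank genes on the sampled rows only.
        sampled = random_undersample(labels, minority_share, rng)
        feats = rank_features([rows[i] for i in sampled],
                              [labels[i] for i in sampled], k)
        # DS-FS-UnSam trains on all original rows; DS-FS-Sam on the sample.
        keep = all_rows if approach == "DS-FS-UnSam" else sampled
    X = [[rows[i][j] for j in feats] for i in keep]
    y = [labels[i] for i in keep]
    return X, y, feats
```

Calling `build_training_set(rows, labels, "DS-FS-UnSam", k, 0.5)` returns every original row restricted to the genes chosen on the balanced sample, while `"DS-FS-Sam"` and `"FS-DS"` return only the undersampled rows. The returned `X` and `y` can then be passed to any learner.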
Pages: 315-321
Page count: 7