Classification Performance of Three Approaches for Combining Data Sampling and Gene Selection on Bioinformatics Data

Cited by: 0
Authors
Khoshgoftaar, Taghi M. [1]
Fazelpour, Alireza [1]
Dittman, David J. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Keywords
Data sampling techniques; data sampling order; class imbalance; feature selection; CHEMOTHERAPY; PREDICTOR;
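DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract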
Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle the problem of class imbalance, and the problem of excessive features in the dataset may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques together to tackle both of these challenges and build effective classification models. In particular, we ask whether the order of these techniques and the use of unsampled or sampled datasets for building classification models make a difference. We conducted an empirical study on a series of seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three different data-sampling approaches: data sampling followed by feature selection using the unsampled data (DS-FS-UnSam) and selected features; data sampling followed by feature selection using the sampled data (DS-FS-Sam) and selected features; and feature selection followed by data sampling (FS-DS) using sampled data and selected features. We used Random Undersampling (RUS) to achieve the minority:majority class ratios of 35:65 and 50:50. The experimental results show that there are statistically significant differences among the three data-sampling approaches only when using the class ratio of 50:50, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the use of the DS-FS-UnSam approach for this class ratio.
On the other hand, with the 35:65 class ratio, DS-FS-Sam was most frequently the top-performing approach, and although it was not statistically significantly better than the other approaches, we would generally recommend this approach for the 35:65 class ratio (although the best approach for a specific choice of learner and ranker may vary). Overall, the optimal approach depends on the choice of class ratio.
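The three approaches named in the abstract can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the authors' implementation: the function names, the signal-to-noise style ranker, and the synthetic data are all assumptions made here, and the ranker is a stand-in rather than one of the four rankers used in the study. Each function returns the training matrix a learner would be fitted on, the matching labels, and the indices of the selected features.

```python
import numpy as np

def random_undersample(X, y, minority_frac, rng):
    """Randomly drop majority-class rows until the minority class makes up
    roughly `minority_frac` of the rows. Assumes exactly two classes."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    # Majority rows to keep so that minority:majority matches the target ratio.
    n_keep = int(round(len(min_idx) * (1 - minority_frac) / minority_frac))
    n_keep = min(n_keep, len(maj_idx))
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([min_idx, kept]))
    return X[idx], y[idx]

def rank_features(X, y, k):
    """Hypothetical filter ranker (signal-to-noise style): score each feature
    by the gap between class means relative to the class spreads, keep top k."""
    classes = np.unique(y)
    a, b = X[y == classes[0]], X[y == classes[1]]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1][:k]

def ds_fs_unsam(X, y, k, minority_frac, rng):
    """Sample, rank features on the sampled data, train on the UNSAMPLED data."""
    Xs, ys = random_undersample(X, y, minority_frac, rng)
    feats = rank_features(Xs, ys, k)
    return X[:, feats], y, feats

def ds_fs_sam(X, y, k, minority_frac, rng):
    """Sample, rank features on the sampled data, train on the SAMPLED data."""
    Xs, ys = random_undersample(X, y, minority_frac, rng)
    feats = rank_features(Xs, ys, k)
    return Xs[:, feats], ys, feats

def fs_ds(X, y, k, minority_frac, rng):
    """Rank features on the original data, then sample, train on sampled data."""
    feats = rank_features(X, y, k)
    Xs, ys = random_undersample(X[:, feats], y, minority_frac, rng)
    return Xs, ys, feats

# Synthetic stand-in for an imbalanced gene-expression matrix (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.array([1] * 20 + [0] * 180)   # 20 minority rows, 180 majority rows
for approach in (ds_fs_unsam, ds_fs_sam, fs_ds):
    X_train, y_train, feats = approach(X, y, k=5, minority_frac=0.35, rng=rng)
    print(approach.__name__, X_train.shape, int(y_train.sum()))
```

The only difference between the first two functions is which rows reach the learner, and the only difference between the last two is whether the ranker sees the data before or after undersampling, which is the distinction the study evaluates.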
Pages: 315-321
Page count: 7
Related papers
50 in total
  • [21] Multi-Label Bioinformatics Data Classification With Ensemble Embedded Feature Selection
    Guo, Yumeng
    Chung, Fu-Lai
    Li, Guozheng
    Zhang, Lei
    IEEE ACCESS, 2019, 7 : 103863 - 103875
  • [22] Comparing Approaches for Combining Data Sampling and Stacked Autoencoder to address Bankruptcy Prediction
    Smiti, Salima
    Soui, Makram
    Ejbali, Ridha
    Ghedira, Khaled
    VISION 2025: EDUCATION EXCELLENCE AND MANAGEMENT OF INNOVATIONS THROUGH SUSTAINABLE ECONOMIC COMPETITIVE ADVANTAGE, 2019, : 5195 - 5205
  • [23] Random forest for gene selection and microarray data classification
    Moorthy, Kohbalan
    Mohamad, Mohd Saberi
    BIOINFORMATION, 2011, 7 (03) : 142 - 146
  • [24] Feature Selection and Classification in gene expression cancer data
    Pavithra, D.
    Lakshmanan, B.
    2017 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE IN DATA SCIENCE (ICCIDS), 2017,
  • [25] Biomarker selection and classification of carcinogenic gene expression data
    Keum, Changwon
    Lee, Sung Kwang
    Go, Seo youn
    Kwon, Kyoung jin
    Chung, Youn jee
    Cho, Il young
    Sheen, Yhunyhong
    No, Kyoung Tai
    MOLECULAR & CELLULAR TOXICOLOGY, 2007, 3 (04) : 67 - 67
  • [26] Ensemble gene selection by grouping for microarray data classification
    Liu, Huawen
    Liu, Lei
    Zhang, Huijie
    JOURNAL OF BIOMEDICAL INFORMATICS, 2010, 43 (01) : 81 - 87
  • [27] Advances in metaheuristics for gene selection and classification of microarray data
    Duval, Beatrice
    Hao, Jin-Kao
    BRIEFINGS IN BIOINFORMATICS, 2010, 11 (01) : 127 - 141
  • [28] Random Forest for Gene Selection and Microarray Data Classification
    Moorthy, Kohbalan
    Mohamad, Mohd Saberi
    KNOWLEDGE TECHNOLOGY, 2012, 295 : 174 - 183
  • [29] Improving the performance of principal components for classification of gene expression data through feature selection
    Acuna, Edgar
    Porras, Jaime
    DATA SCIENCE AND CLASSIFICATION, 2006, : 325 - +
  • [30] The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification
    Bennin, Kwabena Ebo
    Keung, Jacky
    Monden, Akito
    Phannachitta, Passakorn
    Mensah, Solomon
    11TH ACM/IEEE INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT (ESEM 2017), 2017, : 364 - 373