Classification Performance of Three Approaches for Combining Data Sampling and Gene Selection on Bioinformatics Data

Cited by: 0

Authors:

Khoshgoftaar, Taghi M. [1]

Fazelpour, Alireza [1]

Dittman, David J. [1]

Napolitano, Amri [1]

Affiliation:

[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA

Source:

2014 IEEE 15TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI) | 2014

Keywords:

Data sampling techniques; data sampling order; class imbalance; feature selection; CHEMOTHERAPY; PREDICTOR;

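DOI:

Not available

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract: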

Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle the problem of class imbalance, and the problem of excessive features in the dataset may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques simultaneously to tackle both of these challenges and build effective classification models. In particular, we ask whether the order of these techniques and the use of unsampled or sampled datasets for building classification models make a difference. We conducted an empirical study on a series of seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three different data-sampling approaches: data sampling followed by feature selection using the unsampled data (DS-FS-UnSam) and selected features; data sampling followed by feature selection using the sampled data (DS-FS-Sam) and selected features; and feature selection followed by data sampling (FS-DS) using sampled data and selected features. We used Random Undersampling (RUS) to achieve the minority:majority class ratios of 35:65 and 50:50. The experimental results show that there are statistically significant differences among the three data-sampling approaches only when using the class ratio of 50:50, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the use of the DS-FS-UnSam approach for this class ratio. On the other hand, with the 35:65 class ratio, DS-FS-Sam was most frequently the top-performing approach, and although it was not statistically significantly better than the other approaches, we would generally recommend this approach be used for the 35:65 class ratio (although specific choices of learner and ranker may vary). Overall, we can see that the optimal approach will depend on the choice of class ratio.
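The three approaches named in the abstract differ in which rows are used to rank features and which rows are used to train the model. The sketch below is one possible reading of that description, not the authors' implementation. The function names (`random_undersample`, `rank_features`, `training_view`) are hypothetical, and the mean-difference ranker is a toy stand-in assumed for illustration, not one of the four rankers used in the study.

```python
import random


def random_undersample(y, minority_label, minority_fraction, rng):
    """Keep every minority row plus a random subset of majority rows so that
    the minority share of the result is roughly minority_fraction."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Number of majority rows needed to reach the target ratio.
    target = round(len(minority) * (1 - minority_fraction) / minority_fraction)
    kept = rng.sample(majority, min(target, len(majority)))
    return sorted(minority + kept)


def rank_features(X, y, rows, minority_label, k):
    """Toy ranker (an assumption): score each feature by the absolute
    difference between class means over the given rows, return the top k."""
    pos = [r for r in rows if y[r] == minority_label]
    neg = [r for r in rows if y[r] != minority_label]

    def score(j):
        mean_pos = sum(X[r][j] for r in pos) / len(pos)
        mean_neg = sum(X[r][j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def training_view(X, y, approach, minority_label=1,
                  minority_fraction=0.5, k=2, seed=0):
    """Return (rows, features) that a learner would be trained on."""
    rng = random.Random(seed)
    all_rows = list(range(len(y)))
    if approach == "FS-DS":
        # Rank features on the full data, then undersample the rows.
        features = rank_features(X, y, all_rows, minority_label, k)
        rows = random_undersample(y, minority_label, minority_fraction, rng)
    elif approach in ("DS-FS-UnSam", "DS-FS-Sam"):
        # Undersample first and rank features on the sampled rows.
        sampled = random_undersample(y, minority_label, minority_fraction, rng)
        features = rank_features(X, y, sampled, minority_label, k)
        # UnSam trains on all rows; Sam trains on the sampled rows only.
        rows = all_rows if approach == "DS-FS-UnSam" else sampled
    else:
        raise ValueError(f"unknown approach: {approach}")
    return rows, features


if __name__ == "__main__":
    data_rng = random.Random(42)
    y = [1] * 4 + [0] * 16  # 4 minority rows, 16 majority rows
    X = [[data_rng.random() for _ in range(5)] for _ in y]
    for name in ("DS-FS-UnSam", "DS-FS-Sam", "FS-DS"):
        rows, features = training_view(X, y, name, minority_fraction=0.35)
        print(name, len(rows), features)
```

Under this reading, only DS-FS-UnSam trains on the full, still-imbalanced data. The other two train on the undersampled rows and differ only in whether features were ranked before or after sampling.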

Pages: 315-321

Page count: 7

