Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?

Citations: 0
Authors
Fazelpour, Alireza [1 ]
Khoshgoftaar, Taghi M. [1 ]
Dittman, David J. [1 ]
Napolitano, Amri [1 ]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Keywords
Boosting; data sampling; ensemble learning; class imbalance; bioinformatics; CHEMOTHERAPY; PREDICTOR;
DOI
10.1109/ICMLA.2015.23
Chinese Library Classification (CLC): TP3 [Computing Technology, Computer Technology]
Discipline Code: 0812
Abstract
Bioinformatics datasets exhibit many challenging characteristics, such as class imbalance, which adversely affect the performance of supervised classification models built on them. Techniques from the data mining domain, such as ensemble learning and data sampling, can be deployed to alleviate this problem and improve classification performance. In this study, we sought to determine whether including data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two new hybrid ensemble techniques (one integrates feature selection within the boosting process, while the other incorporates random under-sampling followed by feature selection within the boosting framework), two learners, three feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, since the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study to examine the effect of data sampling (specifically, random under-sampling) on the classification performance of boosting algorithms for highly imbalanced bioinformatics data.
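For illustration only, the sketch below (not the authors' implementation) shows, in Python with NumPy and scikit-learn, what the two hybrid boosting variants compared in the abstract might look like: an AdaBoost.M1-style loop that performs feature selection at every boosting round, with an optional random under-sampling (RUS) step beforehand. The function names (rus_indices, boost_with_fs, ensemble_predict), the SelectKBest/f_classif ranker, and the decision-stump base learner are illustrative assumptions, not the specific learners, feature rankers, or subset sizes used in the paper.

# Minimal, hedged sketch: AdaBoost.M1-style boosting with per-round feature
# selection, optionally preceded by random under-sampling (RUS). Assumes
# scikit-learn and NumPy; all names below are illustrative, not the paper's.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

def rus_indices(y, rng):
    # Keep every minority-class instance plus an equal-sized random subset
    # of the majority class (random under-sampling).
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = rng.choice(np.where(y == majority)[0], size=len(min_idx), replace=False)
    return np.concatenate([min_idx, maj_idx])

def boost_with_fs(X, y, n_rounds=10, k=25, use_rus=False, seed=0):
    # Boosting with feature selection in each round; set use_rus=True for the
    # variant that under-samples before selecting features.
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # boosting instance weights
    ensemble = []                                # (alpha, selector, learner) triples
    for _ in range(n_rounds):
        idx = rus_indices(y, rng) if use_rus else np.arange(n)
        Xt, yt, wt = X[idx], y[idx], w[idx]
        selector = SelectKBest(f_classif, k=min(k, X.shape[1])).fit(Xt, yt)
        learner = DecisionTreeClassifier(max_depth=1).fit(
            selector.transform(Xt), yt, sample_weight=wt)
        pred = learner.predict(selector.transform(X))
        err = w[pred != y].sum()                 # weighted training error
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(np.where(pred != y, alpha, -alpha))
        w /= w.sum()                             # renormalize the weight distribution
        ensemble.append((alpha, selector, learner))
    return ensemble

def ensemble_predict(ensemble, X, classes):
    # Weighted vote over the boosted, feature-selected learners.
    votes = np.zeros((X.shape[0], len(classes)))
    for alpha, selector, learner in ensemble:
        pred = learner.predict(selector.transform(X))
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)
    return classes[np.argmax(votes, axis=1)]

Comparing boost_with_fs(X_train, y_train, use_rus=False) against boost_with_fs(X_train, y_train, use_rus=True) on a held-out split, scoring each with ensemble_predict(model, X_test, np.unique(y_train)), mirrors the question the paper evaluates: whether the RUS step adds anything once feature selection is already inside the boosting loop.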
Pages: 527-534
Page count: 8
Related Papers (50 in total)
  • [41] An Empirical Study of Boosting Methods on Severely Imbalanced Data
    Liu, Xu-Ying
    APPLIED SCIENCE, MATERIALS SCIENCE AND INFORMATION TECHNOLOGIES IN INDUSTRY, 2014, 513-517 : 2510 - 2513
  • [42] Boosting Support Vector Machines for Imbalanced Microarray Data
    Pratama, Risky Frasetio Wahyu
    Purnami, Santi Wulan
    Rahayu, Santi Puteri
    INNS CONFERENCE ON BIG DATA AND DEEP LEARNING, 2018, 144 : 174 - 183
  • [43] Cost-sensitive boosting for classification of imbalanced data
Sun, Yanmin
    Kamel, Mohamed S.
    Wong, Andrew K. C.
    Wang, Yang
    PATTERN RECOGNITION, 2007, 40 (12) : 3358 - 3378
  • [44] An evaluation of progressive sampling for imbalanced data sets
    Ng, Willie
    Dash, Manoranjan
    ICDM 2006: SIXTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, WORKSHOPS, 2006, : 657 - +
  • [45] A Constructive Method for Data Reduction and Imbalanced Sampling
    Liu, Fei
    Yan, Yuanting
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2023, PT III, 2024, 14489 : 476 - 489
  • [46] A New Combination Sampling Method for Imbalanced Data
    Li, Hu
    Zou, Peng
    Wang, Xiang
    Xia, Rongze
    PROCEEDINGS OF 2013 CHINESE INTELLIGENT AUTOMATION CONFERENCE: INTELLIGENT INFORMATION PROCESSING, 2013, 256 : 547 - 554
  • [47] Hyperspectral data preprocessing to improve performance of classification algorithms
    Subramanian, S
    Gat, N
    Barhen, J
    IMAGING SPECTROMETRY III, 1997, 3118 : 232 - 240
  • [48] An Evolutionary Sampling Approach for Classification with Imbalanced Data
    Fernandes, Everlandio R. Q.
    de Carvalho, Andre C. P. L. F.
    Coelho, Andre L. V.
    2015 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2015,
  • [49] A Cluster Switching Method for Sampling Imbalanced Data
    Prachuabsupakij, Wanthanee
    Simcharoen, Supaporn
    ISMSI 2018: PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS, METAHEURISTICS & SWARM INTELLIGENCE, 2018, : 12 - 16
  • [50] Using evolutionary sampling to mine imbalanced data
    Drown, Dennis J.
    Khoshgoftaar, Taghi M.
Narayanan, Ramaswamy
    ICMLA 2007: SIXTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, PROCEEDINGS, 2007, : 363 - 368