Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18
Authors
Juez-Gil M. [1]
Arnaiz-González Á. [1]
Rodríguez J.J. [1]
García-Osorio C. [1]
Affiliation
[1] Escuela Politécnica Superior, University of Burgos, Burgos
Keywords
Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance
DOI
10.1016/j.asoc.2021.107447
Abstract
Datasets are growing in size and complexity at a pace never seen before, forming ever larger datasets known as Big Data. A common problem for classification, especially in Big Data, is that the numerous examples of the different classes might not be balanced. Imbalanced classification was therefore introduced some decades ago to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experimentation was launched in Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and to reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)