Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18
Authors
Juez-Gil M. [1]
Arnaiz-González Á. [1]
Rodríguez J.J. [1]
García-Osorio C. [1]
Affiliations
[1] Escuela Politécnica Superior, University of Burgos, Burgos
Keywords
Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance
DOI
10.1016/j.asoc.2021.107447
Abstract
Datasets are growing in size and complexity at an unprecedented pace, giving rise to the ever larger collections known as Big Data. A common problem for classification, especially in Big Data, is that the examples of the different classes may not be balanced. Imbalanced classification was introduced some decades ago to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All experiments were run on Spark clusters, and ensemble performance and execution times were compared using statistical tests, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)
Related papers
  • [1] Experimental Study and Comparison of Imbalance Ensemble Classifiers with Dynamic Selection Strategy
    Zhao, Dongxue
    Wang, Xin
    Mu, Yashuang
    Wang, Lidong
    ENTROPY, 2021, 23 (07)
  • [2] An experimental comparison of ensemble of classifiers for biometric data
    Nanni, Loris
    Lumini, Alessandra
    NEUROCOMPUTING, 2006, 69 (13-15) : 1670 - 1673
  • [3] Concept Drift Detection and Adaption in Big Imbalance Industrial IoT Data Using an Ensemble Learning Method of Offline Classifiers
    Lin, Chun-Cheng
    Deng, Der-Jiunn
    Kuo, Chin-Hung
    Chen, Linnan
    IEEE ACCESS, 2019, 7 : 56198 - 56207
  • [4] Large Iterative Multitier Ensemble Classifiers for Security of Big Data
    Abawajy, Jemal H.
    Kelarev, Andrei
    Chowdhury, Morshed
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2014, 2 (03) : 352 - 363
  • [5] Ensemble classifiers for biomedical data: performance evaluation
    Elshazly, Hanaa Ismail
    Elkorany, Abeer Mohamed
    Hassanien, Aboul Ella
    Azar, Ahmad Taher
    2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING & SYSTEMS (ICCES), 2013: 184 - 189
  • [6] An Experimental Comparison of Ensemble Classifiers for Evolving Data Streams
    Tambuwal, Ahmad Idris
    Neagu, Daniel
    Gheorghe, Marian
    ARTIFICIAL INTELLIGENCE XXXIV, AI 2017, 2017, 10630 : 156 - 162
  • [7] Hybrid Consensus Pruning of Ensemble Classifiers for Big Data Malware Detection
    Abawajy, Jemal H.
    Chowdhury, Morshed
    Kelarev, Andrei
    IEEE TRANSACTIONS ON CLOUD COMPUTING, 2020, 8 (02) : 398 - 407
  • [8] A similarity evaluation technique for data mining with an ensemble of classifiers
    Puuronen, S
    Terziyan, V
    11TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATION, PROCEEDINGS, 2000: 1155 - 1159
  • [9] Efficient Online Evaluation of Big Data Stream Classifiers
    Bifet, Albert
    Morales, Gianmarco De Francisci
    Read, Jesse
    Holmes, Geoff
    Pfahringer, Bernhard
    KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015: 59 - 68
  • [10] Data imbalance in classification: Experimental evaluation
    Thabtah, Fadi
    Hammoud, Suhel
    Kamalov, Firuz
    Gonsalves, Amanda
    INFORMATION SCIENCES, 2020, 513 : 429 - 441