Data imbalance in classification: Experimental evaluation

Cited by: 400
Authors
Thabtah, Fadi [1]
Hammoud, Suhel [2]
Kamalov, Firuz [3]
Gonsalves, Amanda [1]
Affiliations
[1] Manukau Inst Technol, Corner Manukau Stn Rd,Davies Ave, Auckland 2104, New Zealand
[2] Univ Kalamoon, Deir Atiyah An Nabek Dist Rif Dimashq Governorate, Deir Atiyah, Syria
[3] Canadian Univ Dubai, Sheikh Zayed Rd, Dubai, U Arab Emirates
Keywords
Classification; Class imbalance; Data analysis; Machine learning; Statistical analysis; Supervised learning; FEATURES
DOI
10.1016/j.ins.2019.11.004
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The advent of Big Data has ushered in a new era of scientific breakthroughs. One of the common issues affecting raw data is the class imbalance problem, which refers to an imbalanced distribution of the values of the response variable. This issue is present in fraud detection, network intrusion detection, medical diagnostics, and a number of other fields where negatively labeled instances significantly outnumber positively labeled instances. Modern machine learning techniques struggle with imbalanced data because they focus on minimizing the error rate for the majority class while ignoring the minority class. The goal of our paper is to demonstrate the effects of class imbalance on classification models. Concretely, we study the impact of varying class imbalance ratios on classifier accuracy. By highlighting the precise nature of the relationship between the degree of class imbalance and the corresponding effects on classifier performance, we hope to help researchers better tackle the problem. To this end, we carry out extensive experiments using 10-fold cross validation on a large number of datasets. In particular, we determine that the relationship between the class imbalance ratio and the accuracy is convex. © 2019 Elsevier Inc. All rights reserved.
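The kind of measurement the abstract describes (accuracy under 10-fold cross validation as the class ratio varies) can be sketched in a few lines. The following is a minimal illustration only and is not the authors' experimental setup: the labels are synthetic, the "classifier" is a hypothetical majority-class baseline that always predicts the most frequent training label, and all function names are made up for this sketch. For such a baseline the accuracy equals max(p, 1 - p), where p is the fraction of positive labels, which is one simple reason accuracy alone can look good on heavily imbalanced data.

```python
import random
from collections import Counter


def make_labels(n, positive_fraction):
    """Return n synthetic binary labels with the requested share of positives (label 1)."""
    n_pos = round(n * positive_fraction)
    return [1] * n_pos + [0] * (n - n_pos)


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds that roughly preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cv_accuracy_majority(labels, k=10):
    """k-fold cross-validated accuracy of a baseline that always predicts
    the majority class seen in the training folds (an assumed stand-in
    for a real classifier)."""
    folds = stratified_folds(labels, k)
    correct = 0
    for test_idx in folds:
        test_set = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in test_set]
        majority = Counter(train).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test_idx)
    return correct / len(labels)


if __name__ == "__main__":
    for frac in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        acc = cv_accuracy_majority(make_labels(1000, frac))
        print(f"positive fraction {frac:.1f}: accuracy {acc:.3f}")
```

In this toy setting the baseline's accuracy is lowest for balanced classes and rises as either class dominates. A real study would replace the baseline with trained classifiers and real datasets, and would typically report minority-class metrics alongside accuracy.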
Pages: 429 - 441
Page count: 13
Related papers
50 in total
  • [1] Experimental evaluation of ensemble classifiers for imbalance in Big Data
    Juez-Gil M.
    Arnaiz-González Á.
    Rodríguez J.J.
    García-Osorio C.
    Applied Soft Computing, 2021, 108
  • [2] Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study
    Abdelhamid, Neda
    Padmavathy, Arun
    Peebles, David
    Thabtah, Fadi
    Goulder-Horobin, Daymond
    JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2020, 19 (01)
  • [3] Class imbalance in gradient boosting classification algorithms: Application to experimental stroke data
    Lyashevska, Olga
    Malone, Fiona
    MacCarthy, Eugene
    Fiehler, Jens
    Buhk, Jan-Hendrik
    Morris, Liam
    STATISTICAL METHODS IN MEDICAL RESEARCH, 2021, 30 (03) : 916 - 925
  • [4] Dealing with Data Imbalance in Text Classification
    Padurariu, Cristian
    Breaban, Mihaela Elena
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KES 2019), 2019, 159 : 736 - 745
  • [5] Development of Evaluation Metrics that Consider Data Imbalance between Classes in Facies Classification
    Kim, Dowan
    Choi, Junhwan
    Byun, Joongmoo
    GEOPHYSICS AND GEOPHYSICAL EXPLORATION, 2020, 23 (03) : 131 - 140
  • [6] Cost-sensitive Strategies for Data Imbalance in Bug Severity Classification: Experimental Results
    Roy, Nivir Kanti Singha
    Rossi, Bruno
    2017 43RD EUROMICRO CONFERENCE ON SOFTWARE ENGINEERING AND ADVANCED APPLICATIONS (SEAA), 2017 : 426 - 429
  • [7] Improving SVM Classification with Imbalance Data Set
    Zeng, Zhi-Qiang
    Gao, Ji
    NEURAL INFORMATION PROCESSING, PT 1, PROCEEDINGS, 2009, 5863 : 389 - +
  • [8] Transient Detection Modeling as Imbalance Data Classification
    Tabacolde, Aireen B.
    Boongoen, Tossapon
    Iam-On, Natthakan
    Mullaney, James
    Sawangwit, Utane
    Ulaczyk, Krzysztof
    PROCEEDINGS OF THE 2018 1ST IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE INNOVATION AND INVENTION (ICKII 2018), 2018, : 180 - 183
  • [9] RHSBoost: Improving classification performance in imbalance data
    Gong, Joonho
    Kim, Hyunjoong
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2017, 111 : 1 - 13
  • [10] Iterative Metric Learning for Imbalance Data Classification
    Wang, Nan
    Zhao, Xibin
    Jiang, Yu
    Gao, Yue
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 2805 - 2811