An empirical evaluation of sampling methods for the classification of imbalanced data

Cited by: 31
Authors
Kim, Misuk [1]
Hwang, Kyu-Baek [1]
Affiliations
[1] Soongsil Univ, Grad Sch, Dept Comp Sci & Engn, Seoul, South Korea
Source
PLOS ONE | 2022 / Vol. 17 / Issue 07
Keywords
DATA SETS; PERFORMANCE; CLASSIFIERS; SMOTE
DOI
10.1371/journal.pone.0271260
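Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract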
In numerous classification problems, the class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce than to improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
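The two performance measures and the idea of balancing training data by oversampling can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function names `auroc`, `average_precision`, and `random_oversample` are hypothetical, the toy labels and scores are made up for illustration, average precision is used as one common step-wise estimate of AUPRC, and plain random oversampling stands in for the sampling methods, which the abstract does not enumerate.

```python
import random


def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive.

    Assumes distinct scores; with ties the value depends on the tie order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority examples until both classes match."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy imbalanced data (illustrative only): 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 0.95]
rows = [[s] for s in scores]  # one-feature rows, used only by the sampler

print("AUROC:", auroc(labels, scores))
print("Average precision:", average_precision(labels, scores))

rows_bal, labels_bal = random_oversample(rows, labels)
print("Class counts after oversampling:", labels_bal.count(0), labels_bal.count(1))
```

On this toy ranking, a single high-scoring negative costs more in average precision than in AUROC, which is one intuition for why the two measures can lead to different conclusions on imbalanced data. In practice, oversampling would be applied to the training split only, never to the data used for evaluation.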
Pages: 22
Related papers
50 in total
  • [1] Empirical Study of Sampling Methods for Classification in Imbalanced Clinical Datasets
    Kasem, Asem
    Ghaibeh, A. Ammar
    Moriguchi, Hiroki
    COMPUTATIONAL INTELLIGENCE IN INFORMATION SYSTEMS, CIIS 2016, 2017, 532 : 152 - 162
  • [2] Aided Selection of Sampling Methods for Imbalanced Data Classification
    Sahni, Deep
    Pappu, Satya Jayadev
    Bhatt, Nirav
    CODS-COMAD 2021: PROCEEDINGS OF THE 3RD ACM INDIA JOINT INTERNATIONAL CONFERENCE ON DATA SCIENCE & MANAGEMENT OF DATA (8TH ACM IKDD CODS & 26TH COMAD), 2021: 198 - 202
  • [3] How Far Have We Progressed in the Sampling Methods for Imbalanced Data Classification? An Empirical Study
    Sun, Zhongbin
    Zhang, Jingqi
    Zhu, Xiaoyan
    Xu, Donghong
    ELECTRONICS, 2023, 12 (20)
  • [4] Evaluation of Sampling Methods for Learning from Imbalanced Data
    Goel, Garima
    Maguire, Liam
    Li, Yuhua
    McLoone, Sean
    INTELLIGENT COMPUTING THEORIES, 2013, 7995 : 392 - 401
  • [5] Comparison of Sampling Methods for Imbalanced Data Classification in Random Forest
    Paing, May Phu
    Pintavirooj, C.
    Tungjitkusolmun, Supan
    Choomchuay, Somsak
    Hamamoto, Kazuhiko
    2018 11TH BIOMEDICAL ENGINEERING INTERNATIONAL CONFERENCE (BMEICON 2018), 2018
  • [6] An Evolutionary Sampling Approach for Classification with Imbalanced Data
    Fernandes, Everlandio R. Q.
    de Carvalho, Andre C. P. L. F.
    Coelho, Andre L. V.
    2015 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2015
  • [7] Review of imbalanced data classification methods
    Li Y.-X.
    Chai Y.
    Hu Y.-Q.
    Yin H.-P.
    Kongzhi yu Juece/Control and Decision, 2019, 34 (04): 673 - 688
  • [8] Empirical evaluation of data normalization methods for molecular classification
    Huang, Huei-Chung
    Qin, Li-Xuan
    PEERJ, 2018, 6
  • [9] A Hybrid Sampling SVM Approach to Imbalanced Data Classification
    Wang, Qiang
    ABSTRACT AND APPLIED ANALYSIS, 2014
  • [10] Exploring Data Sampling Techniques for Imbalanced Classification Problems
    Sui, Yu
    Zhang, Xiaohui
    Huan, Jiajia
    Hong, Haifeng
    FOURTH INTERNATIONAL WORKSHOP ON PATTERN RECOGNITION, 2019, 11198