Comparison of resampling methods for dealing with imbalanced data in binary classification problem

Cited by: 2
Authors
Park, Geun U. [1]
Jun, Inkyun G. [1]
Affiliations
[1] Yonsei Univ, Div Biostat, Dept Biomed Syst Informat, Coll Med, 50-1 Yonsei Ro, Seoul 03722, South Korea
Keywords
imbalanced-learn; imbalanced binary data; under-sampling; over-sampling; NEIGHBOR; SMOTE;
DOI
10.5351/KJAS.2019.32.3.349
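Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification codes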
020208 ; 070103 ; 0714 ;
Abstract
A class imbalance problem arises in binary data when one class greatly outnumbers the other. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods, one family of approaches for handling imbalance in classification, with the aim of detecting the minority class more effectively. Through simulation, we compared a total of 20 methods covering over-sampling, under-sampling, and combined over- and under-sampling. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. In the simulations, random under-sampling (RUS) had the highest sensitivity while keeping accuracy above 0.5. The next most sensitive method was the adaptive synthetic sampling approach (ADASYN), an over-sampling method. These results suggest that RUS is well suited to finding minority-class observations. Results on several real data sets were similar to those of the simulation.
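To make the abstract's two central ideas concrete, the following is a minimal sketch of random under-sampling and of the sensitivity and accuracy metrics, written with the Python standard library only. It is not the authors' code. The keywords mention imbalanced-learn, but this sketch does not use it. The function names `random_under_sample` and `sensitivity_and_accuracy`, the toy data, and the choice to balance the classes to exactly equal size are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class.

    Hypothetical helper: balancing to a 1:1 ratio is an assumption of this sketch.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, n_min))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity_and_accuracy(y_true, y_pred, positive=1):
    """Return (sensitivity, accuracy); y_true must contain at least one positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return tp / (tp + fn), correct / len(pairs)


# Toy imbalanced data: 95 majority rows (label 0) and 5 minority rows (label 1).
rng = random.Random(42)
X = [[rng.gauss(0, 1)] for _ in range(95)] + [[rng.gauss(2, 1)] for _ in range(5)]
y = [0] * 95 + [1] * 5

X_res, y_res = random_under_sample(X, y, seed=0)
print(Counter(y_res))  # both classes now have 5 rows

# A classifier that always predicts the majority class looks accurate
# but never finds a minority case.
sens, acc = sensitivity_and_accuracy(y, [0] * len(y))
print(sens, acc)
```

The always-majority predictor at the end shows why accuracy alone is a poor guide on imbalanced data, and why the abstract reports sensitivity together with an accuracy threshold.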
Pages: 349-374
Page count: 26