Comparison of resampling methods for dealing with imbalanced data in binary classification problem

Cited by: 2

Authors:

Park, Geun U. [1]

Jun, Inkyun G. [1]

Affiliations:

[1] Yonsei Univ, Div Biostat, Dept Biomed Syst Informat, Coll Med, 50-1 Yonsei Ro, Seoul 03722, South Korea

Source:

KOREAN JOURNAL OF APPLIED STATISTICS | 2019 / Volume 32 / Issue 03

Keywords:

imbalanced-learn; imbalanced binary data; under-sampling; over-sampling; NEIGHBOR; SMOTE;

DOI:

10.5351/KJAS.2019.32.3.349

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification code:

020208 ; 070103 ; 0714 ;

Abstract:

A class imbalance problem arises when one class greatly outnumbers the other in binary data. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods, one family of approaches for handling imbalance in classification, with the aim of detecting the minority class more effectively. Through simulation, we compared a total of 20 over-sampling, under-sampling, and combined over- and under-sampling methods. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. The simulation results showed that the random under-sampling (RUS) method had the highest sensitivity while keeping accuracy above 0.5. The next most sensitive method was the over-sampling method adaptive synthetic sampling. This suggests that RUS is suitable for finding minority class instances. Results on several real data sets were similar to those of the simulation.
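
As a rough illustration of the random under-sampling idea named in the abstract, the sketch below balances a toy binary data set and defines sensitivity (the recall of the minority class). It is not the authors' code: the keywords suggest the study used the imbalanced-learn package, whereas this sketch uses only the Python standard library. The function names `random_under_sample` and `sensitivity`, the toy data, and the seed are all assumptions made for the example.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    # Keep a random subset of size n_min from each class.
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity(y_true, y_pred, positive=1):
    """True positives divided by all actual positives (minority-class recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (assumed): 95 majority rows (label 0) and 5 minority rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_under_sample(X, y, seed=42)
class_counts = Counter(y_bal)  # both classes now have 5 rows each
```

A classifier would then be fitted on `X_bal` and `y_bal` and scored with `sensitivity` on untouched test data; resampling the test set as well would distort the reported accuracy.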


Pages: 349-374

Number of pages: 26

