An efficient method to determine sample size in oversampling based on classification complexity for imbalanced data

Cited by: 18

Authors:

Lee, Dohyun [1]

Kim, Kyoungok [2]

Affiliations:

[1] Seoul Natl Univ Sci & Technol Seoul, Dept Data Sci, 232 Gongreungno, Seoul 01811, South Korea

[2] Seoul Natl Univ Sci & Technol Seoul, Dept Ind Engn, 232 Gongreungno, Seoul 01811, South Korea

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2021 / Volume 184 / Issue 184

Funding:

National Research Foundation of Singapore;

Keywords:

Class imbalance; Oversampling; Sampling size; Adaptive boosting; Ensemble learning; DATA-SETS; SMOTE; ENSEMBLES;

DOI:

10.1016/j.eswa.2021.115442

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Resampling, one of the approaches to handling class imbalance, is widely used alone or in combination with other approaches, such as cost-sensitive learning and ensemble learning, because of its simplicity and independence from learning algorithms. Oversampling methods, in particular, alleviate class imbalance by increasing the size of the minority class. However, previous studies on oversampling have generally focused on where to add new samples, how to generate new samples, and how to prevent noise, and they have rarely investigated how much sampling is sufficient. In many cases, the oversampling size is set so that the minority class has the same size as the majority class. This setting considers only the class sizes when determining the sample size, and the balanced training set can induce overfitting by adding too many minority samples. Moreover, the effectiveness of oversampling can be improved by adding synthetic samples in appropriate locations. To address this issue, this study proposes a method that sets the oversampling size below the size needed to balance the classes, considering not only the absolute imbalance but also the difficulty of classification in a dataset, as measured by classification complexity. The effectiveness of the proposed sample size is evaluated using several boosting algorithms with different oversampling methods on 16 imbalanced datasets. The results show that the proposed sample size achieves better classification performance than the sample size that attains class balance.
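
The abstract does not state the actual formula, so the following is only a minimal sketch of the general idea (an oversampling size below the balancing size, scaled by how hard the data are to classify). Everything in it is an assumption made for illustration: the complexity score (a leave-one-out nearest-neighbour error rate on the minority class), the rule in the hypothetical `oversampling_size` helper, and the use of plain random duplication instead of SMOTE-style synthesis. It is not the authors' method.

```python
import math
import random


def nn_error_rate(X, y, target_label):
    """Fraction of target-class points whose nearest other point has a different label.

    Used here as a stand-in complexity score in [0, 1] (an assumption, not the paper's measure).
    Assumes at least one point carries target_label.
    """
    idx = [i for i, label in enumerate(y) if label == target_label]
    misses = 0
    for i in idx:
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        if y[nearest] != target_label:
            misses += 1
    return misses / len(idx)


def oversampling_size(X, y, minority_label):
    """Hypothetical rule: scale the balancing gap by the complexity score."""
    n_min = sum(1 for label in y if label == minority_label)
    n_maj = len(y) - n_min
    gap = n_maj - n_min  # number of new samples that would fully balance the classes
    complexity = nn_error_rate(X, y, minority_label)
    return round(complexity * gap)  # never exceeds the balancing size


def random_oversample(X, y, minority_label, n_new, seed=0):
    """Duplicate randomly chosen minority points (plain random oversampling)."""
    rng = random.Random(seed)
    pool = [x for x, label in zip(X, y) if label == minority_label]
    extra = [rng.choice(pool) for _ in range(n_new)]
    return X + extra, y + [minority_label] * n_new


# Toy data: 8 majority points on a grid, 3 minority points (one overlapping the majority).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1),
     (0.1, 0.1), (10, 10), (10, 11)]
y = [0] * 8 + [1] * 3

n_new = oversampling_size(X, y, minority_label=1)
X_res, y_res = random_oversample(X, y, minority_label=1, n_new=n_new)
print(n_new, len(X_res), y_res.count(1))
```

In this toy case only one of the three minority points sits inside the majority region, so the sketch adds fewer samples than the five that full balancing would require. A harder dataset would push the size toward the balancing gap.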


Pages: 10
