Hybrid sampling for imbalanced data

Cited by: 51
Authors
Seiffert, Chris [1]
Khoshgoftaar, Taghi M. [1]
Van Hulse, Jason [1]
Affiliations
[1] Florida Atlantic Univ, Dept Comp Sci & Engn, Data Min & Machine Learning Lab, Boca Raton, FL 33431 USA
Keywords
Class imbalance; classification; sampling; binary classification; hybrid sampling; SMOTE
DOI
10.3233/ICA-2009-0314
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Building a classification model on imbalanced datasets can be a challenging endeavor. Models built on data where examples of one class are greatly outnumbered by examples of the other class(es) tend to sacrifice accuracy with respect to the underrepresented class in favor of maximizing the overall classification rate. Several methods have been suggested to alleviate the problem of class imbalance. One common technique that has received much attention in recent research is data sampling. Data sampling either adds examples to the minority class (oversampling) or removes examples from the majority class (undersampling) in order to create a more balanced data set. Both oversampling and undersampling have their strengths and drawbacks. In this work we propose a hybrid sampling procedure that uses a combination of two sampling techniques to create a balanced data set. By using more than one sampling technique, we can combine the strengths of the individual techniques while lessening the drawbacks. We perform a comprehensive set of experiments, with more than one million classifiers built, showing that our hybrid sampling procedure almost always outperforms the individual sampling techniques.
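As an illustration of the sampling idea described in the abstract, the following minimal Python sketch balances a binary data set by combining random undersampling of the majority class with random oversampling of the minority class. It is not the authors' procedure: the abstract does not state which two techniques are combined or how the work is split between them. The function name `hybrid_sample`, its parameters, and the choice of a midpoint target size are all assumptions made for this example.

```python
import random
from collections import Counter


def hybrid_sample(examples, labels, minority, majority, seed=0):
    """Balance a binary data set (illustrative sketch, not the paper's method).

    The majority class is randomly undersampled and the minority class is
    randomly oversampled until both reach a common target size, chosen here
    (as an assumption) to be the midpoint of the two class sizes.
    """
    rng = random.Random(seed)
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    if not min_idx or not maj_idx:
        raise ValueError("both classes must be present")
    if len(min_idx) > len(maj_idx):
        raise ValueError("minority class is larger than majority class")

    # Assumed target: halfway between the minority and majority class sizes.
    target = (len(min_idx) + len(maj_idx)) // 2

    # Random undersampling: keep `target` majority examples, no replacement.
    kept_majority = rng.sample(maj_idx, target)

    # Random oversampling: keep every minority example, then add randomly
    # chosen duplicates (with replacement) until the class reaches `target`.
    duplicates = [rng.choice(min_idx) for _ in range(target - len(min_idx))]

    chosen = kept_majority + min_idx + duplicates
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


if __name__ == "__main__":
    # Toy data: 10 minority examples (label 1) and 90 majority examples (label 0).
    X = list(range(100))
    y = [1] * 10 + [0] * 90
    X_bal, y_bal = hybrid_sample(X, y, minority=1, majority=0)
    print(Counter(y_bal))
```

With the midpoint target, the majority class loses fewer examples than under pure undersampling and the minority class gains fewer duplicates than under pure oversampling, which is one simple way to trade off the drawbacks of each technique.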
Pages: 193-210
Number of pages: 18