A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

Cited by: 60
Authors
Elreedy, Dina [1]
Atiya, Amir F. [1]
Kamalov, Firuz [2]
Affiliations
[1] Cairo Univ, Comp Engn Dept, Giza 12613, Egypt
[2] Canadian Univ Dubai, Dept Elect Engn, Dubai 117781, U Arab Emirates
Keywords
SMOTE; Class imbalance; Distribution density; Over-sampling; Minority class; SAMPLING APPROACH; CLASSIFICATION;
DOI
10.1007/s10994-022-06296-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Class imbalance occurs when the classes are not equally represented: one class (the minority class) is under-represented, while the other (the majority class) has significantly more samples in the data. The class imbalance problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work to derive a mathematical formulation for the probability distribution of SMOTE patterns. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by evaluating it on a number of densities and comparing the results with densities estimated empirically.
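The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are assumptions made for illustration, and only NumPy is used. Each synthetic point is placed at a uniformly random position on the segment joining a randomly chosen minority sample and one of its K nearest minority neighbors.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of minority samples")

    # Brute-force pairwise squared Euclidean distances (fine for small n).
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of every minority sample.
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    # Pick a base sample, then one of its k neighbors, uniformly at random.
    base = rng.integers(0, n, size=n_new)
    nbr = nn_idx[base, rng.integers(0, k, size=n_new)]

    # Linear interpolation with a uniform random gap in [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nbr] - X_min[base])


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    X_minority = demo_rng.normal(size=(50, 2))  # toy minority class
    X_synth = smote_sketch(X_minority, n_new=200, k=5)
    print(X_synth.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated points cannot leave the convex hull of the minority data. This is one intuitive reason the generated density can differ from the true class-conditional density, which is the comparison the paper formalizes.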
Pages: 4903-4923
Number of pages: 21