A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

Cited by: 60
Authors
Elreedy, Dina [1]
Atiya, Amir F. [1]
Kamalov, Firuz [2]
Affiliations
[1] Cairo Univ, Comp Engn Dept, Giza 12613, Egypt
[2] Canadian Univ Dubai, Dept Elect Engn, Dubai 117781, U Arab Emirates
Keywords
SMOTE; Class imbalance; Distribution density; Over-sampling; Minority class; SAMPLING APPROACH; CLASSIFICATION
DOI
10.1007/s10994-022-06296-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Class imbalance occurs when the classes are not equally represented: one class (the minority class) is under-represented, while the other (the majority class) has significantly more samples in the data. The class imbalance problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of SMOTE by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the probability distribution of SMOTE patterns. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by computing it for a number of densities and comparing the results with densities estimated empirically.
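The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote_sample`, its parameters, and the toy data are assumptions made for illustration, and the sketch uses only the Python standard library. Each synthetic point is x + w (x_nn - x), where x is a randomly chosen minority sample, x_nn is one of its K nearest minority neighbors, and w is drawn uniformly from [0, 1).

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Illustrative SMOTE-style generator (hypothetical helper, not from the paper).

    Each synthetic point is a linear interpolation between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank all other minority points by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        # Pick one of the k nearest neighbors uniformly at random.
        neighbor = minority[rng.choice(others[:k])]
        w = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + w * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy minority class: four corners of the unit square plus its center.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_points = smote_sample(minority, k=3, n_new=20)
```

Every generated point lies on a line segment joining two existing minority samples, so the synthetic data cannot extend beyond the region spanned by the observed samples. This is one intuitive reason the generated patterns may not follow the original minority class density.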
Pages: 4903-4923
Page count: 21