Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction

Cited by: 1
Authors
Dar, Abdul Waheed [1]
Farooq, Sheikh Umar [1]
Affiliations
[1] Univ Kashmir, Dept Comp Sci, North Campus, Srinagar, India
Keywords
Class imbalance problem; Machine learning; Software defect prediction; Over-sampling; Under-sampling; PERFORMANCE; MACHINE; SMOTE;
DOI
10.1007/s11334-024-00571-4
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Various techniques in machine learning have been used for building software defect prediction (SDP) models to identify the defective software modules. However, a major challenge to SDP models is the class overlapping and the class imbalance problem of SDP datasets. This study proposes a new SDP model that combines the overlap-based under-sampling framework with the balanced random forest classifier to improve the identification of defective software modules. First, the duplicate instances of the dataset are removed to avoid the over-fitting of the model. Next, the overlapped majority non-defective class instances of the training data are removed by applying an overlap-based under-sampling technique to maximize the presence of minority defective class instances in a region where the two classes overlap. Finally, we use the balanced random forest, which combines the random under-sampling and the ensemble learning techniques on the pre-processed training data for achieving the goal of classification prediction. The efficacy of our proposed SDP model is assessed by comparing its performance against nine state-of-the-art SDP models using 15 imbalanced software defect datasets. Experimental results and the statistical analysis indicate that our proposed SDP model has better predictive performance than other test models in terms of recall, G-mean, F-measure and AUC.
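The three steps named in the abstract (duplicate removal, overlap-based under-sampling of the majority class, balanced random forest) can be illustrated with a minimal sketch. This is not the authors' implementation. The overlap rule used here (drop any non-defective instance that is among the k nearest neighbours of a defective instance) is an assumed stand-in, the forest is hand-rolled from scikit-learn decision trees, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def drop_duplicates(X, y):
    """Step 1: keep one copy of each (features, label) row."""
    rows = np.column_stack([X, y])
    _, keep = np.unique(rows, axis=0, return_index=True)
    keep = np.sort(keep)  # preserve the original row order
    return X[keep], y[keep]


def remove_overlapped_majority(X, y, k=5, minority=1):
    """Step 2 (assumed rule): drop majority rows that appear among the
    k nearest neighbours of any minority row."""
    min_idx = np.flatnonzero(y == minority)
    # k + 1 neighbours because each query point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X))).fit(X)
    near_minority = np.unique(nn.kneighbors(X[min_idx], return_distance=False))
    drop = near_minority[y[near_minority] != minority]
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]


def fit_balanced_forest(X, y, n_trees=50, minority=1, seed=0):
    """Step 3: each tree is trained on a bootstrap of the minority class
    plus an equally sized random draw from the majority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    n = len(min_idx)
    trees = []
    for _ in range(n_trees):
        sample = np.concatenate([
            rng.choice(min_idx, size=n, replace=True),
            rng.choice(maj_idx, size=n, replace=True),
        ])
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(1 << 31))
        )
        trees.append(tree.fit(X[sample], y[sample]))
    return trees


def predict(trees, X):
    """Majority vote over the trees; labels are assumed to be 0/1."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)


def g_mean(y_true, y_pred):
    """Geometric mean of recall on the defective and non-defective classes."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return float(np.sqrt(tpr * tnr))


# Synthetic, imbalanced, overlapping data: 300 non-defective (0), 30 defective (1).
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(300, 2))
X_min = rng.normal(1.5, 1.0, size=(30, 2))
X = np.vstack([X_maj, X_min, X_min[:5]])  # last 5 rows duplicate earlier ones
y = np.array([0] * 300 + [1] * 35)

X1, y1 = drop_duplicates(X, y)
X2, y2 = remove_overlapped_majority(X1, y1, k=5)
trees = fit_balanced_forest(X2, y2)
pred = predict(trees, X1)
score = g_mean(y1, pred)
```

In this sketch only the training data are under-sampled, and every defective instance is retained; a real evaluation would score the model on a held-out test split rather than on `X1`.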
Pages: 21
Related papers
48 in total
  • [31] An Empirical Study on Data Sampling Methods in Addressing Class Imbalance Problem in Software Defect Prediction
    Odejide, Babajide J.
    Bajeh, Amos O.
    Balogun, Abdullateef O.
    Alanamu, Zubair O.
    Adewole, Kayode S.
    Akintola, Abimbola G.
    Salihu, Shakirat A.
    Usman-Hamza, Fatima E.
    Mojeed, Hammed A.
    SOFTWARE ENGINEERING PERSPECTIVES IN SYSTEMS, VOL. 1, 2022, 501 : 594 - 610
  • [32] Software Defect Prediction using Feature Selection and Random Forest Algorithm
    Ibrahim, Dyana Rashid
    Ghnemat, Rawan
    Hudaib, Amjad
    2017 INTERNATIONAL CONFERENCE ON NEW TRENDS IN COMPUTING SCIENCES (ICTCS), 2017, : 252 - 257
  • [33] Protein-Protein Interaction Sites Prediction Based on an Under-Sampling Strategy and Random Forest Algorithm
    Li, Minjie
    Wu, Ziheng
    Wang, Wenyan
    Lu, Kun
    Zhang, Jun
    Zhou, Yuming
    Chen, Zhaoquan
    Li, Dan
    Zheng, Shicheng
    Chen, Peng
    Wang, Bing
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2022, 19 (06) : 3646 - 3654
  • [34] Entropy and improved k-nearest neighbor search based under-sampling (ENU) method to handle class overlap in imbalanced datasets
    Kumar, Anil
    Singh, Dinesh
    Yadav, Rama Shankar
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023,
  • [35] Entropy and improved k-nearest neighbor search based under-sampling (ENU) method to handle class overlap in imbalanced datasets
    Kumar, Anil
    Singh, Dinesh
    Yadav, Rama Shankar
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2024, 36 (02):
  • [36] A Multiple Expert Approach to the Class Imbalance Problem Using Inverse Random under Sampling
    Tahir, Muhammad Atif
    Kittler, Josef
    Mikolajczyk, Krystian
    Yan, Fei
    MULTIPLE CLASSIFIER SYSTEMS, PROCEEDINGS, 2009, 5519 : 82 - 91
  • [37] Churn Prediction for High-Value Players in Freemium Mobile Games: Using Random Under-Sampling
    Wang, Guan-Yuan
    STATISTIKA-STATISTICS AND ECONOMY JOURNAL, 2022, 102 (04) : 443 - 453
  • [38] Handling Imbalanced Data in Customer Churn Prediction Using Combined Sampling and Weighted Random Forest
    Effendy, Veronikha
    Adiwijaya
    Baizal, Z. K. A.
    2014 2ND INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY (ICOICT), 2014,
  • [39] Diversity based multi-cluster over sampling approach to alleviate the class imbalance problem in software defect prediction
    Arun, C.
    Lakshmi, C.
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2023,
  • [40] Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering
    Gong, Lina
    Jiang, Shujuan
    Jiang, Li
    IEEE ACCESS, 2019, 7 : 145725 - 145737