Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction

Cited by: 1
Authors
Dar, Abdul Waheed [1]
Farooq, Sheikh Umar [1]
Affiliations
[1] Univ Kashmir, Dept Comp Sci, North Campus, Srinagar, India
Keywords
Class imbalance problem; Machine learning; Software defect prediction; Over-sampling; Under-sampling; PERFORMANCE; MACHINE; SMOTE
DOI
10.1007/s11334-024-00571-4
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Various machine learning techniques have been used to build software defect prediction (SDP) models that identify defective software modules. However, class overlap and class imbalance in SDP datasets remain major challenges for these models. This study proposes a new SDP model that combines an overlap-based under-sampling framework with the balanced random forest classifier to improve the identification of defective software modules. First, duplicate instances are removed from the dataset to avoid over-fitting. Next, an overlap-based under-sampling technique removes the overlapped majority (non-defective) class instances from the training data, maximizing the presence of minority (defective) class instances in the region where the two classes overlap. Finally, the balanced random forest, which combines random under-sampling with ensemble learning, is trained on the pre-processed training data to classify modules. The efficacy of the proposed SDP model is assessed by comparing its performance against nine state-of-the-art SDP models on 15 imbalanced software defect datasets. Experimental results and statistical analysis indicate that the proposed SDP model achieves better predictive performance than the compared models in terms of recall, G-mean, F-measure and AUC.
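The two pre-processing steps and the G-mean metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the overlap criterion, so the k-nearest-neighbour rule below (drop a majority instance if any of its k nearest neighbours is a minority instance) is an assumed stand-in, and the function names `drop_duplicates`, `remove_overlapped_majority` and `g_mean` are hypothetical.

```python
import math


def drop_duplicates(X, y):
    """Keep only the first occurrence of each (features, label) pair."""
    seen, X_out, y_out = set(), [], []
    for row, label in zip(X, y):
        key = (tuple(row), label)
        if key not in seen:
            seen.add(key)
            X_out.append(list(row))
            y_out.append(label)
    return X_out, y_out


def remove_overlapped_majority(X, y, k=3, minority=1):
    """Drop majority instances that have a minority instance among their
    k nearest neighbours (assumed overlap rule, not the paper's)."""
    keep = []
    for i, (row, label) in enumerate(zip(X, y)):
        if label == minority:
            keep.append(i)  # minority instances are always kept
            continue
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(row, X[j]))[:k]
        if all(y[j] != minority for j in neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of recall (defective class) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall * specificity)


# Toy data: label 1 = defective (minority), label 0 = non-defective.
X = [[0, 0], [0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [5.5, 5.5]]
y = [0, 0, 0, 0, 1, 1, 0]

X_dedup, y_dedup = drop_duplicates(X, y)
X_clean, y_clean = remove_overlapped_majority(X_dedup, y_dedup, k=2)
```

For the final classification step, one readily available balanced random forest is `BalancedRandomForestClassifier` in the imbalanced-learn package, which could be fitted on the cleaned training data; whether the authors used that implementation is not stated in the abstract.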
Pages: 21