A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

Cited by: 21
Authors
Gong, Lina [1]
Zhang, Haoxiang [2]
Zhang, Jingxuan [1]
Wei, Mingqiang [1]
Huang, Zhiqiu [1]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210095, Jiangsu, Peoples R China
[2] Queens Univ, Sch Comp, Software Anal & Intelligence Lab SAIL, Kingston, ON K7L 3N6, Canada
Funding
National Natural Science Foundation of China;
Keywords
Class overlap; data quality; k-nearest neighbourhood; local analysis; software defect prediction; software metrics; FALSE DISCOVERY RATE; CLASSIFICATION; CLASSIFIERS; MACHINE; ERROR;
DOI
10.1109/TSE.2022.3220740
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Software Defect Prediction (SDP) is one of the most vital and cost-efficient operations for ensuring software quality. However, SDP datasets exhibit class overlap (i.e., defective and non-defective modules have similar metric values), which hinders both the performance and the use of SDP models. Although efforts have been made to investigate the impact of overlap-removal techniques on the performance of SDP, many issues remain open. Therefore, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an approach for identifying overlapping instances by analyzing the class distribution in the local neighborhood of a given instance. We then investigate the impact of class overlap and of two common overlapping-instance handling techniques on the performance and the interpretation of seven representative SDP models. Through an extensive case study on 230 diverse datasets, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models; iii) class overlap affects the ranking of important features in SDP models, particularly the features at the top 2 and top 3 ranks; iv) class overlap handling techniques can statistically significantly improve the performance of SDP models trained on datasets with overlap ratios above 12.5%. We suggest that future work apply our KNN method to identify the overlap ratios of datasets before building SDP models.
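The abstract describes identifying overlapping instances from the class distribution in each instance's local neighborhood. The following is a minimal sketch of that general idea, not the authors' actual procedure: the function names, the brute-force Euclidean search, the default `k`, and the `min_other_fraction` threshold are all assumptions made for illustration.

```python
import math


def find_overlapping(X, y, k=5, min_other_fraction=0.5):
    """Return indices of instances flagged as overlapping.

    Assumed criterion (illustrative only): an instance is flagged when
    at least `min_other_fraction` of its k nearest neighbours carry a
    different class label.
    """
    flagged = []
    for i, xi in enumerate(X):
        # Brute-force Euclidean distances to every other instance;
        # ties are broken by index because tuples sort lexicographically.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        if not neighbours:
            continue
        other = sum(1 for j in neighbours if y[j] != y[i])
        if other / len(neighbours) >= min_other_fraction:
            flagged.append(i)
    return flagged


def overlap_ratio(X, y, k=5, min_other_fraction=0.5):
    """Fraction of instances flagged as overlapping."""
    return len(find_overlapping(X, y, k, min_other_fraction)) / len(X)


# Toy data (hypothetical): label 0 = non-defective, 1 = defective.
# The last defective module sits in the middle of the non-defective cluster.
X_demo = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [0.5, 0.5]]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]

flagged_demo = find_overlapping(X_demo, y_demo, k=3)
ratio_demo = overlap_ratio(X_demo, y_demo, k=3)
```

In practice the metrics would likely be normalized before computing distances, and a ratio computed this way could be compared against a cutoff such as the 12.5% mentioned above before deciding whether to apply an overlap handling technique.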
Pages: 2440 - 2458
Number of pages: 19
Related papers
50 in total
  • [1] Empirical Evaluation of the Impact of Class Overlap on Software Defect Prediction
    Gong, Lina
    Jiang, Shujuan
    Wang, Rongcun
    Jiang, Li
    34TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2019), 2019, : 710 - 721
  • [2] Tackling class overlap and imbalance problems in software defect prediction
    Lin Chen
    Bin Fang
    Zhaowei Shang
    Yuanyan Tang
    Software Quality Journal, 2018, 26 : 97 - 125
  • [3] Tackling class overlap and imbalance problems in software defect prediction
    Chen, Lin
    Fang, Bin
    Shang, Zhaowei
    Tang, Yuanyan
    SOFTWARE QUALITY JOURNAL, 2018, 26 (01) : 97 - 125
  • [4] An ensemble model for addressing class imbalance and class overlap in software defect prediction
    Dar, Abdul Waheed
    Farooq, Sheikh Umar
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2024, 15 (12) : 5584 - 5603
  • [5] ROCT: Radius-based Class Overlap Cleaning Technique to Alleviate the Class Overlap Problem in Software Defect Prediction
    Feng, Shuo
    Keung, Jacky
    Liu, Jie
    Xiao, Yan
    Yu, Xiao
    Zhang, Miao
    2021 IEEE 45TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2021), 2021, : 228 - 237
  • [6] A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction
    Song, Qinbao
    Guo, Yuchen
    Shepperd, Martin
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2019, 45 (12) : 1253 - 1269
  • [7] A Software Defect Prediction Method That Simultaneously Addresses Class Overlap and Noise Issues after Oversampling
    Wang, Renliang
    Liu, Feng
    Bai, Yanhui
    ELECTRONICS, 2024, 13 (20)
  • [8] The Impact Study of Class Imbalance on the Performance of Software Defect Prediction Models
    Yu Q.
    Jiang S.-J.
    Zhang Y.-M.
    Wang X.-Y.
    Gao P.-F.
    Qian J.-Y.
    Science Press, 2018, 41 : 809 - 824
  • [9] Handling class overlap and imbalance using overlap driven under-sampling with balanced random forest in software defect prediction
    Dar, Abdul Waheed
    Farooq, Sheikh Umar
    INNOVATIONS IN SYSTEMS AND SOFTWARE ENGINEERING, 2024
  • [10] An Empirical Study of the Impact of Class Overlap on the Performance and Interpretability of Cross-Version Defect Prediction
    Han, Hui
    Yu, Qiao
    Zhu, Yi
    Cheng, Shengyi
    Zhang, Yu
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2024, 34 (12) : 1895 - 1918