A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

Cited by: 21
Authors
Gong, Lina [1]
Zhang, Haoxiang [2]
Zhang, Jingxuan [1]
Wei, Mingqiang [1]
Huang, Zhiqiu [1]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210095, Jiangsu, Peoples R China
[2] Queens Univ, Sch Comp, Software Anal & Intelligence Lab SAIL, Kingston, ON K7L 3N6, Canada
Funding
National Natural Science Foundation of China;
Keywords
Class overlap; data quality; k-nearest neighbourhood; local analysis; software defect prediction; software metrics; FALSE DISCOVERY RATE; CLASSIFICATION; CLASSIFIERS; MACHINE; ERROR;
DOI
10.1109/TSE.2022.3220740
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Software Defect Prediction (SDP) is one of the most vital and cost-efficient operations for ensuring software quality. However, SDP datasets exhibit class overlap (i.e., defective and non-defective modules have similar metric values), which hinders both the performance and the use of SDP models. Although prior work has investigated the impact of overlap-removal techniques on SDP performance, many open issues remain. Therefore, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an approach for identifying overlapping instances by analyzing the class distribution in the local neighborhood of a given instance. We then investigate the impact of class overlap and of two common overlapping-instance handling techniques on the performance and interpretation of seven representative SDP models. Through an extensive case study on 230 diverse datasets, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models; iii) class overlap affects the ranking of important features in SDP models, particularly the features at the top 2 and top 3 ranks; iv) class overlap handling techniques can statistically significantly improve the performance of SDP models trained on datasets with overlap ratios over 12.5%. We suggest that future work apply our KNN method to identify the overlap ratios of datasets before building SDP models.
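The abstract describes identifying overlapping instances from the class distribution in each instance's local neighborhood. The sketch below illustrates that general idea only and is not the paper's actual procedure: the function name `overlap_ratio`, the Euclidean distance, the neighborhood size `k=5`, and the rule "flag an instance when at least half of its neighbors belong to the other class" (`threshold=0.5`) are all assumptions made for illustration.

```python
import numpy as np

def overlap_ratio(X, y, k=5, threshold=0.5):
    """Flag instances whose k nearest neighbors are mostly of the other class.

    Hypothetical helper: k, threshold, and the Euclidean metric are
    illustrative assumptions, not values taken from the paper.
    Returns (boolean mask of flagged instances, fraction flagged).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all instances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # An instance must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of each instance.
    nn = np.argsort(dist, axis=1)[:, :k]
    # Fraction of neighbors whose label differs from the instance's label.
    other_class = (y[nn] != y[:, None]).mean(axis=1)
    flagged = other_class >= threshold
    return flagged, float(flagged.mean())
```

Under these assumptions, the returned ratio could be compared against the 12.5% level mentioned in the abstract to decide whether an overlap handling technique is worth trying before a model is trained. The dense pairwise distance matrix keeps the sketch short but uses memory quadratic in the number of instances.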
Pages: 2440 - 2458
Number of pages: 19
Related papers
50 in total
  • [21] Empirical investigation of hyperparameter optimization for software defect count prediction
    Nevendra, Meetesh
    Singh, Pradeep
    Expert Systems with Applications, 2022, 191
  • [22] Combat with Class Overlapping in Software Defect Prediction Using Neighbourhood Metric
    Gupta S.
    Richa
    Kumar R.
    Jain K.L.
    SN Computer Science, 4 (5)
  • [23] An Ensemble Oversampling Model for Class Imbalance Problem in Software Defect Prediction
    Huda, Shamsul
    Liu, Kevin
    Abdelrazek, Mohamed
    Ibrahim, Amani
    Alyahya, Sultan
    Al-Dossari, Hmood
    Ahmad, Shafiq
    IEEE ACCESS, 2018, 6 : 24184 - 24195
  • [24] A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction
    Dar, Abdul Waheed
    Farooq, Sheikh Umar
    INTERNATIONAL JOURNAL OF SOFTWARE SCIENCE AND COMPUTATIONAL INTELLIGENCE-IJSSCI, 2022, 14 (01):
  • [25] Adapting God Class thresholds for software defect prediction: A case study
    Gradisnik, Mitja
    Beranic, Tina
    Karakatic, Saso
    Mausa, Goran
    2019 42ND INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2019, : 1537 - 1542
  • [26] Assessing the Significant Impact of Concept Drift in Software Defect Prediction
    Kabir, Md Alamgir
    Keung, Jacky W.
    Bennin, Kwabena E.
    Zhang, Miao
    2019 IEEE 43RD ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2019, : 53 - 58
  • [27] The limited impact of individual developer data on software defect prediction
    Robert M. Bell
    Thomas J. Ostrand
    Elaine J. Weyuker
    Empirical Software Engineering, 2013, 18 : 478 - 505
  • [28] The limited impact of individual developer data on software defect prediction
    Bell, Robert M.
    Ostrand, Thomas J.
    Weyuker, Elaine J.
    EMPIRICAL SOFTWARE ENGINEERING, 2013, 18 (03) : 478 - 505
  • [29] Revisiting the Impact of Dependency Network Metrics on Software Defect Prediction
    Gong, Lina
    Rajbahadur, Gopi Krishnan
    Hassan, Ahmed E.
    Jiang, Shujuan
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (12) : 5030 - 5049
  • [30] Impact of Using Information Gain in Software Defect Prediction Models
    Rana, Zeeshan Ali
    Awais, Mian M.
    Shamail, Shafay
    INTELLIGENT COMPUTING THEORY, 2014, 8588 : 637 - 648