A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

Cited by: 21

Authors:

Gong, Lina [1]

Zhang, Haoxiang [2]

Zhang, Jingxuan [1]

Wei, Mingqiang [1]

Huang, Zhiqiu [1]

Affiliations:

[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210095, Jiangsu, Peoples R China

[2] Queens Univ, Sch Comp, Software Anal & Intelligence Lab SAIL, Kingston, ON K7L 3N6, Canada

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2023 / Vol. 49 / No. 04

Funding:

National Natural Science Foundation of China;

Keywords:

Class overlap; data quality; k-nearest neighbourhood; local analysis; software defect prediction; software metrics; FALSE DISCOVERY RATE; CLASSIFICATION; CLASSIFIERS; MACHINE; ERROR;

DOI:

10.1109/TSE.2022.3220740

Chinese Library Classification number:

TP31 [Computer Software];

Subject classification code:

081202 ; 0835 ;

Abstract:

Software Defect Prediction (SDP) is one of the most vital and cost-efficient operations to ensure software quality. However, SDP datasets exhibit class overlap (i.e., defective and non-defective modules have similar metric values), which hinders both the performance and the use of SDP models. Although efforts have been made to investigate the impact of overlap removal techniques on the performance of SDP, many open issues remain unresolved. Therefore, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an overlapping instance identification approach that analyzes the class distribution in the local neighborhood of a given instance. We then investigate the impact of class overlap and two common overlapping instance handling techniques on the performance and the interpretation of seven representative SDP models. Through an extensive case study on 230 diverse datasets, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models; iii) class overlap affects the rank of the important feature list of SDP models, particularly the feature lists at the top 2 and top 3 ranks; iv) class overlap handling techniques could statistically significantly improve the performance of SDP models trained on datasets with overlap ratios above 12.5%. We suggest that future work should apply our KNN method to identify the overlap ratios of datasets before building SDP models.
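
The local-neighborhood idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact decision rule, so the function names `find_overlapping` and `overlap_ratio`, the Euclidean distance, the default `k`, and the `min_opposite_fraction` threshold are all assumptions made for illustration. An instance is flagged as overlapping when at least the chosen fraction of its k nearest neighbours carries the other class label.

```python
import math


def find_overlapping(X, y, k=5, min_opposite_fraction=0.5):
    """Return indices of instances whose k nearest neighbours are
    mostly of the other class (hypothetical rule, not from the paper)."""
    if not 0 < k < len(X):
        raise ValueError("k must be between 1 and len(X) - 1")
    flagged = []
    for i, xi in enumerate(X):
        # Euclidean distance from instance i to every other instance.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Count neighbours whose label differs from instance i's label.
        opposite = sum(1 for j in neighbours if y[j] != y[i])
        if opposite / k >= min_opposite_fraction:
            flagged.append(i)
    return flagged


def overlap_ratio(X, y, k=5, min_opposite_fraction=0.5):
    """Share of instances flagged as overlapping (assumed definition)."""
    return len(find_overlapping(X, y, k, min_opposite_fraction)) / len(X)


if __name__ == "__main__":
    # Toy data: one defective module (label 1) sits inside the clean cluster.
    X = [(0, 0), (0, 1), (1, 0), (0.1, 0.1),
         (10, 10), (10, 11), (11, 10), (11, 11)]
    y = [0, 0, 0, 1, 1, 1, 1, 1]
    print(find_overlapping(X, y, k=3))
    print(overlap_ratio(X, y, k=3))
```

In practice the metric values would likely need scaling before distances are computed, and the brute-force neighbour search here is quadratic in the number of instances; both are simplifications of this sketch.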


Pages: 2440 - 2458

Number of pages: 19

50 items in total

[21] Empirical investigation of hyperparameter optimization for software defect count prediction
Nevendra, Meetesh
Singh, Pradeep
Expert Systems with Applications, 2022, 191
[22] Combat with Class Overlapping in Software Defect Prediction Using Neighbourhood Metric
Gupta S.
Richa
Kumar R.
Jain K.L.
SN Computer Science, 4 (5)
[23] An Ensemble Oversampling Model for Class Imbalance Problem in Software Defect Prediction
Huda, Shamsul
Liu, Kevin
Abdelrazek, Mohamed
Ibrahim, Amani
Alyahya, Sultan
Al-Dossari, Hmood
Ahmad, Shafiq
IEEE ACCESS, 2018, 6 : 24184 - 24195
[24] A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction
Dar, Abdul Waheed
Farooq, Sheikh Umar
INTERNATIONAL JOURNAL OF SOFTWARE SCIENCE AND COMPUTATIONAL INTELLIGENCE-IJSSCI, 2022, 14 (01)
[25] Adapting God Class thresholds for software defect prediction: A case study
Gradisnik, Mitja
Beranic, Tina
Karakatic, Saso
Mausa, Goran
2019 42ND INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2019, : 1537 - 1542
[26] Assessing the Significant Impact of Concept Drift in Software Defect Prediction
Kabir, Md Alamgir
Keung, Jacky W.
Bennin, Kwabena E.
Zhang, Miao
2019 IEEE 43RD ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2019, : 53 - 58
[27] The limited impact of individual developer data on software defect prediction
Robert M. Bell
Thomas J. Ostrand
Elaine J. Weyuker
Empirical Software Engineering, 2013, 18 : 478 - 505
[28] The limited impact of individual developer data on software defect prediction
Bell, Robert M.
Ostrand, Thomas J.
Weyuker, Elaine J.
EMPIRICAL SOFTWARE ENGINEERING, 2013, 18 (03) : 478 - 505
[29] Revisiting the Impact of Dependency Network Metrics on Software Defect Prediction
Gong, Lina
Rajbahadur, Gopi Krishnan
Hassan, Ahmed E.
Jiang, Shujuan
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (12) : 5030 - 5049
[30] Impact of Using Information Gain in Software Defect Prediction Models
Rana, Zeeshan Ali
Awais, Mian M.
Shamail, Shafay
INTELLIGENT COMPUTING THEORY, 2014, 8588 : 637 - 648
