Improving Software Defect Prediction in Noisy Imbalanced Datasets

Cited by: 4
Authors
Shi, Haoxiang [1]
Ai, Jun [1]
Liu, Jingyu [1]
Xu, Jiaxi [2]
Affiliations
[1] Beihang Univ, Sch Reliabil & Syst Engn, Beijing 100191, Peoples R China
[2] China Elect Prod Reliabil & Environm Testing Res I, Guangzhou 510610, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 18
Keywords
software defect prediction; class imbalance; undersampling; propensity score matching; oversampling; noise reduction; SAMPLING METHOD; SMOTE;
DOI
10.3390/app131810466
Chinese Library Classification
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Software defect prediction is a popular method for optimizing software testing and improving software quality and reliability. However, software defect datasets usually have quality problems, such as class imbalance and data noise. Oversampling by generating minority class samples is one of the best-known methods for improving the quality of datasets; however, it often introduces overfitting noise into datasets. To better improve the quality of these datasets, this paper proposes a method called US-PONR, which uses undersampling to remove duplicate samples from version iterations and then uses oversampling through propensity score matching to reduce class imbalance and noise samples in datasets. The effectiveness of this method was validated in a software defect prediction experiment that involved 24 versions of software data in 11 projects from PROMISE, in noisy environments with noise levels ranging from 0% to 30%. The experiments showed a significant improvement in the quality of noisy imbalanced datasets pre-processed by US-PONR, especially the noisiest ones, compared with 12 other advanced dataset processing methods. The experiments also demonstrated that the US-PONR method can effectively identify label noise samples and remove them.
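The abstract names two ingredients that are easy to picture with a small example: removing samples duplicated across version iterations, and evaluating under a controlled label noise level. The following is a minimal sketch under stated assumptions, not the US-PONR implementation. The abstract does not say how duplicates are identified or how noise is injected, so the sketch assumes that a duplicate is an exact (features, label) match with an earlier version and that noise means flipping a fixed fraction of binary labels. The function names `drop_cross_version_duplicates` and `inject_label_noise` and the toy metric values are hypothetical. The propensity score matching step is not shown.

```python
import random


def drop_cross_version_duplicates(versions):
    """Keep only the first occurrence of each (features, label) sample.

    `versions` is a list of datasets ordered oldest to newest. Each dataset
    is a list of (features, label) pairs, with features given as a tuple.
    Assumption: exact equality defines a duplicate.
    """
    seen = set()
    deduplicated = []
    for samples in versions:
        kept = []
        for features, label in samples:
            key = (tuple(features), label)
            if key not in seen:
                seen.add(key)
                kept.append((features, label))
        deduplicated.append(kept)
    return deduplicated


def inject_label_noise(samples, noise_level, seed=0):
    """Flip the binary label of a fraction `noise_level` of the samples.

    Assumption: labels are 0 (clean) or 1 (defective), and the samples to
    flip are drawn uniformly at random without replacement.
    """
    rng = random.Random(seed)
    n_flips = round(noise_level * len(samples))
    flip_indices = set(rng.sample(range(len(samples)), n_flips))
    return [
        (features, 1 - label if i in flip_indices else label)
        for i, (features, label) in enumerate(samples)
    ]


# Toy data: two versions of one project, features are (lines of code, complexity).
v1 = [((10, 2), 0), ((35, 7), 1), ((12, 3), 0)]
v2 = [((10, 2), 0), ((35, 7), 1), ((40, 9), 1), ((8, 1), 0)]
cleaned = drop_cross_version_duplicates([v1, v2])

# Ten synthetic samples with a 30% noise level, the upper end of the stated range.
base = [((i, i), i % 2) for i in range(10)]
noisy = inject_label_noise(base, 0.30)
```

In this toy case the second version retains only the two samples that did not already appear in the first, and exactly three of the ten labels in `noisy` differ from `base`.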
Pages: 24