An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Cited by: 76
Authors
Malhotra, Ruchika [1]
Kamal, Shine [1]
Affiliation
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Discipline Software Engn, Delhi, India
Keywords
Defect prediction; Imbalanced data; Oversampling methods; MetaCost learners; Machine learning techniques; Procedural metrics; SAMPLING APPROACH; NEURAL-NETWORKS; CLASSIFICATION; SMOTE; QUALITY;
DOI
10.1016/j.neucom.2018.04.090
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. Identifying and removing defects early is crucial to yielding a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced data sets. An imbalanced data set has a non-uniform class distribution, with very few instances of one class compared with the other. Using imbalanced data sets leads to inaccurate predictions for the minority class, which is generally considered more important than the majority class. Handling imbalanced data effectively is therefore crucial to developing a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA data sets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate instances of the minority class, and propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced data sets. The results show that oversampling methods improve the prediction capability of machine learning classifiers, and the proposed SPIDER3 method shows promising results. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 120 - 140
Page count: 21
Related papers
50 in total
  • [1] Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data
    Benala, Tirimula Rao
    Tantati, Karunya
    INNOVATIONS IN SYSTEMS AND SOFTWARE ENGINEERING, 2023, 19 (03) : 247 - 263
  • [3] Generative Oversampling Methods for Handling Imbalanced Data in Software Fault Prediction
    Rathore, Santosh Singh
    Chouhan, Satyendra Singh
    Jain, Dixit Kumar
    Vachhani, Aakash Gopal
    IEEE TRANSACTIONS ON RELIABILITY, 2022, 71 (02) : 747 - 762
  • [4] Oversampling boosting for classification of imbalanced software defect data
    Li, Guangling
    Wang, Shihai
    PROCEEDINGS OF THE 35TH CHINESE CONTROL CONFERENCE 2016, 2016, : 4149 - 4154
  • [6] An empirical study for software change prediction using imbalanced data
    Malhotra, Ruchika
    Khanna, Megha
    EMPIRICAL SOFTWARE ENGINEERING, 2017, 22 (06) : 2806 - 2851
  • [7] Improving Software Defect Prediction in Noisy Imbalanced Datasets
    Shi, Haoxiang
    Ai, Jun
    Liu, Jingyu
    Xu, Jiaxi
    APPLIED SCIENCES-BASEL, 2023, 13 (18):
  • [8] Tool to Handle Imbalancing Problem in Software Defect Prediction Using Oversampling Methods
    Malhotra, Ruchika
    Kamal, Shine
    2017 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2017, : 906 - 912
  • [9] Handling Imbalanced Data using Ensemble Learning in Software Defect Prediction
    Malhotra, Ruchika
    Jain, Juhi
    PROCEEDINGS OF THE CONFLUENCE 2020: 10TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING, 2020, : 300 - 304
  • [10] Imbalanced Data Processing Model for Software Defect Prediction
    Zhou, Lijuan
    Li, Ran
    Zhang, Shudong
    Wang, Hua
    WIRELESS PERSONAL COMMUNICATIONS, 2018, 102 (02) : 937 - 950