An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Cited by: 76
Authors
Malhotra, Ruchika [1 ]
Kamal, Shine [1 ]
Affiliations
[1] Delhi Technol Univ, Dept Comp Sci & Engn, Discipline Software Engn, Delhi, India
Keywords
Defect prediction; Imbalanced data; Oversampling methods; MetaCost learners; Machine learning techniques; Procedural metrics; SAMPLING APPROACH; NEURAL-NETWORKS; CLASSIFICATION; SMOTE; QUALITY;
DOI
10.1016/j.neucom.2018.04.090
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software defect prediction is important for identifying defects in the early phases of the software development life cycle. Early identification, and thereby removal, of software defects is crucial to yielding a cost-effective, good-quality software product. Although previous studies have successfully used machine learning techniques for software defect prediction, these techniques yield biased results when applied to imbalanced datasets. An imbalanced dataset has a non-uniform class distribution, with very few instances of one class compared to the other. Use of imbalanced datasets leads to inaccurate predictions of the minority class, which is generally considered more important than the majority class. Thus, handling imbalanced data effectively is crucial for developing a competent defect prediction model. This study evaluates the effectiveness of machine learning classifiers for software defect prediction on twelve imbalanced NASA datasets by applying sampling methods and cost-sensitive classifiers. We investigate five existing oversampling methods, which replicate instances of the minority class, and also propose a new method, SPIDER3, by suggesting modifications to the SPIDER2 oversampling method. Furthermore, the work evaluates the performance of MetaCost learners for cost-sensitive learning on imbalanced datasets. The results show that oversampling methods improve the prediction capability of machine learning classifiers, and the proposed SPIDER3 method shows promising results. (C) 2019 Elsevier B.V. All rights reserved.
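As a rough illustration of the general idea the abstract mentions (oversampling by replicating minority-class instances), the sketch below shows plain random oversampling in Python. It is not SPIDER2, SPIDER3, or any specific method evaluated in the paper; the function name random_oversample, the toy metric values, and the 8-to-2 class split are all hypothetical and chosen only for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (plain random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Append copies until this class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 non-defective modules (label 0) and
# 2 defective modules (label 1), each with two made-up metric values.
rows = [(10, 1), (12, 1), (9, 2), (11, 1), (14, 2),
        (8, 1), (13, 2), (10, 2), (40, 9), (55, 12)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

After the call, both classes contain eight rows, and the original ten rows are kept unchanged at the front of the result. In practice, resampling of this kind is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.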
Pages: 120 - 140
Number of pages: 21
Related papers
50 in total
  • [31] Applying Weighted Particle Swarm Optimization to Imbalanced Data in Software Defect Prediction
    Brezocnik, Lucija
    Podgorelec, Vili
    NEW TECHNOLOGIES, DEVELOPMENT AND APPLICATION, 2019, 42 : 289 - 296
  • [32] An Empirical Study on the Stability of Explainable Software Defect Prediction
    Shin, Jiho
    Aleithan, Reem
    Nam, Jaechang
    Wang, Junjie
    Harzevili, Nima Shiri
    Wang, Song
    PROCEEDINGS OF THE 2023 30TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE, APSEC 2023, 2023, : 141 - 150
  • [33] Empirical Study of Software Defect Prediction: A Systematic Mapping
    Le Hoang Son
    Pritam, Nakul
    Khari, Manju
    Kumar, Raghvendra
    Pham Thi Minh Phuong
    Pham Huy Thong
    SYMMETRY-BASEL, 2019, 11 (02):
  • [34] Oversampling Methods Combined Clustering and Data Cleaning for Imbalanced Network Data
    Yang, Yang
    Zhao, Qian
    Ruan, Linna
    Gao, Zhipeng
    Huo, Yonghua
    Qiu, Xuesong
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2020, 26 (05) : 1139 - 1155
  • [35] Online Defect Prediction for Imbalanced Data
    Tan, Ming
    Tan, Lin
    Dara, Sashank
    Mayeux, Caleb
    2015 IEEE/ACM 37TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, VOL 2, 2015, : 99 - 108
  • [36] An Empirical Study on Software Defect Prediction Using Over-Sampling by SMOTE
    Pak, Cholmyong
    Wang, Tian Tian
    Su, Xiao Hong
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2018, 28 (06) : 811 - 830
  • [37] Comparative Study on Defect Prediction Algorithms of Supervised Learning Software Based on Imbalanced Classification Data Sets
    Ge, Jianxin
    Liu, Jiaomin
    Liu, Wenyuan
    2018 19TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2018, : 399 - 406
  • [38] Software defect prediction: A study on software metrics using statistical and machine learning methods
    Canaparo, Marco
    Ronchieri, Elisabetta
    Bertaccini, Gianluca
    INTERNATIONAL SYMPOSIUM ON GRIDS & CLOUDS 2022, 2022,
  • [39] Improving Imbalanced Dataset Classification Using Oversampling and Gradient Boosting
    Cahyana, Nurheri
    Khomsah, Siti
    Aribowo, Agus Sasmito
    2019 5TH INTERNATIONAL CONFERENCE ON SCIENCE IN INFORMATION TECHNOLOGY (ICSITECH): EMBRACING INDUSTRY 4.0 - TOWARDS INNOVATION IN CYBER PHYSICAL SYSTEM, 2019, : 217 - 222
  • [40] Oversampling Methods for Classification of Imbalanced Breast Cancer Malignancy Data
    Krawczyk, Bartosz
    Jelen, Lukasz
    Krzyzak, Adam
    Fevens, Thomas
    COMPUTER VISION AND GRAPHICS, 2012, 7594 : 483 - 490