CLAMI: Defect Prediction on Unlabeled Datasets

Cited by: 140
Authors
Nam, Jaechang [1]
Kim, Sunghun [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China
Keywords
STATIC CODE ATTRIBUTES; SOFTWARE; FAULTS; SELECTION; METRICS
DOI
10.1109/ASE.2015.56
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering, largely because it is difficult to collect the defect information needed to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built on other projects that have enough historical data. However, CPDP does not always build a strong prediction model because of the different distributions among datasets. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they have one major limitation: the need for manual effort. In this study, we propose two novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner, without the need for manual effort. The key idea of CLA and CLAMI is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach achieved promising prediction performance, 0.636 in average f-measure and 0.723 in average AUC, comparable to that of defect prediction based on supervised learning.
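The abstract gives only the key idea (labeling by the magnitude of metric values), not the procedure. The following Python sketch illustrates one way that idea could be realized. It is not the authors' implementation: the per-metric median cutoff, the "count of metrics above the cutoff" score, the top-half split, and the function name `label_by_metric_magnitude` are all assumptions made for illustration.

```python
from statistics import median


def label_by_metric_magnitude(rows):
    """Label instances as defect-prone (True) or clean (False).

    rows: list of equal-length lists of numeric metric values,
          one inner list per source file or module.
    Assumption: higher metric values indicate higher defect-proneness.
    """
    if not rows:
        return []
    n_metrics = len(rows[0])

    # Assumed cutoff: the median of each metric across all instances.
    cutoffs = [median(row[j] for row in rows) for j in range(n_metrics)]

    # Score each instance by how many of its metrics exceed the cutoff.
    scores = [
        sum(1 for j in range(n_metrics) if row[j] > cutoffs[j])
        for row in rows
    ]

    # Group instances by score and treat the upper half of the distinct
    # score values as defect-prone (assumed split). If every instance has
    # the same score, this sketch labels all of them defect-prone.
    distinct = sorted(set(scores))
    top = set(distinct[len(distinct) // 2:])
    return [s in top for s in scores]


# Hypothetical metric values (e.g. lines of code, complexity, churn).
example = [[1, 1, 1], [2, 2, 2], [10, 20, 30], [11, 25, 40]]
labels = label_by_metric_magnitude(example)
```

In this sketch no labeled training data and no manual inspection are required; the labels come entirely from the metric values of the unlabeled dataset itself.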
Pages: 452-463
Page count: 12