CLAMI: Defect Prediction on Unlabeled Datasets

Cited by: 140
Authors
Nam, Jaechang [1]
Kim, Sunghun [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China
Keywords
STATIC CODE ATTRIBUTES; SOFTWARE; FAULTS; SELECTION; METRICS
DOI
10.1109/ASE.2015.56
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering. This is largely because it is difficult to collect defect information to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built from other projects that have enough historical data. However, CPDP does not always build a strong prediction model because of the different distributions among datasets. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they share one major limitation: the need for manual effort. In this study, we propose novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner, without the need for manual effort. The key idea of the CLA and CLAMI approaches is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach achieved promising prediction performance, 0.636 in average f-measure and 0.723 in average AUC, comparable to that of defect prediction based on supervised learning.
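The abstract's key idea, labeling instances by the magnitude of their metric values, can be illustrated with a minimal sketch. The details here are assumptions made for illustration and are not stated in this record: the per-metric median as the cutoff, grouping instances by how many metrics exceed that cutoff, and marking the upper half of those groups as defect-prone. The function name `label_by_metric_magnitude` and the sample data are hypothetical.

```python
from statistics import median


def label_by_metric_magnitude(rows):
    """Return one boolean per instance: True means 'labeled defect-prone'.

    rows: list of equal-length lists of numeric metric values.
    Assumption: a metric value counts as 'high' when it exceeds that
    metric's median over all instances.
    """
    if not rows:
        return []
    n_metrics = len(rows[0])
    # Per-metric cutoff (assumed: the median of that metric's column).
    cutoffs = [median(row[j] for row in rows) for j in range(n_metrics)]
    # Score = number of metrics on which the instance is above the cutoff.
    scores = [
        sum(1 for j in range(n_metrics) if row[j] > cutoffs[j])
        for row in rows
    ]
    # Instances with the same score form a group; the upper half of the
    # distinct scores is labeled defect-prone (assumed split).
    distinct = sorted(set(scores))
    top_half = set(distinct[len(distinct) // 2:])
    return [score in top_half for score in scores]


# Hypothetical metric rows, e.g. [lines of code, complexity, fan-out].
rows = [
    [10, 1, 2],
    [200, 9, 30],
    [15, 2, 3],
    [180, 8, 25],
]
labels = label_by_metric_magnitude(rows)
print(labels)
```

The labels produced this way require no defect history, which is the property the abstract emphasizes; a classifier could then be trained on them as if they were ground truth.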
Pages: 452-463
Page count: 12