CLAMI: Defect Prediction on Unlabeled Datasets

Cited by: 140
Authors
Nam, Jaechang [1]
Kim, Sunghun [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China
Keywords
STATIC CODE ATTRIBUTES; SOFTWARE; FAULTS; SELECTION; METRICS
DOI
10.1109/ASE.2015.56
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering, largely because it is difficult to collect the defect information needed to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built from other projects that have enough historical data. However, CPDP does not always build a strong prediction model because of the different distributions among datasets. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they have one major limitation: the need for manual effort. In this study, we propose novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner, without the need for manual effort. The key idea of the CLA and CLAMI approaches is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach achieved promising prediction performance, 0.636 and 0.723 in average f-measure and AUC respectively, comparable to that of defect prediction based on supervised learning.
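The labeling idea in the abstract can be illustrated with a small sketch. The abstract only states that labels come from the magnitude of metric values; the per-metric median cutoff, the counting score, and the "upper half of score groups" rule below are assumptions made for illustration, and the function name `label_by_metric_magnitude` is hypothetical. It should be read as one plausible reading of the idea, not as the authors' exact procedure.

```python
from statistics import median


def label_by_metric_magnitude(rows):
    """Assign provisional labels to unlabeled instances from metric magnitudes.

    rows: list of equal-length lists of numeric metric values, one per instance.
    Returns a list of booleans; True means "provisionally defect-prone".
    """
    if not rows:
        return []
    n_metrics = len(rows[0])
    # Assumption: the cutoff for each metric is that metric's median.
    cutoffs = [median(row[j] for row in rows) for j in range(n_metrics)]
    # Score each instance by how many of its metrics exceed their cutoff.
    scores = [
        sum(1 for j in range(n_metrics) if row[j] > cutoffs[j]) for row in rows
    ]
    # Assumption: group instances by score and flag the upper half of the
    # distinct score groups. With a single group, nothing is flagged.
    distinct = sorted(set(scores))
    upper = set(distinct[(len(distinct) + 1) // 2:])
    return [score in upper for score in scores]


if __name__ == "__main__":
    # Four toy instances with three metrics each (made-up values).
    toy = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [10, 10, 10]]
    print(label_by_metric_magnitude(toy))
```

No defect history is consulted at any point, which is what makes a scheme of this kind usable on a project with no labeled data. The provisional labels could then serve as training data for an ordinary classifier.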
Pages: 452-463
Page count: 12