CLAMI: Defect Prediction on Unlabeled Datasets

Cited by: 140

Authors:

Nam, Jaechang [1]

Kim, Sunghun [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China

Source:

2015 30TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE) | 2015

Keywords:

STATIC CODE ATTRIBUTES; SOFTWARE; FAULTS; SELECTION; METRICS;

DOI:

10.1109/ASE.2015.56

Chinese Library Classification:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering, largely because it is difficult to collect the defect information needed to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built from other projects that have enough historical data. However, CPDP does not always build a strong prediction model because the datasets have different distributions. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they have one major limitation: the need for manual effort. In this study, we propose novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner, without manual effort. The key idea of the CLA and CLAMI approaches is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach achieved promising prediction performance, 0.636 and 0.723 in average f-measure and AUC, respectively, comparable to that of defect prediction based on supervised learning.
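The labeling idea in the abstract can be illustrated with a minimal sketch. The abstract says only that labels come from the magnitude of metric values; the details here are assumptions made for illustration, not the authors' exact procedure: a per-metric median as the cutoff, a score that counts how many metrics exceed their cutoff, and the median score as the labeling threshold. The function name `label_by_magnitude` and the sample metric values are hypothetical.

```python
from statistics import median


def label_by_magnitude(rows):
    """Label each instance True (defect-prone) or False using metric magnitudes only.

    Assumption: higher metric values suggest higher defect-proneness.
    """
    n_metrics = len(rows[0])
    # Per-metric cutoff: the median of that metric over all instances.
    cutoffs = [median(row[j] for row in rows) for j in range(n_metrics)]
    # Score: the number of metrics on which the instance exceeds its cutoff.
    scores = [sum(v > c for v, c in zip(row, cutoffs)) for row in rows]
    # Assumed rule: instances scoring above the median score are defect-prone.
    threshold = median(scores)
    return [s > threshold for s in scores]


# Hypothetical metric values per module: lines of code, complexity, fan-out.
modules = [
    [120, 4, 2],
    [950, 31, 14],
    [80, 2, 1],
    [640, 18, 9],
]
labels = label_by_magnitude(modules)
print(labels)  # [False, True, False, True]
```

No defect history is consulted at any point, which is what makes a scheme of this kind applicable to an unlabeled dataset; the resulting labels could then serve as training data for a conventional classifier.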

Pages: 452-463

Page count: 12
