CLAMI: Defect Prediction on Unlabeled Datasets

Cited by: 140
Authors
Nam, Jaechang [1]
Kim, Sunghun [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China
Keywords
STATIC CODE ATTRIBUTES; SOFTWARE; FAULTS; SELECTION; METRICS
DOI
10.1109/ASE.2015.56
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Defect prediction on new projects or projects with limited historical data is an interesting problem in software engineering. This is largely because it is difficult to collect defect information to label a dataset for training a prediction model. Cross-project defect prediction (CPDP) has tried to address this problem by reusing prediction models built by other projects that have enough historical data. However, CPDP does not always build a strong prediction model because of the different distributions among datasets. Approaches for defect prediction on unlabeled datasets have also tried to address the problem by adopting unsupervised learning, but they have one major limitation: the need for manual effort. In this study, we propose novel approaches, CLA and CLAMI, that show the potential for defect prediction on unlabeled datasets in an automated manner without the need for manual effort. The key idea of the CLA and CLAMI approaches is to label an unlabeled dataset by using the magnitude of metric values. In our empirical study on seven open-source projects, the CLAMI approach led to promising prediction performance, 0.636 and 0.723 in average f-measure and AUC respectively, which is comparable to that of defect prediction based on supervised learning.
Pages: 452-463
Page count: 12