Comparing φ and the F-measure as performance metrics for software-related classifications

Cited by: 10
Authors
Lavazza, Luigi [1]
Morasca, Sandro [1]
Affiliations
[1] Univ Insubria, Dipartimento Sci Teor & Applicate, Varese, Italy
Keywords
Binary classification; Software defect prediction; Performance evaluation; Performance metrics; Matthews Correlation Coefficient; F-measure; F-score
DOI
10.1007/s10664-022-10199-2
Chinese Library Classification (CLC) code
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Context: The F-measure has been widely used as a performance metric when selecting binary classifiers for prediction, but it has also been widely criticized, especially given the availability of alternatives such as φ (also known as the Matthews Correlation Coefficient).
Objectives: Our goals are to (1) investigate possible issues related to the F-measure in depth and show how φ can address them, and (2) explore the relationships between the F-measure and φ.
Method: Based on the definitions of φ and the F-measure, we derive a few mathematical properties of these two performance metrics and of the relationships between them. To demonstrate the practical effects of these mathematical properties, we illustrate the outcomes of an empirical study involving 70 Empirical Software Engineering datasets and 837 classifiers.
Results: We show that φ can be defined as a function of Precision and Recall, which are the only two performance metrics used to define the F-measure, and of the rate of actually positive software modules in a dataset. Also, φ can be expressed as a function of the F-measure and the rates of actual and estimated positive software modules. We derive the minimum and maximum values of φ for any given value of the F-measure, and the conditions under which the F-measure and φ rank two classifiers in the same order.
Conclusions: Our results show that φ is a sensible and useful metric for assessing the performance of binary classifiers. We also recommend that the F-measure not be used by itself to assess the performance of a classifier; the rate of positives should always be reported as well, at least to assess whether, and to what extent, a classifier performs better than random classification. The mathematical relationships described here can also be used to re-interpret the conclusions of previously published papers that relied mainly on the F-measure as a performance metric.
Pages: 38
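To make the metrics discussed in the abstract concrete, the sketch below gives the standard confusion-matrix definitions of Precision, Recall, the F-measure (F1), and φ (the Matthews Correlation Coefficient), and numerically checks the abstract's claim that φ can be written as a function of Precision, Recall, and the rate of actually positive modules. The closed form used for that check is an independent algebraic rearrangement of the standard definitions, not necessarily the formulation derived in the paper; the function names and the example confusion matrix are illustrative only.

```python
# Minimal sketch (not code from the paper): standard confusion-matrix definitions of
# Precision, Recall, the F-measure (F1) and phi (Matthews Correlation Coefficient),
# plus a numerical check that phi is determined by Precision, Recall and the rate of
# actually positive modules (rho), as stated in the abstract.
import math


def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Precision, Recall, F1, phi and the actual positive rate from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    phi = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    rho = (tp + fn) / (tp + fp + fn + tn)  # rate of actually positive modules
    return {"precision": precision, "recall": recall, "f1": f1, "phi": phi, "rho": rho}


def phi_from_precision_recall_rho(p: float, r: float, rho: float) -> float:
    """phi rewritten as a function of Precision, Recall and rho.

    This is an independent algebraic rearrangement of the confusion-matrix
    definition, not necessarily the exact formulation used in the paper.
    """
    return math.sqrt(r) * (p - rho) / math.sqrt((1 - rho) * (p - r * rho))


if __name__ == "__main__":
    m = metrics(tp=40, fp=10, fn=20, tn=30)  # illustrative confusion matrix
    print(f"F1  = {m['f1']:.3f}")   # ~0.727
    print(f"phi = {m['phi']:.3f}")  # ~0.408
    # Should agree with the direct confusion-matrix computation above.
    print(f"phi from (P, R, rho) = "
          f"{phi_from_precision_recall_rho(m['precision'], m['recall'], m['rho']):.3f}")
```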
Related papers
12 items in total
  • [1] Comparing φ and the F-measure as performance metrics for software-related classifications
    Lavazza, Luigi
    Morasca, Sandro
    EMPIRICAL SOFTWARE ENGINEERING, 2022, 27
  • [2] Common Problems With the Usage of F-Measure and Accuracy Metrics in Medical Research
    Lavazza, Luigi
    Morasca, Sandro
    IEEE ACCESS, 2023, 11 : 51515 - 51526
  • [3] Exemplifying the Effects of Distance Metrics on Clustering Techniques: F-measure, Accuracy and Efficiency
    Nizam, Tasleem
    Hassan, Sayed Imtiyaz
    PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT (INDIACOM-2020), 2019, : 39 - 44
  • [4] F-Measure Curves for Visualizing Classifier Performance with Imbalanced Data
    Soleymani, Roghayeh
    Granger, Eric
    Fumera, Giorgio
    ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, ANNPR 2018, 2018, 11081 : 165 - 177
  • [5] F-measure curves: A tool to visualize classifier performance under imbalance
    Soleymani, Roghayeh
    Granger, Eric
    Fumera, Giorgio
    PATTERN RECOGNITION, 2020, 100
  • [6] On extending F-measure and G-mean metrics to multi-class problems
    Espíndola, RP
    Ebecken, NFF
    Data Mining VI: Data Mining, Text Mining and Their Business Applications, 2005: 25 - 34
  • [7] Linear Approximation of F-Measure for the Performance Evaluation of Classification Algorithms on Imbalanced Data Sets
    Wong, Tzu-Tsung
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (02) : 753 - 763
  • [8] The Effects of Technology-Related and Software-Related Factors on Neurocognitive Baseline Test Performance Using ImPACT
    Schatz, P.
    Cameron, N.
    ARCHIVES OF CLINICAL NEUROPSYCHOLOGY, 2011, 26 (06) : 521 - 521
  • [9] F-measure: A Forecasting-Led Time Series Distance Measure in Large-Scale Forecasting of Video Services Performance
    Zhuo, Yu
    You, Jiali
    Xue, Hanxing
    Wang, Jinlin
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2018, 14 (06): 2175 - 2188
  • [10] Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores
    Hinojosa Lee, Maria Cristina
    Braet, Johan
    Springael, Johan
    APPLIED SCIENCES-BASEL, 2024, 14 (21):