On the Reliability of the Area Under the ROC Curve in Empirical Software Engineering

Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. Pub Date: 2023-06-14. DOI: 10.1145/3593434.3593456

L. Lavazza, S. Morasca, Gabriele Rotoloni


Citations: 2

Abstract

Binary classifiers are commonly used in software engineering research to estimate software qualities such as defectiveness or vulnerability. It is therefore important to evaluate adequately how well binary classifiers perform before they are used in practice. The Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curves has often been used to this end. However, AUC has been criticized, so it is necessary to evaluate under what conditions and to what extent AUC can be a reliable performance metric. We analyze AUC in relation to ϕ (also known as the Matthews Correlation Coefficient), often considered a more reliable performance metric, by building the lines in the ROC space with constant value of ϕ, for several values of ϕ, and computing the corresponding values of AUC. By their very definitions, AUC and ϕ depend on the prevalence ρ of a dataset, which is the proportion of its positive instances (e.g., the defective software modules). Hence, so does the relationship between AUC and ϕ. It turns out that AUC and ϕ are very well correlated, and therefore provide concordant indications, for balanced datasets (those with ρ ≃ 0.5). In contrast, AUC tends to become quite large, and hence to provide over-optimistic indications, for very imbalanced datasets (those with ρ ≃ 0 or ρ ≃ 1). We use examples from the software engineering literature to illustrate the analytical relationship linking AUC, ϕ, and ρ. We show that, for some values of ρ, an evaluation of performance based exclusively on AUC can be misleading. In conclusion, this paper provides guidelines for an informed usage and interpretation of AUC.
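The dependence of ϕ on the prevalence ρ can be illustrated with a small Python sketch. It is not the authors' construction of constant-ϕ lines. It only evaluates the standard definition of ϕ at a single point (FPR, TPR) of the ROC space, rewritten in terms of ρ. The function names are hypothetical, and the single-point AUC (the area under the polyline through (0,0), (FPR, TPR), and (1,1)) is a simplifying assumption made here for illustration.

```python
import math


def phi_from_roc_point(fpr: float, tpr: float, rho: float) -> float:
    """Matthews Correlation Coefficient at ROC point (fpr, tpr) for prevalence rho.

    With a normalized dataset size, the confusion matrix cells are
    TP = rho*tpr, FN = rho*(1-tpr), FP = (1-rho)*fpr, TN = (1-rho)*(1-fpr).
    Substituting into the usual MCC formula gives the expression below.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # Proportion of instances predicted positive.
    predicted_positive = rho * tpr + (1.0 - rho) * fpr
    denom = math.sqrt(predicted_positive * (1.0 - predicted_positive))
    if denom == 0.0:
        # Degenerate classifier (all positive or all negative predictions):
        # phi is undefined; 0.0 is returned here by convention (an assumption).
        return 0.0
    return math.sqrt(rho * (1.0 - rho)) * (tpr - fpr) / denom


def single_point_auc(fpr: float, tpr: float) -> float:
    """Area under the polyline (0,0) -> (fpr, tpr) -> (1,1).

    Illustrative simplification only; it does not depend on rho.
    """
    return (1.0 + tpr - fpr) / 2.0


if __name__ == "__main__":
    fpr, tpr = 0.2, 0.8
    print("single-point AUC:", single_point_auc(fpr, tpr))
    for rho in (0.5, 0.2, 0.05, 0.01):
        print(f"rho={rho}: phi={phi_from_roc_point(fpr, tpr, rho):.3f}")
```

Under these assumptions, the same ROC point yields the same single-point AUC for every ρ, while ϕ decreases as ρ moves away from 0.5, which is consistent with the over-optimism the abstract describes for very imbalanced datasets.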

