Metrics of calibration for probabilistic predictions

J. Mach. Learn. Res. | Pub Date: 2022-05-19 | DOI: 10.48550/arXiv.2205.09680

Imanol Arrieta Ibarra, Paman Gujral, Jonathan Tannen, M. Tygert, Cherie Xu

Pages: 351:1-351:54 | Publication type: Journal Article | Indexed via: Semantic Scholar

Citations: 11

Abstract

Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose statistically significant discrepancies (so-called "miscalibration") between the predictions and the outcomes. The canonical reliability diagrams histogram the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation is another common practice. But which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram as a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph that deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation must unavoidably trade off statistical confidence against the ability to resolve variations as a function of the predicted probability. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise.
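The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It assumes unweighted binary outcomes and NumPy. The function names (`cumulative_differences`, `max_abs_deviation`, `range_deviation`, `binned_ece`) are hypothetical, and the two scalar summaries of the cumulative graph are illustrative choices rather than necessarily the exact definitions used in the paper.

```python
import numpy as np


def cumulative_differences(probs, outcomes):
    """Cumulative (observed - expected) differences, ordered by prediction.

    Returns the abscissae k/n (k = 0..n) and the cumulative sums. The slope
    of a secant line between two points of this graph is the average
    miscalibration over the corresponding range of predictions.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    order = np.argsort(probs, kind="stable")
    diffs = (outcomes[order] - probs[order]) / n
    # Prepend 0 so that the graph starts at the origin.
    cum = np.concatenate(([0.0], np.cumsum(diffs)))
    return np.arange(n + 1) / n, cum


def max_abs_deviation(cum):
    """Largest absolute deviation of the cumulative graph from zero."""
    return float(np.max(np.abs(cum)))


def range_deviation(cum):
    """Spread (maximum minus minimum) of the cumulative graph."""
    return float(np.max(cum) - np.min(cum))


def binned_ece(probs, outcomes, n_bins):
    """Conventional binned calibration error with equal-width bins.

    Unlike the cumulative summaries above, the result depends on n_bins.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Map each prediction in [0, 1] to a bin index in [0, n_bins - 1].
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(outcomes[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of points in the bin
    return float(ece)


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): one calibrated set of
    # outcomes and one whose true frequencies exceed the predictions by 0.1.
    rng = np.random.default_rng(0)
    p = rng.uniform(size=10_000)
    y_good = rng.uniform(size=p.size) < p
    y_bad = rng.uniform(size=p.size) < np.clip(p + 0.1, 0.0, 1.0)
    for name, y in (("calibrated", y_good), ("miscalibrated", y_bad)):
        _, cum = cumulative_differences(p, y)
        print(name, max_abs_deviation(cum), range_deviation(cum),
              [binned_ece(p, y, m) for m in (2, 10, 100)])
```

In this sketch the cumulative summaries need no tuning parameter, whereas `binned_ece` generally returns a different value for each choice of `n_bins`, which is the trade-off between noise averaging and resolving power that the abstract describes.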


