Statistical Validity of Neural-Net Benchmarks

IEEE Open Journal of the Computer Society. Pub Date: 2024-12-26. DOI: 10.1109/OJCS.2024.3523183

Alain Hadges; Srikar Bellur

Abstract

Claims of better, faster, or more efficient neural-net designs often hinge on low single-digit percentage improvements (or less) in accuracy or speed relative to competing designs. Benchmark comparisons have been based on a number of different metrics, such as recall, the best of five runs, the median of five runs, Top-1, Top-5, BLEU, ROC, and RMS. Comparing designs on these metrics implicitly asserts that the metric distributions are comparable. Conspicuous by their absence are measures of the statistical validity of these benchmark comparisons. This study examined the distributions of neural-net benchmark metrics and determined that there are researcher degrees of freedom that may affect comparison validity. An essay is developed and proposed for benchmarking and comparing reasonably expected neural-net performance metrics in a way that minimizes researcher degrees of freedom. The essay includes an estimate of the effects and interactions of hyper-parameter settings on the benchmark metrics of a neural net, as a measure of its optimization complexity.
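
As a rough illustration of the concern raised in the abstract, the following Python sketch contrasts two of the summary statistics it names (best of five runs, median of five runs) with a simple significance check. It is not the method proposed in the paper: the permutation test is a generic stand-in, the function names `summarize` and `permutation_p_value` are hypothetical, and the accuracy values are invented for illustration only.

```python
import random
import statistics


def summarize(runs):
    """Return best-of-N and median-of-N for a list of run scores."""
    return {"best": max(runs), "median": statistics.median(runs)}


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Generic stand-in for a validity check; not the paper's procedure.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(statistics.fmean(left) - statistics.fmean(right))
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical Top-1 accuracies (%) from five seeds per model (invented data).
model_a = [76.1, 75.8, 76.4, 75.9, 76.0]
model_b = [76.3, 75.7, 76.9, 75.6, 76.1]

print("A:", summarize(model_a))
print("B:", summarize(model_b))
print("p-value:", permutation_p_value(model_a, model_b))
```

With these invented numbers, model B leads by half a point on best-of-five, yet the mean difference is small relative to the run-to-run spread. Which summary statistic to report is one example of a researcher degree of freedom.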

