RandALO: Out-of-sample risk estimation in no time flat

arXiv - STAT - Statistics Theory. Pub Date: 2024-09-15. DOI: arxiv-2409.09781

Parth T. Nobel, Daniel LeJeune, Emmanuel J. Candès

Abstract

Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
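To make the trade-off in the abstract concrete, here is a minimal sketch for ridge regression using NumPy only. It is not the RandALO algorithm from the paper and does not use the `randalo` package API. It compares three things: brute-force leave-one-out CV (n refits), the classical closed-form leave-one-out shortcut for ridge (one fit plus the diagonal of the hat matrix), and a plain Hutchinson-style randomized estimate of that diagonal built from a small number of Rademacher probe vectors. The function names, the problem sizes, and the clipping of the estimated diagonal are assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients without an intercept: (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def loo_risk_brute_force(X, y, lam):
    """Leave-one-out squared-error risk computed by refitting n times."""
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = ridge_fit(X[keep], y[keep], lam)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return errs.mean()


def loo_risk_shortcut(X, y, lam):
    """Exact ridge leave-one-out risk from a single fit.

    Uses the identity: LOO residual_i = residual_i / (1 - H_ii),
    where H = X (X^T X + lam I)^{-1} X^T is the hat matrix.
    """
    p = X.shape[1]
    G = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # shape (p, n)
    h = np.einsum("ij,ji->i", X, G)                      # diag(X @ G)
    resid = y - X @ (G @ y)
    return np.mean((resid / (1.0 - h)) ** 2)


def loo_risk_randomized(X, y, lam, num_probes, rng):
    """Same shortcut, but diag(H) is estimated from random probes.

    diag(H) is approximated by the average of z * (H z) over Rademacher
    vectors z. This is a generic diagonal estimator, used here only to
    illustrate the idea of randomization; it is noisy for few probes.
    """
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    Z = rng.choice([-1.0, 1.0], size=(n, num_probes))
    HZ = X @ np.linalg.solve(A, X.T @ Z)   # H applied to every probe
    h_est = np.mean(Z * HZ, axis=1)
    h_est = np.clip(h_est, 0.0, 0.99)      # assumption: guard against noisy values near 1
    resid = y - X @ np.linalg.solve(A, X.T @ y)
    return np.mean((resid / (1.0 - h_est)) ** 2)


# Small synthetic demo (sizes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
n, p, lam = 500, 30, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

risk_brute = loo_risk_brute_force(X, y, lam)
risk_shortcut = loo_risk_shortcut(X, y, lam)
risk_rand = loo_risk_randomized(X, y, lam, num_probes=50, rng=rng)
print(risk_brute, risk_shortcut, risk_rand)
```

In this sketch the shortcut reproduces brute-force leave-one-out up to floating-point error, while the randomized version replaces the exact diagonal with an estimate from 50 probes. For ridge with small p the exact diagonal is already cheap, so the randomized variant matters mainly when the hat matrix can only be accessed through matrix-vector products.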
