Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments
{"title":"Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\\mathbf{X}^\\mathbf{T}\\mathbf{X}$ and $\\mathbf{X}^\\mathbf{T}\\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments","authors":"Ole-Christian Galbo Engstrøm","doi":"arxiv-2401.13185","DOIUrl":null,"url":null,"abstract":"Cross-validation is a widely used technique for assessing the performance of\npredictive models on unseen data. Many predictive models, such as Kernel-Based\nPartial Least-Squares (PLS) models, require the computation of\n$\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$\nusing only training set samples from the input and output matrices,\n$\\mathbf{X}$ and $\\mathbf{Y}$, respectively. In this work, we present three\nalgorithms that efficiently compute these matrices. The first one allows no\ncolumn-wise preprocessing. The second one allows column-wise centering around\nthe training set means. The third one allows column-wise centering and\ncolumn-wise scaling around the training set means and standard deviations.\nDemonstrating correctness and superior computational complexity, they offer\nsignificant cross-validation speedup compared with straight-forward\ncross-validation and previous work on fast cross-validation - all without data\nleakage. Their suitability for parallelization is highlighted with an\nopen-source Python implementation combining our algorithms with Improved Kernel\nPLS.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.13185","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Cross-validation is a widely used technique for assessing the performance of predictive models on unseen data. Many predictive models, such as Kernel-Based Partial Least-Squares (PLS) models, require the computation of $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ using only training set samples from the input and output matrices, $\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three algorithms that efficiently compute these matrices. The first allows no column-wise preprocessing. The second allows column-wise centering around the training set means. The third allows column-wise centering and column-wise scaling around the training set means and standard deviations. We demonstrate their correctness and superior computational complexity: they offer a significant cross-validation speedup compared with straightforward cross-validation and previous work on fast cross-validation, all without data leakage. Their suitability for parallelization is highlighted with an open-source Python implementation combining our algorithms with Improved Kernel PLS.
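The abstract does not state the update formulas, but the general idea it describes, deriving each fold's training-set $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ (optionally centered and scaled with training-set statistics) from quantities computed once over the full data, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name `fold_train_products` and the choice of sample standard deviations (`ddof=1`) are assumptions made for the example, and the validation-row subtraction plus centering identity are standard linear-algebra facts, not reproduced from the paper.

```python
import numpy as np

def fold_train_products(X, Y, val_idx, center=True, scale=True):
    """Illustrative sketch: derive the training-set X^T X and X^T Y for one
    cross-validation fold by subtracting the validation rows' contribution
    from full-data products, instead of recomputing over the training rows."""
    N = X.shape[0]
    X_val, Y_val = X[val_idx], Y[val_idx]
    n_train = N - len(val_idx)

    # Full-data quantities; in practice these would be computed once and
    # reused across all folds (shown inline here to keep the sketch self-contained).
    XtX, XtY = X.T @ X, X.T @ Y
    x_sum, y_sum = X.sum(axis=0), Y.sum(axis=0)
    x_sq, y_sq = (X ** 2).sum(axis=0), (Y ** 2).sum(axis=0)

    # Training-set products: remove the held-out rows' outer products.
    XtX_t = XtX - X_val.T @ X_val
    XtY_t = XtY - X_val.T @ Y_val
    if not center and not scale:
        return XtX_t, XtY_t

    # Training-set column means from full-data and validation column sums.
    x_mean = (x_sum - X_val.sum(axis=0)) / n_train
    y_mean = (y_sum - Y_val.sum(axis=0)) / n_train

    # Centering identity: X_c^T X_c = X_t^T X_t - n_t * m_x m_x^T,
    # and analogously X_c^T Y_c = X_t^T Y_t - n_t * m_x m_y^T.
    XtX_t = XtX_t - n_train * np.outer(x_mean, x_mean)
    XtY_t = XtY_t - n_train * np.outer(x_mean, y_mean)
    if not scale:
        return XtX_t, XtY_t

    # Training-set sample standard deviations (ddof=1 assumed).
    x_std = np.sqrt((x_sq - (X_val ** 2).sum(axis=0) - n_train * x_mean ** 2)
                    / (n_train - 1))
    y_std = np.sqrt((y_sq - (Y_val ** 2).sum(axis=0) - n_train * y_mean ** 2)
                    / (n_train - 1))
    return XtX_t / np.outer(x_std, x_std), XtY_t / np.outer(x_std, y_std)
```

A quick check against direct computation on the training rows confirms the shortcut reproduces the column-wise centered and scaled products for this sketch:

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
Y = rng.standard_normal((20, 2))
val_idx = np.arange(5)
train = np.setdiff1d(np.arange(20), val_idx)
Xt = (X[train] - X[train].mean(0)) / X[train].std(0, ddof=1)
Yt = (Y[train] - Y[train].mean(0)) / Y[train].std(0, ddof=1)
XtX_fast, XtY_fast = fold_train_products(X, Y, val_idx)
assert np.allclose(XtX_fast, Xt.T @ Xt) and np.allclose(XtY_fast, Xt.T @ Yt)
```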