Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments
{"title":"Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\\mathbf{X}^\\mathbf{T}\\mathbf{X}$ and $\\mathbf{X}^\\mathbf{T}\\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments","authors":"Ole-Christian Galbo Engstrøm","doi":"arxiv-2401.13185","DOIUrl":null,"url":null,"abstract":"Cross-validation is a widely used technique for assessing the performance of\npredictive models on unseen data. Many predictive models, such as Kernel-Based\nPartial Least-Squares (PLS) models, require the computation of\n$\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$\nusing only training set samples from the input and output matrices,\n$\\mathbf{X}$ and $\\mathbf{Y}$, respectively. In this work, we present three\nalgorithms that efficiently compute these matrices. The first one allows no\ncolumn-wise preprocessing. The second one allows column-wise centering around\nthe training set means. The third one allows column-wise centering and\ncolumn-wise scaling around the training set means and standard deviations.\nDemonstrating correctness and superior computational complexity, they offer\nsignificant cross-validation speedup compared with straight-forward\ncross-validation and previous work on fast cross-validation - all without data\nleakage. Their suitability for parallelization is highlighted with an\nopen-source Python implementation combining our algorithms with Improved Kernel\nPLS.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.13185","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Cross-validation is a widely used technique for assessing the performance of predictive models on unseen data. Many predictive models, such as Kernel-Based Partial Least-Squares (PLS) models, require the computation of $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ using only training set samples from the input and output matrices, $\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three algorithms that efficiently compute these matrices. The first allows no column-wise preprocessing. The second allows column-wise centering around the training set means. The third allows column-wise centering and column-wise scaling around the training set means and standard deviations. We demonstrate their correctness and superior computational complexity: they offer a significant cross-validation speedup compared with straightforward cross-validation and previous work on fast cross-validation, all without data leakage. Their suitability for parallelization is highlighted with an open-source Python implementation combining our algorithms with Improved Kernel PLS.
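The abstract does not state the update formulas, but the general idea it describes, deriving each fold's training-set $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ (optionally centered and scaled with training-set statistics) from quantities computed once over the full data, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name `fold_train_products` and the choice of sample standard deviations (`ddof=1`) are assumptions made for the example, and the validation-row subtraction plus centering identity are standard linear-algebra facts, not reproduced from the paper.

```python
import numpy as np

def fold_train_products(X, Y, val_idx, center=True, scale=True):
    """Illustrative sketch: derive the training-set X^T X and X^T Y for one
    cross-validation fold by subtracting the validation rows' contribution
    from full-data products, instead of recomputing over the training rows."""
    N = X.shape[0]
    X_val, Y_val = X[val_idx], Y[val_idx]
    n_train = N - len(val_idx)

    # Full-data quantities; in practice these would be computed once and
    # reused across all folds (shown inline here to keep the sketch self-contained).
    XtX, XtY = X.T @ X, X.T @ Y
    x_sum, y_sum = X.sum(axis=0), Y.sum(axis=0)
    x_sq, y_sq = (X ** 2).sum(axis=0), (Y ** 2).sum(axis=0)

    # Training-set products: remove the held-out rows' outer products.
    XtX_t = XtX - X_val.T @ X_val
    XtY_t = XtY - X_val.T @ Y_val
    if not center and not scale:
        return XtX_t, XtY_t

    # Training-set column means from full-data and validation column sums.
    x_mean = (x_sum - X_val.sum(axis=0)) / n_train
    y_mean = (y_sum - Y_val.sum(axis=0)) / n_train

    # Centering identity: X_c^T X_c = X_t^T X_t - n_t * m_x m_x^T,
    # and analogously X_c^T Y_c = X_t^T Y_t - n_t * m_x m_y^T.
    XtX_t = XtX_t - n_train * np.outer(x_mean, x_mean)
    XtY_t = XtY_t - n_train * np.outer(x_mean, y_mean)
    if not scale:
        return XtX_t, XtY_t

    # Training-set sample standard deviations (ddof=1 assumed).
    x_std = np.sqrt((x_sq - (X_val ** 2).sum(axis=0) - n_train * x_mean ** 2)
                    / (n_train - 1))
    y_std = np.sqrt((y_sq - (Y_val ** 2).sum(axis=0) - n_train * y_mean ** 2)
                    / (n_train - 1))
    return XtX_t / np.outer(x_std, x_std), XtY_t / np.outer(x_std, y_std)
```

A quick check against direct computation on the training rows confirms the shortcut reproduces the column-wise centered and scaled products for this sketch:

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
Y = rng.standard_normal((20, 2))
val_idx = np.arange(5)
train = np.setdiff1d(np.arange(20), val_idx)
Xt = (X[train] - X[train].mean(0)) / X[train].std(0, ddof=1)
Yt = (Y[train] - Y[train].mean(0)) / Y[train].std(0, ddof=1)
XtX_fast, XtY_fast = fold_train_products(X, Y, val_idx)
assert np.allclose(XtX_fast, Xt.T @ Xt) and np.allclose(XtY_fast, Xt.T @ Yt)
```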