Inference for Large Scale Regression Models with Dependent Errors
Lionel Voirol, Haotian Xu, Yuming Zhang, Luca Insolia, Roberto Molinari, Stéphane Guerrier
arXiv - STAT - Methodology, 2024-09-08
DOI: arxiv-2409.05160
Abstract
The exponential growth in data sizes and storage costs has brought
considerable challenges to the data science community, requiring solutions to
run learning methods on such data. While machine learning has scaled to achieve
predictive accuracy in big data settings, statistical inference and uncertainty
quantification tools are still lagging. Priority scientific fields collect vast
amounts of data to understand phenomena typically studied with statistical methods like
regression. In this setting, regression parameter estimation can benefit from
efficient computational procedures, but the main challenge lies in computing
the parameters of error processes with complex covariance structures. Identifying
and estimating these structures is essential for inference and is often used for
uncertainty quantification in machine learning with Gaussian Processes.
However, estimating these structures becomes burdensome as data scales,
requiring approximations that compromise the reliability of outputs. These
approximations are even more unreliable when complexities like long-range
dependencies or missing data are present. This work defines the Generalized
Method of Wavelet Moments with Exogenous variables (GMWMX) and proves its
statistical properties. The GMWMX is a highly scalable, stable, and statistically
valid method for estimating and delivering inference for linear models using
stochastic processes in the presence of data complexities like latent
dependence structures and missing data. Applied examples from Earth Sciences
and extensive simulations highlight the advantages of the GMWMX.
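
The abstract does not spell out the estimation steps, so the following is only a minimal sketch of the general idea behind matching wavelet moments on regression residuals, not the authors' implementation. It assumes a two-step scheme (least squares for the regression coefficients, then a fit of the empirical Haar wavelet variance of the residuals to a model-implied one) and uses white noise as the error model, for which the Haar wavelet variance at scale 2^j is sigma^2 / 2^j. A dependent error model would replace that formula. All names (`haar_wavelet_variance`, `sigma2_hat`, the simulated design) are hypothetical.

```python
import numpy as np

def haar_wavelet_variance(x, n_levels):
    """Empirical Haar wavelet variance of x at scales 2^1, ..., 2^n_levels."""
    x = np.asarray(x, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(x)))  # csum[k] = x[0] + ... + x[k-1]
    wv = np.empty(n_levels)
    for j in range(1, n_levels + 1):
        half, full = 2 ** (j - 1), 2 ** j
        starts = np.arange(x.size - full + 1)
        first = csum[starts + half] - csum[starts]           # sum over first half-window
        second = csum[starts + full] - csum[starts + half]   # sum over second half-window
        coeffs = (second - first) / full                     # Haar coefficient at scale 2^j
        wv[j - 1] = np.mean(coeffs ** 2)
    return wv

# Simulated data (assumption): linear trend plus white-noise errors.
rng = np.random.default_rng(0)
n = 20_000
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), t])
beta_true = np.array([1.0, 0.5])
sigma2_true = 4.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)

# Step 1: least-squares estimate of the regression coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Step 2: match the empirical wavelet variance of the residuals to the
# white-noise form sigma^2 / 2^j by unweighted least squares in sigma^2.
n_levels = 6
wv_hat = haar_wavelet_variance(residuals, n_levels)
unit_wv = 1.0 / 2.0 ** np.arange(1, n_levels + 1)  # model wavelet variance for sigma^2 = 1
sigma2_hat = float(unit_wv @ wv_hat / (unit_wv @ unit_wv))
```

The cumulative-sum trick keeps each scale at O(n) cost, which is the kind of scaling that makes moment matching attractive for long series. The unweighted fit in the second step is a simplification chosen for brevity.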