Learning low-dimensional nonlinear structures from high-dimensional noisy data: An integral operator approach
Xiucai Ding, Rong Ma. Annals of Statistics (August 2023). doi:10.1214/23-aos2306

We propose a kernel-spectral embedding algorithm for learning low-dimensional nonlinear structures from noisy and high-dimensional observations, where the data sets are assumed to be sampled from a nonlinear manifold model and corrupted by high-dimensional noise. The algorithm employs an adaptive bandwidth selection procedure that does not rely on prior knowledge of the underlying manifold. The resulting low-dimensional embeddings can be further utilized for downstream purposes such as data visualization, clustering and prediction. Our method is theoretically justified and practically interpretable. Specifically, for a general class of kernel functions, we establish the convergence of the final embeddings to their noiseless counterparts when the dimension grows polynomially with the sample size, and we characterize the effect of the signal-to-noise ratio on the rate of convergence and the phase transition. We also prove the convergence of the embeddings to the eigenfunctions of an integral operator defined by the kernel map of some reproducing kernel Hilbert space capturing the underlying nonlinear structures. Our results hold even when the dimension of the manifold grows with the sample size. Numerical simulations and analyses of real data sets show the superior empirical performance of the proposed method, compared with many existing methods, in learning various nonlinear manifolds in diverse applications.
Universality of regularized regression estimators in high dimensions
Qiyang Han, Yandi Shen. Annals of Statistics (August 2023). doi:10.1214/23-aos2309

The Convex Gaussian Min–Max Theorem (CGMT) has emerged as a prominent theoretical tool for analyzing the precise stochastic behavior of various statistical estimators in the so-called high-dimensional proportional regime, where the sample size and the signal dimension are of the same order. However, a well-recognized limitation of the existing CGMT machinery rests in its stringent requirement that the design matrix be exactly Gaussian, which renders the resulting precise high-dimensional asymptotics largely a Gaussian-specific theory in various important statistical models. This paper provides a structural universality framework for a broad class of regularized regression estimators that is particularly compatible with the CGMT machinery. Here, universality means that if a “structure” is satisfied by the regression estimator μ̂_G for a standard Gaussian design G, then it will also be satisfied by μ̂_A for a general non-Gaussian design A with independent entries. In particular, we show that, given a good enough ℓ∞ bound for the regression estimator μ̂_A, any “structural property” that can be detected via the CGMT for μ̂_G also holds for μ̂_A under a general design A with independent entries. As a proof of concept, we demonstrate our new universality framework in three key examples of regularized regression estimators: the Ridge, Lasso and regularized robust regression estimators, where new universality properties of risk asymptotics and/or distributions of regression estimators and other related quantities are proved. As a major statistical implication of the Lasso universality results, we validate inference procedures using the degrees-of-freedom adjusted debiased Lasso under general design and error distributions. We also provide a counterexample, showing that universality properties for regularized regression estimators do not extend to general isotropic designs. The proof of our universality results relies on new comparison inequalities for the optimum of a broad class of cost functions and Gordon’s max–min (or min–max) costs, over arbitrary structure sets subject to ℓ∞ constraints. These results may be of independent interest and broader applicability.
Noisy linear inverse problems under convex constraints: Exact risk asymptotics in high dimensions
Qiyang Han. Annals of Statistics (August 2023). doi:10.1214/23-aos2301

In the standard Gaussian linear measurement model Y = Xμ_0 + ξ ∈ R^m with a fixed noise level σ > 0, we consider the problem of estimating the unknown signal μ_0 under a convex constraint μ_0 ∈ K, where K is a closed convex set in R^n. We show that the risk of the natural convex constrained least squares estimator (LSE) μ̂(σ) can be characterized exactly in high-dimensional limits by that of the convex constrained LSE μ̂_K^seq in the corresponding Gaussian sequence model at a different noise level. Formally, we show that ‖μ̂(σ) − μ_0‖²/(n r_n²) → 1 in probability, where r_n² > 0 solves the fixed-point equation E‖μ̂_K^seq(√((r_n² + σ²)/(m/n))) − μ_0‖² = n r_n². This characterization holds (uniformly) for risks r_n² in the maximal regime, ranging from constant order all the way down to essentially the parametric rate, as long as a certain necessary nondegeneracy condition is satisfied for μ̂(σ). The precise risk characterization reveals a fundamental difference between noiseless (or low noise limit) and noisy linear inverse problems in terms of the sample complexity for signal recovery. A concrete example is given by the isotonic regression problem: while exact recovery of a general monotone signal requires m ≫ n^{1/3} samples in the noiseless setting, consistent signal recovery in the noisy setting requires as few as m ≫ log n samples. Such a discrepancy occurs when the low and high noise risk behaviors of μ̂_K^seq differ significantly. In statistical language, this occurs when μ̂_K^seq estimates 0 at a faster “adaptation rate” than the slower “worst-case rate” for general signals. Several other examples, including nonnegative least squares and the generalized Lasso (in constrained form), are also worked out to demonstrate the concrete applicability of the theory in problems of different types. The proof relies on a collection of new analytic and probabilistic results concerning the estimation error, log-likelihood ratio test statistics and degrees of freedom associated with μ̂_K^seq, regarded as stochastic processes indexed by the noise level. These results are of independent interest.
Graphical models for nonstationary time series
Sumanta Basu, Suhasini Subba Rao. Annals of Statistics (August 2023). doi:10.1214/22-aos2205

We propose NonStGM, a general nonparametric graphical modeling framework for studying dynamic associations among the components of a nonstationary multivariate time series. It builds on the framework of Gaussian graphical models (GGM) and stationary time series graphical models (StGM), and complements existing work on parametric graphical models based on change point vector autoregressions (VAR). Analogous to StGM, the proposed framework captures conditional noncorrelations (both intertemporal and contemporaneous) in the form of an undirected graph. In addition, to describe the more nuanced nonstationary relationships among the components of the time series, we introduce the new notion of conditional nonstationarity/stationarity and incorporate it within the graph. This can be used to search for small subnetworks that serve as the “source” of nonstationarity in a large system. We explicitly connect conditional noncorrelation and stationarity between and within components of the multivariate time series to zero and Toeplitz embeddings of an infinite-dimensional inverse covariance operator. In the Fourier domain, conditional stationarity and noncorrelation relationships in the inverse covariance operator are encoded by a specific sparsity structure of its integral kernel operator. We show that these sparsity patterns can be recovered from finite-length time series by nodewise regression of discrete Fourier transforms (DFT) across different Fourier frequencies. We demonstrate the feasibility of learning the NonStGM structure from data using simulation studies.
Single index Fréchet regression
Satarupa Bhattacharjee, Hans-Georg Müller. Annals of Statistics (August 2023). doi:10.1214/23-aos2307

Single index models provide an effective dimension reduction tool in regression, especially for high-dimensional data, by projecting a general multivariate predictor onto a direction vector. We propose a novel single-index model for regression settings where metric space-valued random object responses are coupled with multivariate Euclidean predictors. The responses in this model are complex, non-Euclidean data, including covariance matrices, graph Laplacians of networks and univariate probability distribution functions, among other objects that lie in abstract metric spaces. While Fréchet regression has proved useful for modeling the conditional mean of such random objects given multivariate Euclidean vectors, it does not provide regression parameters such as slopes or intercepts, since the metric space-valued responses are not amenable to linear operations. As a consequence, distributional results for Fréchet regression have been elusive. We show here that, for the case of multivariate Euclidean predictors, the parameters that define the single index and projection vector can substitute for the inherent absence of parameters in Fréchet regression. Specifically, we derive the asymptotic distribution of suitable estimates of these parameters, which can then be utilized to test linear hypotheses for the parameters, subject to an identifiability condition. Consistent estimation of the link function of the single index Fréchet regression model is obtained through local linear Fréchet regression. We demonstrate the finite sample performance of estimation and inference for the proposed model through simulation studies, including the special cases where responses are probability distributions and graph adjacency matrices. The method is illustrated with resting-state functional Magnetic Resonance Imaging (fMRI) data from the ADNI study.
Optimal change-point detection and localization
Nicolas Verzelen, Magalie Fromont, Matthieu Lerasle, Patricia Reynaud-Bouret. Annals of Statistics (August 2023). doi:10.1214/23-aos2297

Given a time series Y in R^n with a piecewise constant mean and independent components, the twin problems of change-point detection and change-point localization amount, respectively, to detecting the existence of times where the mean varies and to estimating the positions of those change-points. In this work, we tightly characterize optimal rates for both problems and uncover the phase transition phenomenon from a global testing problem to a local estimation problem. Introducing a suitable definition of the energy of a change-point, we first establish in the single change-point setting that the optimal detection threshold is √(2 log log(n)). When the energy is just above the detection threshold, the problem of localizing the change-point becomes purely parametric: it depends only on the difference in means and no longer on the position of the change-point. Interestingly, for most change-point positions, including all those away from the endpoints of the time series, it is possible to detect and localize them at a much smaller energy level. In the multiple change-point setting, we establish the energy detection threshold and show similarly that the optimal localization error of a specific change-point becomes purely parametric. Along the way, tight minimax rates for the Hausdorff and ℓ_1 estimation losses of the vector of all change-point positions are also established. Two procedures achieving these optimal rates are introduced. The first is a least-squares estimator with a new multiscale penalty that favours well-spread change-points. The second is a two-step multiscale post-processing procedure whose computational complexity can be as low as O(n log(n)). Notably, these two procedures accommodate the presence of possibly many low-energy, and therefore undetectable, change-points and are still able to detect and localize high-energy change-points even in the presence of those nuisance parameters.
Bootstrapping persistent Betti numbers and other stabilizing statistics
Benjamin Roycraft, Johannes Krebs, Wolfgang Polonik. Annals of Statistics (August 2023). doi:10.1214/23-aos2277

We investigate multivariate bootstrap procedures for general stabilizing statistics, with specific application to topological data analysis. The work relates to other general results in the area of stabilizing statistics, including central limit theorems for geometric and topological functionals of Poisson and binomial processes in the critical regime, where limit theorems prove difficult to use in practice, motivating the use of a bootstrap approach. A smoothed bootstrap procedure is shown to give consistent estimation in these settings. Specific statistics considered include the persistent Betti numbers of Čech and Vietoris–Rips complexes over point sets in R^d, along with Euler characteristics and the total edge length of the k-nearest neighbor graph. Special emphasis is given to weakening the conditions needed to establish bootstrap consistency. In particular, the assumption of a continuous underlying density is not required. Numerical studies illustrate the performance of the proposed method.
On lower bounds for the bias-variance trade-off
Alexis Derumigny, Johannes Schmidt-Hieber. Annals of Statistics (August 2023). doi:10.1214/23-aos2279

It is a common phenomenon that, for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known about whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to what extent the bias-variance trade-off is unavoidable and makes it possible to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures, as well as information measures such as the Kullback–Leibler or chi-square divergence. Some of these inequalities rely on a new concept of information matrices. In the second part of the article, the abstract lower bounds are applied to several statistical models, including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur, varying considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we propose to combine the general strategy for lower bounds with a reduction technique. This allows us to reduce the original problem to a lower bound on the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. To highlight possible extensions of the proposed framework, we moreover briefly discuss the trade-off between bias and mean absolute deviation.
Off-policy evaluation in partially observed Markov decision processes under sequential ignorability
Yuchen Hu, Stefan Wager. Annals of Statistics (August 2023). doi:10.1214/23-aos2287

We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy, given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies as well as the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.
Extreme value inference for heterogeneous power law data
John H.J. Einmahl, Yi He. Annals of Statistics (June 2023). doi:10.1214/23-aos2294

We extend extreme value statistics to independent data with possibly very different distributions. In particular, we present novel asymptotic normality results for the Hill estimator, which now estimates the extreme value index of the average distribution. Due to the heterogeneity, the asymptotic variance can be substantially smaller than in the i.i.d. case. As a special case, we consider a heterogeneous scales model where the asymptotic variance can be calculated explicitly. The primary tool for the proofs is a functional central limit theorem for a weighted tail empirical process. We also present asymptotic normality results for the extreme quantile estimator. A simulation study shows the good finite-sample behavior of our limit theorems. We also present applications assessing the tail heaviness of earthquake energies and of cross-sectional stock market losses.