
Annals of Statistics: Latest Articles

Inference on the maximal rank of time-varying covariance matrices using high-frequency data
Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2023-04-01 DOI: 10.1214/23-aos2273
Markus Reiss, Lars Winkelmann
We study the rank of the instantaneous or spot covariance matrix $\Sigma_X(t)$ of a multidimensional process $X(t)$. Given high-frequency observations $X(i/n)$, $i=0,\dots,n$, we test the null hypothesis $\operatorname{rank}(\Sigma_X(t)) \le r$ for all $t$ against local alternatives where the average $(r+1)$st eigenvalue is larger than some signal detection rate $v_n$. A major problem is that the inherent averaging in local covariance statistics produces a bias that distorts the rank statistics. We show that the bias depends on the regularity and spectral gap of $\Sigma_X(t)$. We establish explicit matrix perturbation and concentration results that provide nonasymptotic uniform critical values and optimal signal detection rates $v_n$. This leads to a rank estimation method via sequential testing. For a class of stochastic volatility models, we determine data-driven critical values via normed p-variations of estimated local covariance matrices. The methods are illustrated by simulations and an application to high-frequency data of U.S. government bonds.
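To make the idea of an "average (r+1)st eigenvalue" of local covariance statistics concrete, here is a minimal sketch. It is not the authors' test statistic and uses no critical values; the function name `average_eigenvalue`, the block size, and the simulated processes are all assumptions chosen for illustration. The loading vector is held constant in time, so the averaging bias discussed in the abstract does not arise in this toy case.

```python
import numpy as np

def average_eigenvalue(increments, block_size, k):
    """Average the k-th largest eigenvalue (k is 1-based) of block-wise
    realized covariance matrices built from high-frequency increments."""
    n, d = increments.shape
    n_blocks = n // block_size
    dt = 1.0 / n  # observations X(i/n) on the unit interval
    eigs = []
    for b in range(n_blocks):
        block = increments[b * block_size:(b + 1) * block_size]
        # Local covariance estimate: sum of outer products over block duration.
        local_cov = block.T @ block / (block_size * dt)
        # eigvalsh returns eigenvalues in ascending order.
        eigs.append(np.linalg.eigvalsh(local_cov)[d - k])
    return float(np.mean(eigs))

rng = np.random.default_rng(0)
n, d = 10_000, 3

# Hypothetical rank-1 process: every coordinate is driven by one Brownian motion.
loading = np.array([1.0, 0.5, -2.0])
dB = rng.normal(scale=np.sqrt(1.0 / n), size=(n, 1))
rank_one_increments = dB * loading  # broadcasts to shape (n, d)

# Hypothetical full-rank process: independent Brownian motions.
full_rank_increments = rng.normal(scale=np.sqrt(1.0 / n), size=(n, d))

lam2_rank_one = average_eigenvalue(rank_one_increments, block_size=100, k=2)
lam2_full = average_eigenvalue(full_rank_increments, block_size=100, k=2)
print(lam2_rank_one, lam2_full)
```

Under the null rank ≤ 1 the averaged second eigenvalue is numerically zero, while for the full-rank process it is of order one. A real test would compare this statistic with a critical value, which this sketch does not attempt.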
Citations: 0
Optimally tackling covariate shift in RKHS-based nonparametric regression
Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2023-04-01 DOI: 10.1214/23-aos2268
Cong Ma, Reese Pathak, Martin J. Wainwright
We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of the likelihood ratio apart from an upper bound on it. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naïve estimator, which minimizes the empirical risk over the function class, is strictly suboptimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax optimal, up to logarithmic factors.
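The reweighting idea can be sketched as a weighted kernel ridge regression whose sample weights are likelihood ratios capped at a truncation level. This is an illustration under stated assumptions, not the authors' estimator: the Gaussian kernel, the source and target densities, the truncation level `tau`, and the regularization `lam` are hypothetical choices. The weighted objective (1/n) Σ w_i (y_i - f(x_i))² + λ‖f‖² has the closed-form coefficients α = (WK + nλI)⁻¹ W y.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.5):
    """Gaussian kernel matrix between two 1-D sample arrays."""
    sq = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def reweighted_krr(x, y, weights, lam, bandwidth=0.5):
    """Weighted KRR: minimizes (1/n) sum_i w_i (y_i - f(x_i))^2 + lam ||f||^2."""
    n = len(x)
    K = gaussian_kernel(x, x, bandwidth)
    W = np.diag(weights)
    alpha = np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)
    return lambda x_new: gaussian_kernel(x_new, x, bandwidth) @ alpha

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)           # source: Uniform(-1, 1), density 1/2
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=n)

# Hypothetical target density q(x) = (3/8)|x|^(-1/4) on [-1, 1], so the
# likelihood ratio q/p = (3/4)|x|^(-1/4) is unbounded near 0 but has a
# finite second moment under the source distribution.
rho = 0.75 * np.abs(x) ** (-0.25)

tau = 1.5                                    # assumed truncation level
weights = np.minimum(rho, tau)               # truncated likelihood ratios
lam = 1e-3                                   # assumed regularization parameter

f_hat = reweighted_krr(x, y, weights, lam)
print(f_hat(np.array([0.0, 0.5])))
```

With all weights equal to one the same function reduces to ordinary KRR, which is a convenient sanity check. How to choose `tau` and `lam` is the substance of the paper and is not addressed here.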
Citations: 0
On high-dimensional Poisson models with measurement error: Hypothesis testing for nonlinear nonconvex optimization.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2023-02-01 DOI: 10.1214/22-aos2248
Fei Jiang, Yeqing Zhou, Jianxuan Liu, Yanyuan Ma

We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members of the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.

Citations: 0
On High dimensional Poisson models with measurement error: hypothesis testing for nonlinear nonconvex optimization
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-12-31 DOI: 10.48550/arXiv.2301.00139
Fei Jiang, Yeqing Zhou, Jianxuan Liu, Yanyuan Ma
We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members of the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.
Citations: 1
BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-12-01 DOI: 10.1214/22-aos2231
Peng Liao, Zhengling Qi, Runzhe Wan, Predrag Klasnja, Susan A Murphy

We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.

Citations: 55
LINEAR BIOMARKER COMBINATION FOR CONSTRAINED CLASSIFICATION.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-10-01 Epub Date: 2022-10-27 DOI: 10.1214/22-aos2210
Yijian Huang, Martin G Sanda

Multiple biomarkers are often combined to improve disease diagnosis. The uniformly optimal combination, i.e., with respect to all reasonable performance metrics, unfortunately requires excessive distributional modeling, to which the estimation can be sensitive. An alternative strategy is rather to pursue local optimality with respect to a specific performance metric. Nevertheless, existing methods may not target clinical utility of the intended medical test, which usually needs to operate above a certain sensitivity or specificity level, or do not have their statistical properties well studied and understood. In this article, we develop and investigate a linear combination method to maximize the clinical utility empirically for such a constrained classification. The combination coefficient is shown to have cube root asymptotics. The convergence rate and limiting distribution of the predictive performance are subsequently established, exhibiting robustness of the method in comparison with others. An algorithm with sound statistical justification is devised for efficient and high-quality computation. Simulations corroborate the theoretical results, and demonstrate good statistical and computational performance. Illustration with a clinical study on aggressive prostate cancer detection is provided.

Citations: 1
DOUBLY DEBIASED LASSO: HIGH-DIMENSIONAL INFERENCE UNDER HIDDEN CONFOUNDING.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-06-01 Epub Date: 2022-06-16 DOI: 10.1214/21-aos2152
Zijian Guo, Domagoj Ćevid, Peter Bühlmann

Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting, where the measured covariates are affected by hidden confounding and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. Our advocated method simultaneously corrects both the bias due to estimation of high-dimensional parameters as well as the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

Citations: 23
SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-04-01 Epub Date: 2022-04-07 DOI: 10.1214/21-aos2133
Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J Tibshirani

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
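A minimal sketch of the minimum-norm ("ridgeless") interpolator in the overparametrized case p > n follows. The dimensions, noise level, and isotropic Gaussian features are assumptions for illustration and do not reproduce the paper's risk analysis. The sketch relies on two standard facts: the pseudoinverse solution interpolates the data with the smallest Euclidean norm, and it is the limit of ridge regression as the penalty tends to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200                      # more parameters than samples
X = rng.normal(size=(n, p))         # isotropic features (Sigma = I), an assumption
beta_true = rng.normal(size=p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Minimum l2-norm least squares solution via the Moore-Penrose pseudoinverse.
beta_ridgeless = np.linalg.pinv(X) @ y
train_residual = np.linalg.norm(X @ beta_ridgeless - y)

def ridge(X, y, lam):
    """Ridge solution in dual form: X^T (X X^T + lam I)^{-1} y."""
    n_samples = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n_samples), y)

# Ridge with a vanishing penalty approaches the ridgeless interpolator.
beta_small_ridge = ridge(X, y, 1e-8)

# Any other interpolator differs by a null-space vector and has a larger norm.
z = rng.normal(size=p)
null_vec = z - np.linalg.pinv(X) @ (X @ z)
other_interpolator = beta_ridgeless + null_vec

print(train_residual)
print(np.linalg.norm(beta_ridgeless), np.linalg.norm(other_interpolator))
```

Repeating this over a grid of p/n ratios and plotting the test error is one way to look for the double descent shape mentioned in the abstract, though that experiment is not carried out here.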

Citations: 579
OPTIMAL FALSE DISCOVERY RATE CONTROL FOR LARGE SCALE MULTIPLE TESTING WITH AUXILIARY INFORMATION.
IF 4.5 Zone 1 (Mathematics) Q1 Mathematics Pub Date: 2022-04-01 DOI: 10.1214/21-aos2128
Hongyuan Cao, Jun Chen, Xianyang Zhang

Large-scale multiple testing is a fundamental problem in high dimensional statistical inference. It is increasingly common that various types of auxiliary information, reflecting the structural relationship among the hypotheses, are available. Exploiting such auxiliary information can boost statistical power. To this end, we propose a framework based on a two-group mixture model with varying probabilities of being null for different hypotheses a priori, where a shape-constrained relationship is imposed between the auxiliary information and the prior probabilities of being null. An optimal rejection rule is designed to maximize the expected number of true positives when the average false discovery rate is controlled. Focusing on the ordered structure, we develop a robust EM algorithm to estimate the prior probabilities of being null and the distribution of p-values under the alternative hypothesis simultaneously. We show that the proposed method has better power than state-of-the-art competitors while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method. Datasets from genome-wide association studies are used to illustrate the new methodology.
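The two-group model with hypothesis-specific null probabilities can be sketched as follows. This is an illustration only: the prior null probabilities and the alternative p-value density are treated as known, whereas the paper estimates them with a shape-constrained EM algorithm, and the Beta(0.2, 1) alternative and the function name `lfdr_rejections` are assumptions. The rule shown is the familiar one of ranking hypotheses by local false discovery rate and rejecting the largest set whose average local FDR stays below the target level.

```python
import numpy as np

def lfdr_rejections(pvalues, null_prob, alt_density, alpha):
    """Reject the largest set of hypotheses, ordered by local FDR,
    whose average local FDR does not exceed alpha."""
    pvalues = np.asarray(pvalues, dtype=float)
    null_prob = np.asarray(null_prob, dtype=float)
    # Null p-values are Uniform(0, 1), so the null density equals 1.
    lfdr = null_prob / (null_prob + (1.0 - null_prob) * alt_density(pvalues))
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(len(lfdr), dtype=bool)
    if passing.size > 0:
        reject[order[: passing[-1] + 1]] = True
    return reject, lfdr

def beta_alt_density(p, a=0.2):
    """Hypothetical alternative density: Beta(a, 1), concentrated near zero."""
    return a * p ** (a - 1.0)

pvalues = [0.001, 0.2, 0.001, 0.8]
# Auxiliary information enters only through the prior null probabilities:
# the third hypothesis is a priori very likely to be null.
null_prob = [0.5, 0.5, 0.95, 0.5]

reject, lfdr = lfdr_rejections(pvalues, null_prob, beta_alt_density, alpha=0.1)
print(reject)
print(np.round(lfdr, 3))
```

The first and third hypotheses share the same p-value, yet only the first is rejected, because the third carries a much higher prior probability of being null.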

Citations: 15
Testability of high-dimensional linear models with nonsparse structures.
IF 4.5 Zone 1, Mathematics Q1 Mathematics Pub Date: 2022-04-01 Epub Date: 2022-04-07 DOI: 10.1214/19-aos1932
Jelena Bradic, Jianqing Fan, Yinchu Zhu

Understanding statistical inference under possibly non-sparse high-dimensional models has gained much interest recently. For a given component of the regression coefficient, we show that the difficulty of the problem depends on the sparsity of the corresponding row of the precision matrix of the covariates, not the sparsity of the regression coefficients. We develop new concepts of uniform and essentially uniform non-testability that allow the study of limitations of tests across a broad set of alternatives. Uniform non-testability identifies a collection of alternatives such that the power of any test, against any alternative in the group, is asymptotically at most equal to the nominal size. Implications of the new constructions include new minimax testability results that, in sharp contrast to the current results, do not depend on the sparsity of the regression parameters. We identify new tradeoffs between testability and feature correlation. In particular, we show that, in models with weak feature correlations, the minimax lower bound can be attained by a test whose power has the √n rate, regardless of the size of the model sparsity.

Citations: 11