
Annals of Statistics: Latest Publications

FEATURE ELIMINATION IN KERNEL MACHINES IN MODERATELY HIGH DIMENSIONS.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2019-02-01. DOI: 10.1214/18-AOS1696
Sayan Dasgupta, Yair Goldberg, Michael R Kosorok

We develop an approach for feature elimination in statistical learning with kernel machines, based on recursive elimination of features. We present theoretical properties of this method and show that it is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present a few case studies to show that the assumptions are met in most practical situations and present simulation results to demonstrate performance of the proposed approach.
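
As a rough illustration of the recursive idea, the sketch below greedily removes the feature whose deletion costs the least cross-validated accuracy of an RBF-kernel SVM. The elimination criterion (cross-validated accuracy), the kernel choice, and the function name kernel_rfe are assumptions of this illustration, not the specific criterion analyzed in the paper.

```python
# Minimal sketch of recursive feature elimination with a kernel machine.
# Assumption: features are ranked by the drop in cross-validated accuracy,
# a simple surrogate for the paper's elimination criterion.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def kernel_rfe(X, y, n_keep, cv=5):
    """Greedily eliminate features until only n_keep remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        scores = []
        for j in active:
            cols = [k for k in active if k != j]
            acc = cross_val_score(SVC(kernel="rbf"), X[:, cols], y, cv=cv).mean()
            scores.append((acc, j))
        # Drop the feature whose removal hurts accuracy the least.
        _, least_useful = max(scores)
        active.remove(least_useful)
    return active
```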

Citations: 18
NONPARAMETRIC TESTING FOR MULTIPLE SURVIVAL FUNCTIONS WITH NON-INFERIORITY MARGINS.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2019-02-01 (Epub 2018-11-30). DOI: 10.1214/18-AOS1686
Hsin-Wen Chang, Ian W McKeague

New nonparametric tests for the ordering of multiple survival functions are developed with the possibility of right censorship taken into account. The motivation comes from non-inferiority trials with multiple treatments. The proposed tests are based on nonparametric likelihood ratio statistics, which are known to provide more powerful tests than Wald-type procedures, but in this setting have only been studied for pairs of survival functions or in the absence of censoring. We introduce a novel type of pool adjacent violator algorithm that leads to a complete solution of the problem. The limit distributions can be expressed as weighted sums of squares involving projections of certain Gaussian processes onto the given ordered alternative. A simulation study shows that the new procedures have superior power to a competing combined-pairwise Cox model approach. We illustrate the proposed methods using data from a three-arm non-inferiority trial.
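
The pool-adjacent-violators idea mentioned above is easiest to see in its classical form for isotonic regression, and the sketch below implements that standard algorithm; the paper's novel variant for ordered survival functions is more involved and is not reproduced here.

```python
# Minimal sketch of the classical pool-adjacent-violators algorithm (PAVA)
# for weighted isotonic regression (nondecreasing least-squares fit).
import numpy as np

def pava(y, w=None):
    """Return the nondecreasing fit minimizing the weighted squared error."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = [[v, wt, 1] for v, wt in zip(y, w)]  # [block mean, weight, size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:       # adjacent violator: pool blocks
            m1, w1, n1 = blocks[i]
            m2, w2, n2 = blocks[i + 1]
            merged = [(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2]
            blocks[i:i + 2] = [merged]
            i = max(i - 1, 0)                     # re-check against the previous block
        else:
            i += 1
    return np.concatenate([[m] * n for m, _, n in blocks])
```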

Citations: 5
SPECTRAL METHOD AND REGULARIZED MLE ARE BOTH OPTIMAL FOR TOP-K RANKING.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2019-01-01 (Epub 2019-05-21). DOI: 10.1214/18-AOS1745
Yuxin Chen, Jianqing Fan, Cong Ma, Kaizheng Wang

This paper is concerned with the problem of top-K ranking from pairwise comparisons. Given a collection of n items and a few pairwise comparisons across them, one wishes to identify the set of K items that receive the highest ranks. To tackle this problem, we adopt the logistic parametric model - the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made significant progress towards characterizing the performance (e.g. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-K ranking remains unsettled. We demonstrate that under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity - the number of paired comparisons needed to ensure exact top-K identification, for the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates. We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan Θ theorem for symmetric matrices. This also allows us to close the gap between the ℓ2 error upper bound for the spectral method and the minimax lower limit.
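
For intuition, the following sketch shows one common form of the spectral step under the Bradley-Terry-Luce model: empirical win fractions define a lazy random walk whose stationary distribution serves as a score estimate, and the top-K items are read off from it. The normalization constant d (any upper bound on the number of comparisons per item), the handling of never-compared pairs via NaN entries, and the function name spectral_topk are choices made for this illustration rather than details taken from the paper.

```python
# Minimal sketch of a spectral (random-walk) ranking step under the BTL model.
import numpy as np

def spectral_topk(win_frac, K, d=None):
    """win_frac[i, j] = fraction of comparisons between i and j won by j,
    with np.nan where a pair was never compared. Returns the top-K item indices."""
    n = win_frac.shape[0]
    compared = ~np.isnan(win_frac)
    d = d or compared.sum(axis=1).max()           # upper bound on comparisons per item
    P = np.zeros((n, n))
    P[compared] = win_frac[compared] / d          # off-diagonal transition mass
    np.fill_diagonal(P, 0.0)
    P[np.diag_indices(n)] = 1.0 - P.sum(axis=1)   # lazy self-loops make rows sum to 1
    # Stationary distribution = leading left eigenvector of the transition matrix.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = np.abs(pi) / np.abs(pi).sum()
    return np.argsort(pi)[::-1][:K]
```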

Citations: 102
HYPOTHESIS TESTING ON LINEAR STRUCTURES OF HIGH DIMENSIONAL COVARIANCE MATRIX.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2019-01-01 (Epub 2019-10-31). DOI: 10.1214/18-AOS1779
Shurong Zheng, Zhao Chen, Hengjian Cui, Runze Li

This paper is concerned with tests of significance on high dimensional covariance structures, and aims to develop a unified framework for testing commonly-used linear covariance structures. We first construct a consistent estimator for parameters involved in the linear covariance structure, and then develop two tests for the linear covariance structures based on the entropy loss and quadratic loss used for covariance matrix estimation. To study the asymptotic properties of the proposed tests, we study related high dimensional random matrix theory, and establish several highly useful asymptotic results. With the aid of these asymptotic results, we derive the limiting distributions of these two tests under the null and alternative hypotheses. We further show that the quadratic loss based test is asymptotically unbiased. We conduct a Monte Carlo simulation study to examine the finite sample performance of the two tests. Our simulation results show that the limiting null distributions approximate their null distributions quite well, and that the corresponding asymptotic critical values control the Type I error rate well. Our numerical comparison implies that the proposed tests outperform existing ones in terms of controlling the Type I error rate and power. Our simulation indicates that the test based on quadratic loss seems to have better power than the test based on entropy loss.
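
For concreteness, the two discrepancy measures named in the abstract can be written out directly; the sketch below evaluates the entropy loss and the quadratic loss of a covariance estimate S relative to a hypothesized structured matrix Sigma0. The centering and scaling that turn these losses into the paper's test statistics under high-dimensional asymptotics are omitted, and the function names are illustrative.

```python
# Minimal sketch of the entropy loss and quadratic loss between a covariance
# estimate S and a hypothesized (structured) covariance Sigma0.
import numpy as np

def entropy_loss(S, Sigma0):
    """tr(Sigma0^{-1} S) - log det(Sigma0^{-1} S) - p."""
    p = S.shape[0]
    A = np.linalg.solve(Sigma0, S)
    _, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - p

def quadratic_loss(S, Sigma0):
    """tr((Sigma0^{-1} S - I)^2)."""
    p = S.shape[0]
    A = np.linalg.solve(Sigma0, S) - np.eye(p)
    return np.trace(A @ A)
```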

Citations: 18
UNIFORMLY VALID POST-REGULARIZATION CONFIDENCE REGIONS FOR MANY FUNCTIONAL PARAMETERS IN Z-ESTIMATION FRAMEWORK.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-12-01 (Epub 2018-09-11). DOI: 10.1214/17-AOS1671
Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, Ying Wei

In this paper, we develop procedures to construct simultaneous confidence bands for p̃ potentially infinite-dimensional parameters after model selection for general moment condition models where p̃ is potentially much larger than the sample size of available data, n. This allows us to cover settings with functional response data where each of the p̃ parameters is a function. The procedure is based on the construction of score functions that satisfy Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments as we allow for p̃ ≫ n). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.
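
A minimal sketch of the multiplier bootstrap step is given below: estimated per-observation score contributions are reweighted by independent standard normal multipliers, and the (1 - alpha) quantile of the resulting sup-norm statistics gives a simultaneous critical value. The input array scores is assumed to already hold the (approximately) Neyman-orthogonal scores from the construction described above; studentization by estimated standard errors is omitted for brevity.

```python
# Minimal sketch of a multiplier-bootstrap critical value for simultaneous bands.
import numpy as np

def multiplier_bootstrap_critical_value(scores, alpha=0.05, B=1000, seed=None):
    """scores: (n, p) array of estimated per-observation score contributions."""
    rng = np.random.default_rng(seed)
    n, _ = scores.shape
    sup_stats = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)                          # multiplier weights
        boot = np.sqrt(n) * (xi[:, None] * scores).mean(axis=0)
        sup_stats[b] = np.max(np.abs(boot))                  # sup-norm statistic
    return np.quantile(sup_stats, 1 - alpha)                 # simultaneous critical value
```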

Citations: 71
ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-12-01 (Epub 2018-09-11). DOI: 10.1214/17-AOS1661
Junlong Zhao, Guan Yu, Yufeng Liu

Robustness is a desirable property for many statistical techniques. As an important measure of robustness, breakdown point has been widely used for regression problems and many other settings. Despite the existing development, we observe that the standard breakdown point criterion is not directly applicable for many classification problems. In this paper, we propose a new breakdown point criterion, namely angular breakdown point, to better quantify the robustness of different classification methods. Using this new breakdown point criterion, we study the robustness of binary large margin classification techniques, although the idea is applicable to general classification methods. Both bounded and unbounded loss functions with linear and kernel learning are considered. These studies provide useful insights on the robustness of different classification methods. Numerical results further confirm our theoretical findings.
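
The sketch below gives an empirical illustration of the angular notion for a linear SVM: a growing fraction of observations is replaced by adversarially labelled points pushed far along the clean fitted direction, and breakdown is declared once the contaminated fit is at an obtuse angle (non-positive inner product) to the clean fit. The contamination scheme and the function name are heuristics for illustration, not the worst-case construction studied in the paper.

```python
# Heuristic check of angular breakdown for a linear SVM (labels assumed +1/-1).
import numpy as np
from sklearn.svm import LinearSVC

def angular_breakdown_fraction(X, y, step=0.02, max_frac=0.5, magnitude=1e3):
    """Smallest tried contamination fraction at which the fitted direction
    makes an obtuse angle with the fit on clean data."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w_clean = LinearSVC(max_iter=20000).fit(X, y).coef_.ravel()
    n = len(y)
    frac = step
    while frac <= max_frac:
        m = max(1, int(frac * n))
        Xc, yc = X.copy(), y.copy()
        idx = np.arange(m)                        # contaminate the first m rows
        Xc[idx] = magnitude * w_clean             # push far along the clean direction ...
        yc[idx] = -1                              # ... but label them as the negative class
        w_cont = LinearSVC(max_iter=20000).fit(Xc, yc).coef_.ravel()
        if np.dot(w_clean, w_cont) <= 0:          # obtuse angle: angular breakdown
            return frac
        frac += step
    return np.nan
```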

Citations: 7
A NEW PERSPECTIVE ON ROBUST M-ESTIMATION: FINITE SAMPLE THEORY AND APPLICATIONS TO DEPENDENCE-ADJUSTED MULTIPLE TESTING.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-10-01 (Epub 2018-08-17). DOI: 10.1214/17-AOS1606
Wen-Xin Zhou, Koushiki Bose, Jianqing Fan, Han Liu

Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber in 1973 [Ann. Statist. 1 (1973) 799-821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to sample size, dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest, the Berry-Esseen inequality and Cramér-type moderate deviation. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. It is shown that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets are also provided to back up our theory.
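
To make the diverging-tuning-parameter idea concrete in the simplest setting, the sketch below computes an adaptive Huber estimate of a mean: the robustification parameter tau grows with the sample size (here tau = sigma_hat * sqrt(n / log n), one choice in the spirit of the paper's adaptive tuning), and the estimating equation is solved by iteratively reweighted averaging. The specific tau formula and iteration scheme are assumptions of this illustration.

```python
# Minimal sketch of an adaptive Huber mean estimator with a diverging tau.
import numpy as np

def adaptive_huber_mean(x, n_iter=100, tol=1e-8):
    x = np.asarray(x, dtype=float)
    n = x.size
    tau = np.std(x) * np.sqrt(n / np.log(n))     # robustification parameter grows with n
    mu = np.median(x)                            # robust starting value
    for _ in range(n_iter):
        r = x - mu
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))  # Huber weights psi(r)/r
        mu_new = np.sum(w * x) / np.sum(w)       # weighted-average update
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```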

Citations: 0
ANALYSIS OF "LEARN-AS-YOU-GO" (LAGO) STUDIES. “随学随走”(lago)研究分析。
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-08-20. DOI: 10.1214/20-AOS1978
D. Nevo, J. Lok, D. Spiegelman
In Learn-As-you-GO (LAGO) adaptive studies, the intervention is a complex multicomponent package, and is adapted in stages during the study based on past outcome data. This design formalizes standard practice in public health intervention studies. An effective intervention package is sought, while minimizing intervention package cost. In LAGO study data, the interventions in later stages depend upon the outcomes in the previous stages, violating standard statistical theory. We develop an estimator for the intervention effects, and prove consistency and asymptotic normality using a novel coupling argument, ensuring the validity of the test for the hypothesis of no overall intervention effect. We develop a confidence set for the optimal intervention package and confidence bands for the success probabilities under alternative package compositions. We illustrate our methods in the BetterBirth Study, which aimed to improve maternal and neonatal outcomes among 157,689 births in Uttar Pradesh, India through a multicomponent intervention package.
Citations: 4
LARGE COVARIANCE ESTIMATION THROUGH ELLIPTICAL FACTOR MODELS.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-08-01 (Epub 2018-06-27). DOI: 10.1214/17-AOS1588
Jianqing Fan, Han Liu, Weichen Wang

We propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on the approximate factor model. A set of high level sufficient conditions for the procedure to achieve optimal rates of convergence under different matrix norms is established to better understand how POET works. Such a framework allows us to recover existing results for sub-Gaussian data in a more transparent way that only depends on the concentration properties of the sample covariance matrix. As a new theoretical contribution, for the first time, such a framework allows us to exploit conditional sparsity covariance structure for the heavy-tailed data. In particular, for the elliptical distribution, we propose a robust estimator based on the marginal and spatial Kendall's tau to satisfy these conditions. In addition, we study conditional graphical model under the same framework. The technical tools developed in this paper are of general interest to high dimensional principal component analysis. Thorough numerical results are also provided to back up the developed theory.
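
A minimal sketch of the POET construction is shown below: the leading K principal components of a covariance estimate form the factor part, the remaining idiosyncratic part is soft-thresholded off the diagonal, and the two pieces are recombined. Feeding in the sample covariance corresponds to the sub-Gaussian case; the paper's elliptical extension instead uses a rank-based (marginal/spatial Kendall's tau) input, and it uses entry-adaptive thresholds rather than the single constant tau assumed here.

```python
# Minimal sketch of POET-style covariance estimation: low-rank part plus
# soft-thresholded idiosyncratic part.
import numpy as np

def poet(Sigma_hat, K, tau):
    vals, vecs = np.linalg.eigh(Sigma_hat)
    order = np.argsort(vals)[::-1]                         # eigenvalues in decreasing order
    vals, vecs = vals[order], vecs[:, order]
    low_rank = (vecs[:, :K] * vals[:K]) @ vecs[:, :K].T    # factor (principal component) part
    resid = Sigma_hat - low_rank                           # idiosyncratic part
    off = np.sign(resid) * np.maximum(np.abs(resid) - tau, 0.0)  # soft-threshold entries
    np.fill_diagonal(off, np.diag(resid))                  # keep the diagonal unthresholded
    return low_rank + off
```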

Citations: 84
Consistency and convergence rate of phylogenetic inference via regularization.
IF 4.5, Tier 1 (Mathematics), Q1 Mathematics. Pub Date: 2018-08-01 (Epub 2018-06-27). DOI: 10.1214/17-AOS1592
Vu Dinh, Lam Si Tung Ho, Marc A Suchard, Frederick A Matsen
It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
Citations: 8