FEATURE ELIMINATION IN KERNEL MACHINES IN MODERATELY HIGH DIMENSIONS.
Sayan Dasgupta, Yair Goldberg, Michael R Kosorok
Pub Date : 2019-02-01; DOI: 10.1214/18-AOS1696
We develop an approach for feature elimination in statistical learning with kernel machines, based on recursive elimination of features. We present theoretical properties of this method and show that it is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present a few case studies to show that the assumptions are met in most practical situations and present simulation results to demonstrate performance of the proposed approach.
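The recursive scheme is easy to illustrate. The following is a minimal sketch, not the authors' algorithm: a Gaussian-kernel ridge fit stands in for the kernel machine, and at each pass the loop drops the feature whose removal degrades the regularized training criterion the least. All names and the criterion choice are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_criterion(X, y, lam=1e-2):
    # Regularized training objective of a kernel ridge fit to the labels.
    K = rbf_kernel(X, X)
    n = len(y)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    resid = y - K @ alpha
    return resid @ resid / n + lam * alpha @ K @ alpha

def recursive_feature_elimination(X, y, n_keep):
    # At each pass, drop the feature whose removal hurts the criterion least.
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        scores = [(fit_criterion(X[:, [k for k in active if k != j]], y), j)
                  for j in active]
        active.remove(min(scores)[1])       # least useful feature
    return sorted(active)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))                # only features 0 and 1 matter
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=60))
print(recursive_feature_elimination(X, y, n_keep=2))
```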
{"title":"FEATURE ELIMINATION IN KERNEL MACHINES IN MODERATELY HIGH DIMENSIONS.","authors":"Sayan Dasgupta, Yair Goldberg, Michael R Kosorok","doi":"10.1214/18-AOS1696","DOIUrl":"https://doi.org/10.1214/18-AOS1696","url":null,"abstract":"<p><p>We develop an approach for feature elimination in statistical learning with kernel machines, based on recursive elimination of features. We present theoretical properties of this method and show that it is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present a few case studies to show that the assumptions are met in most practical situations and present simulation results to demonstrate performance of the proposed approach.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOS1696","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36792835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-02-01Epub Date: 2018-11-30DOI: 10.1214/18-AOS1686
Hsin-Wen Chang, Ian W McKeague
New nonparametric tests for the ordering of multiple survival functions are developed with the possibility of right censorship taken into account. The motivation comes from non-inferiority trials with multiple treatments. The proposed tests are based on nonparametric likelihood ratio statistics, which are known to provide more powerful tests than Wald-type procedures, but in this setting have only been studied for pairs of survival functions or in the absence of censoring. We introduce a novel type of pool adjacent violator algorithm that leads to a complete solution of the problem. The limit distributions can be expressed as weighted sums of squares involving projections of certain Gaussian processes onto the given ordered alternative. A simulation study shows that the new procedures have superior power to a competing combined-pairwise Cox model approach. We illustrate the proposed methods using data from a three-arm non-inferiority trial.
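For readers unfamiliar with the building block the abstract extends, here is the classical (uncensored, least-squares) pool adjacent violators algorithm; the paper's contribution is a novel variant of this idea for ordered survival functions under censoring, which this sketch does not attempt.

```python
import numpy as np

def pava(y, w=None):
    # Classical pool adjacent violators: weighted least-squares fit of a
    # non-decreasing sequence to y, merging blocks that violate the order.
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            # Replace the two offending blocks by their weighted mean.
            wtot = weights[-2] + weights[-1]
            means[-2] = (weights[-2] * means[-2] + weights[-1] * means[-1]) / wtot
            weights[-2] = wtot
            sizes[-2] += sizes[-1]
            means.pop(); weights.pop(); sizes.pop()
    return np.repeat(means, sizes)

print(pava([1.0, 3.0, 2.0, 4.0, 3.5]))  # [1.  2.5  2.5  3.75 3.75]
```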
{"title":"NONPARAMETRIC TESTING FOR MULTIPLE SURVIVAL FUNCTIONS WITH NON-INFERIORITY MARGINS.","authors":"Hsin-Wen Chang, Ian W McKeague","doi":"10.1214/18-AOS1686","DOIUrl":"10.1214/18-AOS1686","url":null,"abstract":"<p><p>New nonparametric tests for the ordering of multiple survival functions are developed with the possibility of right censorship taken into account. The motivation comes from non-inferiority trials with multiple treatments. The proposed tests are based on nonparametric likelihood ratio statistics, which are known to provide more powerful tests than Wald-type procedures, but in this setting have only been studied for pairs of survival functions or in the absence of censoring. We introduce a novel type of pool adjacent violator algorithm that leads to a complete solution of the problem. The limit distributions can be expressed as weighted sums of squares involving projections of certain Gaussian processes onto the given ordered alternative. A simulation study shows that the new procedures have superior power to a competing combined-pairwise Cox model approach. We illustrate the proposed methods using data from a three-arm non-inferiority trial.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOS1686","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37341004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-01-01Epub Date: 2019-05-21DOI: 10.1214/18-AOS1745
Yuxin Chen, Jianqing Fan, Cong Ma, Kaizheng Wang
This paper is concerned with the problem of top-K ranking from pairwise comparisons. Given a collection of n items and a few pairwise comparisons between them, one wishes to identify the set of K items that receive the highest ranks. To tackle this problem, we adopt a parametric logistic model, the Bradley-Terry-Luce model, in which each item is assigned a latent preference score and the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made significant progress toward characterizing the performance (e.g., the mean squared error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand with respect to top-K ranking remains unsettled. We demonstrate that, under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity (the number of paired comparisons needed to ensure exact top-K identification) in the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates. We complement our theoretical studies with numerical experiments, which confirm that both methods yield low entrywise errors when estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan sin Θ theorem for symmetric matrices. This also allows us to close the gap between the ℓ2 error upper bound for the spectral method and the minimax lower limit.
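A minimal sketch of the spectral side of this story, under dense sampling for simplicity: pairwise win rates define a Markov chain whose stationary distribution is approximately proportional to the BTL weights, so sorting it yields the top-K set. The construction below is a generic Rank Centrality-type estimate, not the paper's exact normalization or its sparse random sampling model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, K = 8, 200, 3                 # items, comparisons per pair, top-K
theta = rng.normal(size=n)          # latent BTL preference scores
w = np.exp(theta)

# Simulate L comparisons for every pair (i, j) under the BTL model.
wins = np.zeros((n, n))             # wins[i, j] = times i beat j
for i in range(n):
    for j in range(i + 1, n):
        wins[i, j] = rng.binomial(L, w[i] / (w[i] + w[j]))
        wins[j, i] = L - wins[i, j]

# Random walk that moves from i to j in proportion to how often j beat i;
# its stationary distribution is (approximately) proportional to w.
P = wins.T / (L * n)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))
pi = np.ones(n) / n
for _ in range(2000):               # power iteration to the stationary law
    pi = pi @ P

print("estimated top-K:", sorted(np.argsort(pi)[::-1][:K]))
print("true top-K:     ", sorted(np.argsort(theta)[::-1][:K]))
```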
{"title":"SPECTRAL METHOD AND REGULARIZED MLE ARE BOTH OPTIMAL FOR TOP-<i>K</i> RANKING.","authors":"Yuxin Chen, Jianqing Fan, Cong Ma, Kaizheng Wang","doi":"10.1214/18-AOS1745","DOIUrl":"https://doi.org/10.1214/18-AOS1745","url":null,"abstract":"<p><p>This paper is concerned with the problem of top-<i>K</i> ranking from pairwise comparisons. Given a collection of <i>n</i> items and a few pairwise comparisons across them, one wishes to identify the set of <i>K</i> items that receive the highest ranks. To tackle this problem, we adopt the logistic parametric model - the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made significant progress towards characterizing the performance (e.g. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-<i>K</i> ranking remains unsettled. We demonstrate that under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity - the number of paired comparisons needed to ensure exact top-<i>K</i> identification, for the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates. We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan <math><mtext>Θ</mtext></math> theorem for symmetric matrices. This also allows us to close the gap between the <math><msub><mi>l</mi> <mn>2</mn></msub> </math> error upper bound for the spectral method and the minimax lower limit.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOS1745","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41189337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-01-01Epub Date: 2019-10-31DOI: 10.1214/18-AOS1779
Shurong Zheng, Zhao Chen, Hengjian Cui, Runze Li
This paper is concerned with tests of significance for high-dimensional covariance structures and aims to develop a unified framework for testing commonly used linear covariance structures. We first construct a consistent estimator of the parameters involved in the linear covariance structure, and then develop two tests for linear covariance structures based on the entropy loss and the quadratic loss used in covariance matrix estimation. To study the asymptotic properties of the proposed tests, we develop the relevant high-dimensional random matrix theory and establish several highly useful asymptotic results. With the aid of these results, we derive the limiting distributions of the two tests under the null and alternative hypotheses. We further show that the quadratic-loss-based test is asymptotically unbiased. We conduct a Monte Carlo simulation study to examine the finite-sample performance of the two tests. The simulations show that the limiting null distributions approximate the finite-sample null distributions quite well and that the corresponding asymptotic critical values control the Type I error rate accurately. Our numerical comparison shows that the proposed tests outperform existing ones in terms of both Type I error control and power, and that the quadratic-loss-based test tends to have better power than the entropy-loss-based test.
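The two losses the tests are built from are standard and easy to state in code. A sketch, with the caveat that the paper's actual test statistics involve estimated structure parameters and random-matrix corrections that are omitted here:

```python
import numpy as np

def entropy_loss(S, Sigma0):
    # Stein's entropy loss: tr(Sigma0^{-1} S) - log det(Sigma0^{-1} S) - p.
    p = S.shape[0]
    M = np.linalg.solve(Sigma0, S)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

def quadratic_loss(S, Sigma0):
    # Quadratic loss: tr[(Sigma0^{-1} S - I)^2].
    M = np.linalg.solve(Sigma0, S) - np.eye(S.shape[0])
    return np.trace(M @ M)

rng = np.random.default_rng(2)
n, p = 500, 20
X = rng.normal(size=(n, p))          # data generated under H0: Sigma = I
S = np.cov(X, rowvar=False)
print(entropy_loss(S, np.eye(p)), quadratic_loss(S, np.eye(p)))
```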
{"title":"HYPOTHESIS TESTING ON LINEAR STRUCTURES OF HIGH DIMENSIONAL COVARIANCE MATRIX.","authors":"Shurong Zheng, Zhao Chen, Hengjian Cui, Runze Li","doi":"10.1214/18-AOS1779","DOIUrl":"https://doi.org/10.1214/18-AOS1779","url":null,"abstract":"<p><p>This paper is concerned with test of significance on high dimensional covariance structures, and aims to develop a unified framework for testing commonly-used linear covariance structures. We first construct a consistent estimator for parameters involved in the linear covariance structure, and then develop two tests for the linear covariance structures based on entropy loss and quadratic loss used for covariance matrix estimation. To study the asymptotic properties of the proposed tests, we study related high dimensional random matrix theory, and establish several highly useful asymptotic results. With the aid of these asymptotic results, we derive the limiting distributions of these two tests under the null and alternative hypotheses. We further show that the quadratic loss based test is asymptotically unbiased. We conduct Monte Carlo simulation study to examine the finite sample performance of the two tests. Our simulation results show that the limiting null distributions approximate their null distributions quite well, and the corresponding asymptotic critical values keep Type I error rate very well. Our numerical comparison implies that the proposed tests outperform existing ones in terms of controlling Type I error rate and power. Our simulation indicates that the test based on quadratic loss seems to have better power than the test based on entropy loss.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6910252/pdf/nihms-1022732.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37459228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2018-12-01Epub Date: 2018-09-11DOI: 10.1214/17-AOS1671
Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, Ying Wei
In this paper, we develop procedures to construct simultaneous confidence bands for p̃ potentially infinite-dimensional parameters after model selection for general moment condition models, where p̃ is potentially much larger than the sample size n of the available data. This allows us to cover settings with functional response data in which each of the p̃ parameters is a function. The procedure is based on the construction of score functions that satisfy the Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments, as we allow for p̃ ≫ n). To construct the bands, we employ a multiplier bootstrap procedure, which is computationally efficient because it involves only resampling the estimated score functions (and does not require re-solving the high-dimensional optimization problems). We formally apply the general theory to inference on the regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data illustrate the applicability of the results.
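A minimal sketch of the multiplier bootstrap step, assuming the per-observation score contributions have already been estimated (the expensive model-selection and orthogonalization steps are taken as given); names are illustrative.

```python
import numpy as np

def multiplier_bootstrap_critical_value(scores, level=0.95, B=1000, rng=None):
    # scores: (n, p) estimated per-observation score contributions, one
    # column per target parameter.  Returns the bootstrap critical value
    # for the sup-norm of the studentized score average.
    rng = np.random.default_rng(0) if rng is None else rng
    n, _ = scores.shape
    scores = scores - scores.mean(axis=0)     # center the estimated scores
    sd = scores.std(axis=0)
    sups = np.empty(B)
    for b in range(B):
        xi = rng.normal(size=n)               # i.i.d. Gaussian multipliers
        boot = (xi[:, None] * scores).mean(axis=0)
        sups[b] = np.max(np.abs(np.sqrt(n) * boot / sd))
    return np.quantile(sups, level)

rng = np.random.default_rng(3)
scores = rng.normal(size=(300, 50))
c = multiplier_bootstrap_critical_value(scores, rng=rng)
# Simultaneous bands: estimate_j +/- c * sd_j / sqrt(n), for all j at once.
print(round(c, 3))
```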
{"title":"UNIFORMLY VALID POST-REGULARIZATION CONFIDENCE REGIONS FOR MANY FUNCTIONAL PARAMETERS IN Z-ESTIMATION FRAMEWORK.","authors":"Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, Ying Wei","doi":"10.1214/17-AOS1671","DOIUrl":"10.1214/17-AOS1671","url":null,"abstract":"<p><p>In this paper, we develop procedures to construct simultaneous confidence bands for <math><mover><mi>p</mi> <mo>˜</mo></mover> </math> potentially infinite-dimensional parameters after model selection for general moment condition models where <math> <mrow><mover><mi>p</mi> <mo>˜</mo></mover> </mrow> </math> is potentially much larger than the sample size of available data, <i>n</i>. This allows us to cover settings with functional response data where each of the <math> <mrow><mover><mi>p</mi> <mo>˜</mo></mover> </mrow> </math> parameters is a function. The procedure is based on the construction of score functions that satisfy Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments as we allow for <math> <mrow><mover><mi>p</mi> <mo>˜</mo></mover> <mo>≫</mo> <mi>n</mi></mrow> </math> ). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/17-AOS1671","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37129329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2018-12-01Epub Date: 2018-09-11DOI: 10.1214/17-AOS1661
Junlong Zhao, Guan Yu, Yufeng Liu
Robustness is a desirable property for many statistical techniques. As an important measure of robustness, the breakdown point has been widely used for regression problems and many other settings. Despite these developments, we observe that the standard breakdown point criterion is not directly applicable to many classification problems. In this paper, we propose a new breakdown point criterion, the angular breakdown point, to better quantify the robustness of different classification methods. Using this new criterion, we study the robustness of binary large-margin classification techniques, although the idea applies to general classification methods. Both bounded and unbounded loss functions, with linear and kernel learning, are considered. These studies provide useful insights into the robustness of different classification methods, and numerical results further confirm our theoretical findings.
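The notion can be probed numerically: train a linear classifier on clean data and on increasingly contaminated data, and track the angle between the two weight vectors. The probe below is purely illustrative (an L2-regularized logistic fit without intercept and an arbitrary contamination pattern), not the paper's population-level definition.

```python
import numpy as np

def fit_linear_logistic(X, y, lam=1e-2, iters=500, lr=0.1):
    # Gradient descent on L2-regularized logistic loss (no intercept,
    # for simplicity); labels y in {-1, +1}.  Returns the weight vector.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = np.clip(y * (X @ w), -30, 30)
        grad = -(X * (y / (1 + np.exp(m)))[:, None]).mean(0) + lam * w
        w -= lr * grad
    return w

def angle_deg(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2)) + 1.5
y = np.where(X[:, 0] + X[:, 1] > 3.0, 1.0, -1.0)
w_clean = fit_linear_logistic(X, y)

for m in (0, 10, 40, 80):           # number of contaminated observations
    Xc, yc = X.copy(), y.copy()
    Xc[:m] = 20.0                   # gross outliers ...
    yc[:m] = -1.0                   # ... with flipped labels
    print(m, round(angle_deg(w_clean, fit_linear_logistic(Xc, yc)), 1))
```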
{"title":"ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT.","authors":"Junlong Zhao, Guan Yu, Yufeng Liu","doi":"10.1214/17-AOS1661","DOIUrl":"10.1214/17-AOS1661","url":null,"abstract":"<p><p>Robustness is a desirable property for many statistical techniques. As an important measure of robustness, breakdown point has been widely used for regression problems and many other settings. Despite the existing development, we observe that the standard breakdown point criterion is not directly applicable for many classification problems. In this paper, we propose a new breakdown point criterion, namely angular breakdown point, to better quantify the robustness of different classification methods. Using this new breakdown point criterion, we study the robustness of binary large margin classification techniques, although the idea is applicable to general classification methods. Both bounded and unbounded loss functions with linear and kernel learning are considered. These studies provide useful insights on the robustness of different classification methods. Numerical results further confirm our theoretical findings.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/17-AOS1661","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36564699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2018-10-01Epub Date: 2018-08-17DOI: 10.1214/17-AOS1606
Wen-Xin Zhou, Koushiki Bose, Jianqing Fan, Han Liu
Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber [Ann. Statist. 1 (1973) 799-821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective: we let the tuning parameter diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to the sample size, the dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when the noise variables have only finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest: a Berry-Esseen inequality and a Cramér-type moderate deviation result. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. We show that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets back up our theory.
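For the simplest case, mean estimation, the adaptive tuning idea can be sketched in a few lines: the Huber threshold τ is not fixed but grows with the sample size, at a rate on the order of σ√(n/log n), so the estimator interpolates between the median and the mean. The rate constant and the MAD-based scale plug-in below are illustrative choices, not the paper's exact calibration.

```python
import numpy as np

def huber_mean(x, tau, iters=100):
    # Location M-estimator under Huber loss with threshold tau, computed
    # by iteratively reweighted averaging (weights min(1, tau/|residual|)).
    mu = np.median(x)
    for _ in range(iters):
        r = np.abs(x - mu)
        wts = np.where(r <= tau, 1.0, tau / np.maximum(r, 1e-12))
        mu = np.sum(wts * x) / np.sum(wts)
    return mu

rng = np.random.default_rng(5)
n = 2000
x = rng.standard_t(df=2.5, size=n)   # heavy tails, but finite variance
sigma_hat = np.median(np.abs(x - np.median(x))) / 0.6745   # MAD scale
tau = sigma_hat * np.sqrt(n / np.log(n))   # diverging, not fixed, threshold
print(huber_mean(x, tau), np.mean(x))
```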
{"title":"A NEW PERSPECTIVE ON ROBUST <i>M</i>-ESTIMATION: FINITE SAMPLE THEORY AND APPLICATIONS TO DEPENDENCE-ADJUSTED MULTIPLE TESTING.","authors":"Wen-Xin Zhou, Koushiki Bose, Jianqing Fan, Han Liu","doi":"10.1214/17-AOS1606","DOIUrl":"10.1214/17-AOS1606","url":null,"abstract":"<p><p>Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber in 1973 [<i>Ann. Statist.</i><b>1</b> (1973) 799-821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to sample size, dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest, the Berry-Esseen inequality and Cramér-type moderate deviation. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. It is shown that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets are also provided to back up our theory.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6133288/pdf/nihms926033.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36491731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In Learn-As-You-Go (LAGO) adaptive studies, the intervention is a complex multicomponent package that is adapted in stages during the study based on past outcome data. This design formalizes standard practice in public health intervention studies. The aim is to identify an effective intervention package while minimizing its cost. In LAGO study data, the interventions in later stages depend on the outcomes of earlier stages, which violates the assumptions of standard statistical theory. We develop an estimator for the intervention effects and prove its consistency and asymptotic normality using a novel coupling argument, thereby ensuring the validity of the test of the hypothesis of no overall intervention effect. We also develop a confidence set for the optimal intervention package and confidence bands for the success probabilities under alternative package compositions. We illustrate our methods with the BetterBirth Study, which aimed to improve maternal and neonatal outcomes among 157,689 births in Uttar Pradesh, India, through a multicomponent intervention package.
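A highly simplified sketch of the final-stage decision problem: given a fitted outcome model, pick the cheapest package whose estimated success probability clears the goal. Everything here (the logistic model, its coefficients, the cost vector, the goal) is invented for illustration; the paper's estimator and its inference for this optimum are not reproduced.

```python
import numpy as np
from itertools import product

# Hypothetical fitted model: P(success | package x) for two continuous
# components (say, coaching visits and supply level).  Coefficients,
# costs, and the goal are all invented for illustration.
beta0, beta = -2.0, np.array([0.35, 0.8])

def p_success(x):
    return 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))

cost = np.array([10.0, 25.0])   # per-unit component costs
goal = 0.80                     # required success probability

# Cheapest package on a grid whose estimated success probability
# clears the goal.
candidates = [np.array(c) for c in
              product(np.linspace(0, 10, 41), np.linspace(0, 5, 21))]
feasible = [c for c in candidates if p_success(c) >= goal]
best = min(feasible, key=lambda c: c @ cost)
print("optimal package:", best, "cost:", best @ cost)
```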
{"title":"ANALYSIS OF \"LEARN-AS-YOU-GO\" (LAGO) STUDIES.","authors":"D. Nevo, J. Lok, D. Spiegelman","doi":"10.1214/20-AOS1978","DOIUrl":"https://doi.org/10.1214/20-AOS1978","url":null,"abstract":"In Learn-As-you-GO (LAGO) adaptive studies, the intervention is a complex multicomponent package, and is adapted in stages during the study based on past outcome data. This design formalizes standard practice in public health intervention studies. An effective intervention package is sought, while minimizing intervention package cost. In LAGO study data, the interventions in later stages depend upon the outcomes in the previous stages, violating standard statistical theory. We develop an estimator for the intervention effects, and prove consistency and asymptotic normality using a novel coupling argument, ensuring the validity of the test for the hypothesis of no overall intervention effect. We develop a confidence set for the optimal intervention package and confidence bands for the success probabilities under alternative package compositions. We illustrate our methods in the BetterBirth Study, which aimed to improve maternal and neonatal outcomes among 157,689 births in Uttar Pradesh, India through a multicomponent intervention package.","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43532425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2018-08-01Epub Date: 2018-06-27DOI: 10.1214/17-AOS1588
Jianqing Fan, Han Liu, Weichen Wang
We propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on the approximate factor model. A set of high-level sufficient conditions under which the procedure achieves optimal rates of convergence in various matrix norms is established to better understand how POET works. This framework allows us to recover existing results for sub-Gaussian data in a more transparent way that depends only on the concentration properties of the sample covariance matrix. As a new theoretical contribution, the framework for the first time allows us to exploit the conditional sparsity of the covariance structure for heavy-tailed data. In particular, for elliptical distributions, we propose a robust estimator based on the marginal and spatial Kendall's tau that satisfies these conditions. In addition, we study conditional graphical models under the same framework. The technical tools developed in this paper are of general interest for high-dimensional principal component analysis. Thorough numerical results are provided to back up the developed theory.
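A sketch of the robust pipeline on the correlation scale (variance rescaling omitted for brevity): marginal Kendall's tau is mapped to a correlation via sin(πτ/2), which is consistent under elliptical models, and a POET-style step keeps the leading eigenstructure and soft-thresholds the remainder. The paper's estimator also uses spatial Kendall's tau and specific rates; this shows only the general shape, and it requires scipy.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_correlation(X):
    # Robust correlation: R[j, k] = sin(pi/2 * tau_jk), which is consistent
    # for the Pearson correlation under elliptical distributions.
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(0.5 * np.pi * tau)
    return R

def poet(Sigma, n_factors, thresh):
    # POET-style step: keep the leading eigen-structure, soft-threshold
    # the remainder off the diagonal, keep its diagonal intact.
    vals, vecs = np.linalg.eigh(Sigma)
    idx = np.argsort(vals)[::-1][:n_factors]
    low_rank = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    resid = Sigma - low_rank
    off = np.sign(resid) * np.maximum(np.abs(resid) - thresh, 0.0)
    np.fill_diagonal(off, np.diag(resid))
    return low_rank + off

rng = np.random.default_rng(6)
n, p, K = 400, 30, 2
F, B = rng.normal(size=(n, K)), rng.normal(size=(p, K))
X = F @ B.T + rng.standard_t(df=3, size=(n, p))  # factors + heavy-tailed noise
Sigma_hat = poet(kendall_correlation(X), n_factors=K, thresh=0.1)
print(Sigma_hat.shape)
```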
{"title":"LARGE COVARIANCE ESTIMATION THROUGH ELLIPTICAL FACTOR MODELS.","authors":"Jianqing Fan, Han Liu, Weichen Wang","doi":"10.1214/17-AOS1588","DOIUrl":"10.1214/17-AOS1588","url":null,"abstract":"<p><p>We propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on the approximate factor model. A set of high level sufficient conditions for the procedure to achieve optimal rates of convergence under different matrix norms is established to better understand how POET works. Such a framework allows us to recover existing results for sub-Gaussian data in a more transparent way that only depends on the concentration properties of the sample covariance matrix. As a new theoretical contribution, for the first time, such a framework allows us to exploit conditional sparsity covariance structure for the heavy-tailed data. In particular, for the elliptical distribution, we propose a robust estimator based on the marginal and spatial Kendall's tau to satisfy these conditions. In addition, we study conditional graphical model under the same framework. The technical tools developed in this paper are of general interest to high dimensional principal component analysis. Thorough numerical results are also provided to back up the developed theory.</p>","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/17-AOS1588","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36490928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2018-08-01Epub Date: 2018-06-27DOI: 10.1214/17-AOS1592
Vu Dinh, Lam Si Tung Ho, Marc A Suchard, Frederick A Matsen
It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
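In the simplest case the penalty is transparent: for two taxa the tree is a single edge of length t, the BHV geodesic distance to a guide tree with the same topology reduces to |t - t0|, and the method becomes one-dimensional penalized maximum likelihood. A toy Jukes-Cantor version with invented data; larger lambda pulls the estimate toward the guide tree.

```python
import numpy as np

# Two taxa, one edge of length t.  Under Jukes-Cantor, the probability
# that a site differs is p(t) = 3/4 * (1 - exp(-4t/3)); the penalty is
# lambda * (t - t0)**2, the squared (here, one-dimensional) BHV distance.

def neg_penalized_loglik(t, k, m, t0, lam):
    p = 0.75 * (1.0 - np.exp(-4.0 * t / 3.0))
    loglik = k * np.log(p) + (m - k) * np.log(1.0 - p)
    return -loglik + lam * (t - t0) ** 2

m, k = 300, 45          # sites, observed differing sites (toy data)
t0 = 0.12               # guide ("species") tree branch length
ts = np.linspace(1e-4, 1.0, 5000)
for lam in (0.0, 50.0, 500.0):
    vals = [neg_penalized_loglik(t, k, m, t0, lam) for t in ts]
    print(lam, round(ts[int(np.argmin(vals))], 4))
```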
{"title":"Consistency and convergence rate of phylogenetic inference via regularization.","authors":"Vu Dinh, Lam Si Tung Ho, Marc A Suchard, Frederick A Matsen","doi":"10.1214/17-AOS1592","DOIUrl":"https://doi.org/10.1214/17-AOS1592","url":null,"abstract":"It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct \"gene tree.\" Although the gene tree may deviate from the \"species tree\" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is \"adaptive fast converging,\" meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.","PeriodicalId":8032,"journal":{"name":"Annals of Statistics","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/17-AOS1592","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36592809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}