Statistical Science最新文献

英文中文

Random Matrix Theory and Its Applications 随机矩阵理论及其应用

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-08-01 DOI: 10.1142/9789814273121

A. Izenman

This article reviews the important ideas behind random matrix theory (RMT), which has become a major tool in a variety of disciplines, including mathematical physics, number theory, combinatorics and multivariate statistical analysis. Much of the theory involves ensembles of random matrices that are governed by some probability distribution. Examples include Gaussian ensembles and Wishart–Laguerre ensembles. Interest has centered on studying the spectrum of random matrices, especially the extreme eigenvalues, suitably normalized, for a single Wishart matrix and for two Wishart matrices, for finite and infinite sample sizes in the real and complex cases. The Tracy–Widom Laws for the probability distribution of a normalized largest eigenvalue of a random matrix have become very prominent in RMT. Limiting probability distributions of eigenvalues of a certain random matrix lead to Wigner’s Semicircle Law and Marc˘enko–Pastur’s Quarter-Circle Law. Several applications of these results in RMT are described in this article.

本文回顾了随机矩阵理论(RMT)背后的重要思想，它已经成为各种学科的主要工具，包括数学物理，数论，组合学和多元统计分析。许多理论涉及由某种概率分布支配的随机矩阵的集合。例子包括高斯系综和Wishart-Laguerre系综。兴趣集中在研究随机矩阵的谱，特别是对于单个Wishart矩阵和两个Wishart矩阵，在真实和复杂情况下的有限和无限样本量，适当归一化的极端特征值。随机矩阵的归一化最大特征值的概率分布的tracey - wisdom定律在RMT中已经变得非常突出。某随机矩阵特征值的极限概率分布导致Wigner的半圆定律和Marc × × enko-Pastur的四分之一圆定律。本文描述了这些结果在RMT中的几个应用。

引用次数: 2

Khinchin’s 1929 Paper on Von Mises’ Frequency Theory of Probability 钦钦1929年关于冯·米塞斯概率频率理论的论文

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-08-01 DOI: 10.1214/20-sts798

L. Verburgt

In 1929, a few years prior to his colleague Kolmogorov’s Grundbegriffe, the leading Russian probabilist Khinchin published a paper in which he commented on the foundational ambitions of von Mises’ frequency theory of probability. This brief introduction provides background and context for the English translation of Khinchin’s historically revealing paper, published as an online supplement.

1929年，在他的同事Kolmogorov的Grundbergriffe之前几年，俄罗斯著名的概率学家Khinchin发表了一篇论文，他在论文中评论了von Mises的概率频率理论的基本野心。这篇简介为钦钦的历史启示论文的英文翻译提供了背景和背景，该论文作为在线增刊发表。

引用次数: 0

Statistical Modeling for Practical Pooled Testing During the COVID-19 Pandemic 新冠肺炎大流行期间实际汇集测试的统计模型

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-07-12 DOI: 10.1214/22-sts857

S. Comess, H. Wang, S. Holmes, Claire Donnat

Pooled testing offers an efficient solution to the unprecedented testing demands of the COVID-19 pandemic, although with potentially lower sensitivity and increased costs to implementation in some settings. Assessments of this trade-off typically assume pooled specimens are independent and identically distributed. Yet, in the context of COVID-19, these assumptions are often violated: testing done on networks (housemates, spouses, co-workers) captures correlated individuals, while infection risk varies substantially across time, place and individuals. Neglecting dependencies and heterogeneity may bias established optimality grids and induce a sub-optimal implementation of the procedure. As a lesson learned from this pandemic, this paper highlights the necessity of integrating field sampling information with statistical modeling to efficiently optimize pooled testing. Using real data, we show that (a) greater gains can be achieved at low logistical cost by exploiting natural correlations (non-independence) between samples -- allowing improvements in sensitivity and efficiency of up to 30% and 90% respectively;and (b) these gains are robust despite substantial heterogeneity across pools (non-identical). Our modeling results complement and extend the observations of Barak et al (2021) who report an empirical sensitivity well beyond expectations. Finally, we provide an interactive tool for selecting an optimal pool size using contextual information

汇集检测为新冠肺炎大流行前所未有的检测需求提供了有效的解决方案，尽管在某些情况下可能会降低灵敏度并增加实施成本。这种权衡的评估通常假设汇集的样本是独立的且分布相同。然而，在新冠肺炎的背景下，这些假设往往被违反：在网络（室友、配偶、同事）上进行的测试捕捉到了相关的个人，而感染风险因时间、地点和个人的不同而有很大差异。忽略依赖性和异质性可能会使已建立的最优性网格产生偏差，并导致程序的次优实现。作为从这场疫情中吸取的教训，本文强调了将现场采样信息与统计建模相结合以有效优化混合测试的必要性。使用真实数据，我们表明：（a）通过利用样本之间的自然相关性（非独立性），可以在低物流成本下获得更大的收益——灵敏度和效率分别提高30%和90%；以及（b）尽管池之间存在显著的异质性（不完全相同），但这些收益是稳健的。我们的建模结果补充和扩展了Barak等人（2021）的观察结果，他们报告了远远超出预期的经验敏感性。最后，我们提供了一个交互式工具，用于使用上下文信息选择最佳池大小

{"title":"Statistical Modeling for Practical Pooled Testing During the COVID-19 Pandemic","authors":"S. Comess, H. Wang, S. Holmes, Claire Donnat","doi":"10.1214/22-sts857","DOIUrl":"https://doi.org/10.1214/22-sts857","url":null,"abstract":"Pooled testing offers an efficient solution to the unprecedented testing demands of the COVID-19 pandemic, although with potentially lower sensitivity and increased costs to implementation in some settings. Assessments of this trade-off typically assume pooled specimens are independent and identically distributed. Yet, in the context of COVID-19, these assumptions are often violated: testing done on networks (housemates, spouses, co-workers) captures correlated individuals, while infection risk varies substantially across time, place and individuals. Neglecting dependencies and heterogeneity may bias established optimality grids and induce a sub-optimal implementation of the procedure. As a lesson learned from this pandemic, this paper highlights the necessity of integrating field sampling information with statistical modeling to efficiently optimize pooled testing. Using real data, we show that (a) greater gains can be achieved at low logistical cost by exploiting natural correlations (non-independence) between samples -- allowing improvements in sensitivity and efficiency of up to 30% and 90% respectively;and (b) these gains are robust despite substantial heterogeneity across pools (non-identical). Our modeling results complement and extend the observations of Barak et al (2021) who report an empirical sensitivity well beyond expectations. Finally, we provide an interactive tool for selecting an optimal pool size using contextual information","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2021-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46125680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Randomization-Based Test for Censored Outcomes: A New Look at the Logrank Test 基于随机化的检查结果检验：Logrank检验的新视角

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-07-06 DOI: 10.1214/22-sts851

Xinran Li, Dylan S. Small

Two-sample tests with censored outcomes are a classical topic in statistics with wide use even in cutting edge applications. There are at least two modes of inference used to justify two-sample tests. One is usual superpopulation inference assuming that units are independent and identically distributed (i.i.d.) samples from some superpopulation; the other is finite population inference that relies on the random assignments of units into different groups. When randomization is actually implemented, the latter has the advantage of avoiding distributional assumptions on the outcomes. In this paper, we focus on finite population inference for censored outcomes, which has been less explored in the literature. Moreover, we allow the censoring time to depend on treatment assignment, under which exact permutation inference is unachievable. We find that, surprisingly, the usual logrank test can also be justified by randomization. Specifically, under a Bernoulli randomized experiment with non-informative i.i.d. censoring, the logrank test is asymptotically valid for testing Fisher’s null hypothesis of no treatment effect on any unit. The asymptotic validity of the logrank test does not require any distributional assumption on the potential event times. We further extend the theory to the stratified logrank test, which is useful for randomized block designs and when censoring mechanisms vary across strata. In sum, the developed theory for the logrank test from finite population inference supplements its classical theory from usual superpopulation inference, and helps provide a broader justification for the logrank test.

具有截尾结果的两个样本测试是统计学中的一个经典话题，即使在前沿应用中也有广泛的应用。至少有两种推理模式用于证明两个样本测试的合理性。一种是通常的超种群推断，假设单元是来自某个超种群的独立且相同分布（i.i.d.）的样本；另一种是有限总体推理，它依赖于将单元随机分配到不同的组中。当实际实施随机化时，后者的优点是避免了对结果的分布假设。在本文中，我们关注的是审查结果的有限总体推断，这在文献中很少被探索。此外，我们允许审查时间取决于处理分配，在这种情况下，无法实现精确的排列推理。我们发现，令人惊讶的是，通常的logrank检验也可以通过随机化来证明。具体来说，在具有非信息性i.i.d.截尾的伯努利随机实验下，logrank检验对于检验Fisher对任何单位都没有治疗效果的零假设是渐近有效的。logrank检验的渐近有效性不需要对潜在事件时间进行任何分布假设。我们进一步将该理论扩展到分层logrank检验，这对于随机块设计以及当审查机制在不同层之间变化时是有用的。总之，有限总体推理的logrank检验的发展理论补充了通常超总体推理的经典理论，并有助于为logrank测试提供更广泛的理由。

{"title":"Randomization-Based Test for Censored Outcomes: A New Look at the Logrank Test","authors":"Xinran Li, Dylan S. Small","doi":"10.1214/22-sts851","DOIUrl":"https://doi.org/10.1214/22-sts851","url":null,"abstract":"Two-sample tests with censored outcomes are a classical topic in statistics with wide use even in cutting edge applications. There are at least two modes of inference used to justify two-sample tests. One is usual superpopulation inference assuming that units are independent and identically distributed (i.i.d.) samples from some superpopulation; the other is finite population inference that relies on the random assignments of units into different groups. When randomization is actually implemented, the latter has the advantage of avoiding distributional assumptions on the outcomes. In this paper, we focus on finite population inference for censored outcomes, which has been less explored in the literature. Moreover, we allow the censoring time to depend on treatment assignment, under which exact permutation inference is unachievable. We find that, surprisingly, the usual logrank test can also be justified by randomization. Specifically, under a Bernoulli randomized experiment with non-informative i.i.d. censoring, the logrank test is asymptotically valid for testing Fisher’s null hypothesis of no treatment effect on any unit. The asymptotic validity of the logrank test does not require any distributional assumption on the potential event times. We further extend the theory to the stratified logrank test, which is useful for randomized block designs and when censoring mechanisms vary across strata. In sum, the developed theory for the logrank test from finite population inference supplements its classical theory from usual superpopulation inference, and helps provide a broader justification for the logrank test.","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47475631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Stein’s Method Meets Computational Statistics: A Review of Some Recent Developments 斯坦因的方法与计算统计:一些最新发展的回顾

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-07 DOI: 10.1214/22-sts863

Andreas Anastasiou, A. Barp, F. Briol, B. Ebner, Robert E. Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, A. Gretton, Christophe Ley, Qiang Liu, Lester W. Mackey, C. Oates, G. Reinert, Yvik Swan

Stein's method compares probability distributions through the study of a class of linear operators called Stein operators. While mainly studied in probability and used to underpin theoretical statistics, Stein's method has led to significant advances in computational statistics in recent years. The goal of this survey is to bring together some of these recent developments and, in doing so, to stimulate further research into the successful field of Stein's method and statistics. The topics we discuss include tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, parameter estimation and goodness-of-fit testing.

斯坦的方法通过研究一类叫做斯坦算子的线性算子来比较概率分布。虽然主要研究概率论并用于理论统计，但斯坦的方法近年来在计算统计方面取得了重大进展。这项调查的目的是汇集这些最新的发展，并在这样做的过程中，刺激对斯坦的方法和统计的成功领域的进一步研究。我们讨论的主题包括基准测试和比较采样方法的工具，如近似马尔可夫链蒙特卡罗，采样方法的确定性替代方案，控制变量技术，参数估计和拟合优度测试。

引用次数: 17

The Costs and Benefits of Uniformly Valid Causal Inference with High-Dimensional Nuisance Parameters 具有高维妨害参数的一致有效因果推理的成本与收益

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-05 DOI: 10.1214/21-sts843

Niloofar Moosavi, J. Haggstrom, X. Luna

Important advances have recently been achieved in developing procedures yielding uniformly valid inference for a low dimensional causal parameter when high-dimensional nuisance models must be estimated. In this paper, we review the literature on uniformly valid causal inference and discuss the costs and benefits of using uniformly valid inference procedures. Naive estimation strategies based on regularisation, machine learning, or a preliminary model selection stage for the nuisance models have finite sample distributions which are badly approximated by their asymptotic distributions. To solve this serious problem, estimators which converge uniformly in distribution over a class of data generating mechanisms have been proposed in the literature. In order to obtain uniformly valid results in high-dimensional situations, sparsity conditions for the nuisance models need typically to be made, although a double robustness property holds, whereby if one of the nuisance model is more sparse, the other nuisance model is allowed to be less sparse. While uniformly valid inference is a highly desirable property, uniformly valid procedures pay a high price in terms of inflated variability. Our discussion of this dilemma is illustrated by the study of a double-selection outcome regression estimator, which we show is uniformly asymptotically unbiased, but is less variable than uniformly valid estimators in the numerical experiments conducted.

最近，在开发程序方面取得了重要进展，当必须估计高维滋扰模型时，可以对低维因果参数进行一致有效的推断。在本文中，我们回顾了一致有效因果推理的文献，并讨论了使用一致有效推理程序的成本和收益。基于正则化、机器学习或滋扰模型的初步模型选择阶段的天真估计策略具有有限的样本分布，其渐近分布非常接近。为了解决这个严重的问题，文献中提出了在一类数据生成机制上分布一致收敛的估计量。为了在高维情况下获得一致有效的结果，通常需要为滋扰模型设定稀疏性条件，尽管具有双重鲁棒性，由此，如果滋扰模型中的一个更稀疏，则允许另一个滋扰模型不那么稀疏。虽然一致有效推理是一种非常理想的性质，但一致有效程序在膨胀的可变性方面付出了高昂的代价。我们对这一困境的讨论通过对双重选择结果回归估计量的研究来说明，我们证明了该估计量是一致渐近无偏的，但与数值实验中的一致有效估计量相比，其变量较小。

{"title":"The Costs and Benefits of Uniformly Valid Causal Inference with High-Dimensional Nuisance Parameters","authors":"Niloofar Moosavi, J. Haggstrom, X. Luna","doi":"10.1214/21-sts843","DOIUrl":"https://doi.org/10.1214/21-sts843","url":null,"abstract":"Important advances have recently been achieved in developing procedures yielding uniformly valid inference for a low dimensional causal parameter when high-dimensional nuisance models must be estimated. In this paper, we review the literature on uniformly valid causal inference and discuss the costs and benefits of using uniformly valid inference procedures. Naive estimation strategies based on regularisation, machine learning, or a preliminary model selection stage for the nuisance models have finite sample distributions which are badly approximated by their asymptotic distributions. To solve this serious problem, estimators which converge uniformly in distribution over a class of data generating mechanisms have been proposed in the literature. In order to obtain uniformly valid results in high-dimensional situations, sparsity conditions for the nuisance models need typically to be made, although a double robustness property holds, whereby if one of the nuisance model is more sparse, the other nuisance model is allowed to be less sparse. While uniformly valid inference is a highly desirable property, uniformly valid procedures pay a high price in terms of inflated variability. Our discussion of this dilemma is illustrated by the study of a double-selection outcome regression estimator, which we show is uniformly asymptotically unbiased, but is less variable than uniformly valid estimators in the numerical experiments conducted.","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2021-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48898490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Comment: Settle the Unsettling: An Inferential Models Perspective 评论：解决不安：一个推理模型的视角

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-01 DOI: 10.1214/21-STS765B

Chuanhai Liu, Ryan Martin

Here, we demonstrate that the inferential model (IM) framework, unlike the updating rules that Gong and Meng show to be unreliable, provides valid and efficient inferences/prediction while not being susceptible to sure loss. In this sense, the IM framework settles what Gong and Meng characterized as “unsettling.”

在这里，我们证明了推理模型（IM）框架与龚和孟所展示的不可靠的更新规则不同，它提供了有效的推断/预测，同时不易受到确定性损失的影响。从这个意义上说，IM框架解决了龚和孟所说的“令人不安”

引用次数: 8

The Box–Cox Transformation: Review and Extensions Box-Cox转换:回顾和扩展

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-01 DOI: 10.1214/20-STS778

A. Atkinson, M. Riani, A. Corbellini

The Box-Cox power transformation family for non-negative responses in linear models has a long and interesting history in both statistical practice and theory, which we summarize. The relationship between generalized linear models and log transformed data is illustrated. Extensions investigated include the transform both sides model and the Yeo-Johnson transformation for observations that can be positive or negative. The paper also describes an extended Yeo-Johnson transformation that allows positive and negative responses to have different power transformations. Analyses of data show this to be necessary. Robustness enters in the fan plot for which the forward search provides an ordering of the data. Plausible transformations are checked with an extended fan plot. These procedures are used to compare parametric power transformations with nonparametric transformations produced by smoothing.

线性模型中非负响应的Box-Cox功率变换族在统计实践和理论方面都有着悠久而有趣的历史，我们对此进行了总结。说明了广义线性模型与对数变换数据之间的关系。所研究的扩展包括变换两侧模型和杨-约翰逊变换，可以是正的或负的观测。本文还描述了一个允许正响应和负响应具有不同功率变换的扩展Yeo-Johnson变换。数据分析表明，这是必要的。鲁棒性进入扇形图，其中正向搜索提供了数据的排序。合理的转换用扩展的扇形图进行检验。这些程序用于比较参数幂变换与由平滑产生的非参数变换。

引用次数: 44

A selective overview of deep learning. 深度学习的选择性概述。

IF 5.7 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-01 Epub Date: 2020-04-19 DOI: 10.1214/20-sts783

Jianqing Fan, Cong Ma, Yiqiao Zhong

Deep learning has achieved tremendous success in recent years. In simple words, deep learning uses the composition of many nonlinear functions to model the complex dependency between input features and labels. While neural networks have a long history, recent advances have greatly improved their performance in computer vision, natural language processing, etc. From the statistical and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of deep learning, compared with classical methods? What are the theoretical foundations of deep learning? To answer these questions, we introduce common neural network models (e.g., convolutional neural nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new characteristics of deep learning (including depth and over-parametrization) and explain their practical and theoretical benefits. We also sample recent results on theories of deep learning, many of which are only suggestive. While a complete understanding of deep learning remains elusive, we hope that our perspectives and discussions serve as a stimulus for new statistical research.

近年来，深度学习取得了巨大成功。简单地说，深度学习使用许多非线性函数的组合来模拟输入特征和标签之间的复杂依赖关系。虽然神经网络有着悠久的历史，但近年来的进步大大提高了其在计算机视觉、自然语言处理等方面的性能。从统计学和科学的角度看，我们自然会问：什么是深度学习？与经典方法相比，深度学习有哪些新特点？深度学习的理论基础是什么？为了回答这些问题，我们从统计学的角度介绍了常见的神经网络模型（如卷积神经网络、递归神经网络、生成对抗网络）和训练技术（如随机梯度下降、丢弃、批量归一化）。在此过程中，我们强调了深度学习的新特点（包括深度和过参数化），并解释了它们在实践和理论上的益处。我们还列举了有关深度学习理论的最新成果，其中许多成果都只是建议性的。虽然对深度学习的全面理解仍遥不可及，但我们希望我们的观点和讨论能对新的统计研究起到激励作用。

{"title":"A selective overview of deep learning.","authors":"Jianqing Fan, Cong Ma, Yiqiao Zhong","doi":"10.1214/20-sts783","DOIUrl":"10.1214/20-sts783","url":null,"abstract":"Deep learning has achieved tremendous success in recent years. In simple words, deep learning uses the composition of many nonlinear functions to model the complex dependency between input features and labels. While neural networks have a long history, recent advances have greatly improved their performance in computer vision, natural language processing, etc. From the statistical and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of deep learning, compared with classical methods? What are the theoretical foundations of deep learning? To answer these questions, we introduce common neural network models (e.g., convolutional neural nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new characteristics of deep learning (including depth and over-parametrization) and explain their practical and theoretical benefits. We also sample recent results on theories of deep learning, many of which are only suggestive. While a complete understanding of deep learning remains elusive, we hope that our perspectives and discussions serve as a stimulus for new statistical research.","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":"36 2","pages":"264-290"},"PeriodicalIF":5.7,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8300482/pdf/nihms-1639566.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39219267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Robust high dimensional factor models with applications to statistical machine learning. 鲁棒高维因子模型及其在统计机器学习中的应用。

IF 3.9 1区数学 Q1 STATISTICS & PROBABILITY

Statistical Science

Pub Date : 2021-05-01 Epub Date: 2021-04-19 DOI: 10.1214/20-sts785

Jianqing Fan, Kaizheng Wang, Yiqiao Zhong, Ziwei Zhu

Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements that arise frequently from various applications from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges. This paper gives a selective overview on recent advance on high-dimensional factor models and their applications to statistics including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.

因子模型是一类强大的统计模型，已被广泛用于处理从基因组学、神经科学到经济学和金融学的各种应用中经常出现的依赖性测量。随着数据的收集规模不断扩大，统计机器学习面临着一些新的挑战：高维度、观测变量之间的强依赖性、重尾变量和异质性。高维稳健因子分析是克服这些挑战的强大工具。本文选择性地综述了高维因子模型的最新进展及其在统计学中的应用，包括因子调整稳健模型选择（FarmSelect）和因子调整稳健多重检验（FarmTest）。我们表明，经典方法，特别是主成分分析（PCA），可以针对许多新问题进行调整，并为统计估计和推理提供强大的工具。我们强调了主成分分析及其与矩阵扰动理论、稳健统计、随机投影、错误发现率等的联系，并通过几个应用程序说明了这些领域的见解如何为现代挑战提供解决方案。我们还提出了因子模型与流行的统计学习问题之间的深远联系，包括网络分析和低秩矩阵恢复。

{"title":"Robust high dimensional factor models with applications to statistical machine learning.","authors":"Jianqing Fan, Kaizheng Wang, Yiqiao Zhong, Ziwei Zhu","doi":"10.1214/20-sts785","DOIUrl":"10.1214/20-sts785","url":null,"abstract":"Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements that arise frequently from various applications from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges. This paper gives a selective overview on recent advance on high-dimensional factor models and their applications to statistics including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":"36 2","pages":"303-327"},"PeriodicalIF":3.9,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8315369/pdf/nihms-1639567.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39254018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Statistical Science

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀