Pub Date: 2022-03-13. DOI: 10.1080/24754269.2022.2048445
Xueping Chen, Jianzhong Liu, Jiandong Chen
The orthogonal matching pursuit (OMP) algorithm is a classical greedy algorithm widely used in compressed sensing. In this paper, by exploiting the Wielandt inequality and some properties of orthogonal projection matrices, we obtain a new bound on the number of iterations required for the OMP algorithm to exactly recover sparse signals, which improves significantly upon the best results known to us.
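For readers unfamiliar with the algorithm itself, here is a minimal sketch of the standard OMP iteration (greedy column selection followed by an orthogonal projection onto the selected support). The dictionary, sparsity level and stopping rule are illustrative assumptions only; the sketch does not implement the iteration bound derived in the paper.

```python
import numpy as np

def omp(A, y, k):
    """Standard orthogonal matching pursuit: greedily select up to k columns of A
    and re-fit y on the selected columns by least squares at every step."""
    n_features = A.shape[1]
    residual = y.copy()
    support = []
    x_hat = np.zeros(n_features)
    for _ in range(k):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # orthogonal projection: least-squares fit on the selected support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat[support] = coef
    return x_hat, support

# tiny usage example with a synthetic 5-sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256))
A /= np.linalg.norm(A, axis=0)                    # unit-norm columns, as usual in CS
x_true = np.zeros(256)
x_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
y = A @ x_true
x_hat, S = omp(A, y, k=5)
print(np.allclose(x_hat, x_true, atol=1e-8))
```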
{"title":"A new result on recovery sparse signals using orthogonal matching pursuit","authors":"Xueping Chen, Jianzhong Liu, Jiandong Chen","doi":"10.1080/24754269.2022.2048445","DOIUrl":"https://doi.org/10.1080/24754269.2022.2048445","url":null,"abstract":"Orthogonal matching pursuit (OMP) algorithm is a classical greedy algorithm widely used in compressed sensing. In this paper, by exploiting the Wielandt inequality and some properties of orthogonal projection matrix, we obtained a new number of iterations required for the OMP algorithm to perform exact recovery of sparse signals, which improves significantly upon the latest results as we know.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"220 - 226"},"PeriodicalIF":0.5,"publicationDate":"2022-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43660484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-17. DOI: 10.1080/24754269.2022.2037201
J. Qin, Yukun Liu, Pengfei Li
In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or no efficiency loss compared with the corresponding methods based on full individual-level data. Finally, we review some recently developed big data analysis methods, including communication-efficient distributed approaches, renewal estimation, and incremental inference, as examples of the latest developments in methods using calibration information.
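As one concrete example of using only calibration (summary) information, the sketch below implements the classical fixed-effect, inverse-variance meta-analysis estimator that pools study-level estimates and standard errors; the study numbers are made up for illustration and the function name is ours.

```python
import numpy as np

def fixed_effect_meta(estimates, std_errors):
    """Fixed-effect meta-analysis: inverse-variance weighted average of
    study-level estimates, plus the standard error of the pooled estimate."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    pooled = np.sum(weights * estimates) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

# three hypothetical studies reporting only summary statistics
est, se = fixed_effect_meta([0.42, 0.35, 0.50], [0.10, 0.08, 0.15])
print(f"pooled estimate {est:.3f} (SE {se:.3f})")
```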
{"title":"A selective review of statistical methods using calibration information from similar studies","authors":"J. Qin, Yukun Liu, Pengfei Li","doi":"10.1080/24754269.2022.2037201","DOIUrl":"https://doi.org/10.1080/24754269.2022.2037201","url":null,"abstract":"In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or nearly no efficiency loss compared with the corresponding methods based on all-individual data. Finally, we review some recently developed big data analysis methods including communication-efficient distributed approaches, renewal estimation, and incremental inference as examples of the latest developments in methods using calibration information.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"175 - 190"},"PeriodicalIF":0.5,"publicationDate":"2022-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42114372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-17. DOI: 10.1080/24754269.2022.2037204
Rongjie Jiang, Liming Wang, Yang Bai
In this paper, we study optimal model averaging estimators of regression coefficients in multinomial logit models, which are commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine the averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criteria, model averaging methods, and some other related methods in terms of KL loss and mean squared forecast error. Finally, a website phishing dataset is used to illustrate the proposed method.
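A hedged sketch of the general model-averaging mechanics the abstract describes: candidate multinomial logit models are fitted, and simplex weights are chosen by minimizing an empirical KL-type (negative log-likelihood) loss of the averaged class probabilities. The candidate models, data and optimizer below are illustrative assumptions, not the exact criterion or theory of the paper.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

def averaging_weights(prob_list, y):
    """Choose simplex weights for candidate models by minimizing the negative
    log-likelihood (an empirical KL-type loss) of the averaged class probabilities."""
    M, n = len(prob_list), len(y)

    def neg_loglik(v):
        w = np.exp(v) / np.exp(v).sum()            # softmax keeps w on the simplex
        p = sum(wm * Pm for wm, Pm in zip(w, prob_list))
        return -np.log(p[np.arange(n), y] + 1e-12).mean()

    res = minimize(neg_loglik, np.zeros(M), method="BFGS")
    return np.exp(res.x) / np.exp(res.x).sum()

# toy illustration: two candidate multinomial logit models using different covariates
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4))
y = rng.integers(0, 3, size=300)
cands = [LogisticRegression(max_iter=1000).fit(X[:, :2], y),
         LogisticRegression(max_iter=1000).fit(X, y)]
probs = [cands[0].predict_proba(X[:, :2]), cands[1].predict_proba(X)]
w = averaging_weights(probs, y)
p_avg = w[0] * probs[0] + w[1] * probs[1]          # model-averaged probabilities
print("weights:", np.round(w, 3))
```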
{"title":"Optimal model averaging estimator for multinomial logit models","authors":"Rongjie Jiang, Liming Wang, Yang Bai","doi":"10.1080/24754269.2022.2037204","DOIUrl":"https://doi.org/10.1080/24754269.2022.2037204","url":null,"abstract":"In this paper, we study optimal model averaging estimators of regression coefficients in a multinomial logit model, which is commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criterions, model averaging methods, as well as some other related methods in terms of the KL loss and mean squared forecast error. Finally, the website phishing data is used to illustrate the proposed method.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"227 - 240"},"PeriodicalIF":0.5,"publicationDate":"2022-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41982683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuan Gao(a), Weidong Liu(b), Hansheng Wang(c), Xiaozhou Wang(a), Yibo Yan(a) and Riquan Zhang(a). (a) School of Statistics and Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, East China Normal University, Shanghai, People’s Republic of China; (b) School of Mathematical Sciences – School of Life Sciences and Biotechnology – MOE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; (c) Guanghua School of Management, Peking University, Beijing, People’s Republic of China
{"title":"Rejoinder on ‘A review of distributed statistical inference’","authors":"Yuan Gao, Weidong Liu, Hansheng Wang, Xiaozhou Wang, Yibo Yan, Riquan Zhang","doi":"10.1080/24754269.2022.2035304","DOIUrl":"https://doi.org/10.1080/24754269.2022.2035304","url":null,"abstract":"Yuan Gaoa, Weidong Liub, Hansheng Wangc, Xiaozhou Wanga, Yibo Yana and Riquan Zhanga aSchool of Statistics and Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, East China Normal University, Shanghai, People’s Republic of China; bSchool of Mathematical Sciences – School of Life Sciences and Biotechnology – MOE Key Lab of Artifcial Intelligence, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; cGuanghua School of Management, Peking University, Beijing, People’s Republic of China","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"111 - 113"},"PeriodicalIF":0.5,"publicationDate":"2022-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46555795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-04. DOI: 10.1080/24754269.2022.2105486
Chao-Qun Yuan, Yang Wu, Fang Fang
Fragmentary data are becoming more and more common in many areas, which brings big challenges to researchers and data analysts. Most existing methods for fragmentary data consider a continuous response, while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback–Leibler loss on the complete cases, and its asymptotic optimality is established. Empirical evidence from a simulation study and a real data analysis of Alzheimer's disease is presented.
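To illustrate the fragmentary-data setting, here is a small sketch, under assumed data and candidate models, of fitting candidate logistic (GLM) models on different covariate-availability patterns and choosing the averaging weight by minimizing the KL (log) loss on the complete cases; it is meant only to convey the structure, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical fragmentary data: covariate x2 is missing for part of the sample
rng = np.random.default_rng(2)
n = 500
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x1 + 0.5 * x2))))
x2_obs = x2.copy()
x2_obs[rng.random(n) < 0.4] = np.nan               # 40% of x2 is unobserved
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2_obs})

# candidate 1: x1 only, fitted on all cases; candidate 2: x1 + x2, fitted on complete cases
m1 = LogisticRegression().fit(df[["x1"]], df["y"])
cc = df.dropna()
m2 = LogisticRegression().fit(cc[["x1", "x2"]], cc["y"])

# averaging weight chosen on the complete cases by minimizing the KL (log) loss over a grid
p1 = m1.predict_proba(cc[["x1"]])[:, 1]
p2 = m2.predict_proba(cc[["x1", "x2"]])[:, 1]
grid = np.linspace(0, 1, 101)
loss = [-np.mean(cc["y"] * np.log(w * p1 + (1 - w) * p2 + 1e-12)
                 + (1 - cc["y"]) * np.log(1 - (w * p1 + (1 - w) * p2) + 1e-12))
        for w in grid]
w_hat = grid[int(np.argmin(loss))]
print("weight on the x1-only candidate model:", w_hat)
```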
{"title":"Model averaging for generalized linear models in fragmentary data prediction","authors":"Chao-Qun Yuan, Yang Wu, Fang Fang","doi":"10.1080/24754269.2022.2105486","DOIUrl":"https://doi.org/10.1080/24754269.2022.2105486","url":null,"abstract":"ABSTRACT Fragmentary data is becoming more and more popular in many areas which brings big challenges to researchers and data analysts. Most existing methods dealing with fragmentary data consider a continuous response while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback–Leibler loss in the completed cases and its asymptotic optimality is established. Empirical evidences from a simulation study and a real data analysis about Alzheimer disease are presented.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"344 - 352"},"PeriodicalIF":0.5,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48024239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-04. DOI: 10.1080/24754269.2022.2030107
Yang Yu, Guang Cheng
We congratulate the authors on an impressive team effort to comprehensively review various statistical estimation and inference methods in distributed frameworks. This paper is an excellent resource for anyone wishing to understand why distributed inference is important in the era of big data, what the challenges of conducting distributed inference instead of centralized inference are, and how statisticians propose solutions to overcome these challenges. First, we notice that this paper focuses mainly on distributed estimation, and we would like to point out several other works on distributed inference. For smooth loss functions, Jordan et al. (2018) established asymptotic normality for their multi-round distributed estimator, which yields two communication-efficient approaches to constructing confidence regions using a sandwiched covariance matrix. For non-smooth loss functions, Chen et al. (2021) similarly proposed a sandwich-type confidence interval based on the asymptotic normality of their distributed estimator. More generic inference approaches, such as the bootstrap, have also been studied in the massive data setting, including the distributed framework. The authors reviewed the Bag of Little Bootstraps (BLB) method proposed by Kleiner et al. (2014), which repeatedly resamples and refits the model at each local machine and finally aggregates the bootstrap statistics. Considering the huge computational cost of BLB, Sengupta et al. (2016) proposed the Subsampled Double Bootstrap (SDB) method, which has higher computational efficiency but requires a large number of local machines to maintain statistical accuracy. In addition to distributed samples, the dimensionality can also become large in the big data era, and in this case researchers may be more interested in simultaneous inference on multiple parameters. In the centralized setting, the bootstrap is one of the solutions to simultaneous inference problems (Zhang & Cheng, 2017). In a distributed framework where the dimensionality grows, Yu et al. (2020) proposed distributed bootstrap methods for simultaneous inference, which not only are efficient in terms of both communication and
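For concreteness, here is a minimal sketch of the BLB idea mentioned above, written for the standard error of a sample mean: each 'local' subset emulates size-n resamples through multinomial weights, and the per-subset bootstrap results are aggregated at the end. The subset and resample counts are arbitrary choices for illustration.

```python
import numpy as np

def blb_mean_se(x, n_subsets=10, n_boot=100, rng=None):
    """Bag of Little Bootstraps sketch for the standard error of the mean:
    each subset resamples up to the full sample size n via multinomial weights,
    and the per-subset bootstrap SEs are averaged at the end."""
    rng = np.random.default_rng(rng)
    n = len(x)
    subsets = np.array_split(rng.permutation(x), n_subsets)
    subset_ses = []
    for subset in subsets:
        b = len(subset)
        boot_means = []
        for _ in range(n_boot):
            # multinomial weights emulate a size-n resample without materializing it
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            boot_means.append(np.dot(counts, subset) / n)
        subset_ses.append(np.std(boot_means, ddof=1))
    return float(np.mean(subset_ses))

x = np.random.default_rng(3).standard_normal(100_000)
print(blb_mean_se(x))      # should be close to 1/sqrt(100000) ≈ 0.0032
```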
{"title":"Discussion on ‘A review of distributed statistical inference’","authors":"Yang Yu, Guang Cheng","doi":"10.1080/24754269.2022.2030107","DOIUrl":"https://doi.org/10.1080/24754269.2022.2030107","url":null,"abstract":"We congratulate the authors on an impressive team effort to comprehensively review various statistical estimation and inference methods in distributed frameworks. This paper is an excellent resource for anyone wishing to understand why distributed inference is important in the era of big data, what the challenges of conducting distributed inference instead of centralized inference are, and how statisticians propose solutions to overcome these challenges. First, we notice that this paper focuses mainly on distributed estimation, and we would like to point out several other works on distributed inference. For smooth loss functions, Jordan et al. (2018) established asymptotic normality for their multi-round distributed estimator, which yields two communication-efficient approaches to constructing confidence regions using a sandwiched covariance matrix. For non-smooth loss functions, Chen et al. (2021) similarly proposed a sandwich-type confidence interval based on the asymptotic normality of their distributed estimator. More generic inference approaches, such as bootstrap, have also been studied in the massive data setting including the distributed framework. The authors reviewed the Bag of Little Bootstraps (BLB) method proposed by Kleiner et al. (2014), which is to repeatedly resample and refit the model at each local machine and finally aggregate the bootstrap statistics. Considering the huge computational cost of BLB, Sengupta et al. (2016) proposed the Subsampled Double Bootstrap (SDB) method, which has higher computational efficiency but requires a large number of local machines to maintain statistical accuracy. In addition to distributed samples, the dimensionality can also become large in the big data era, and in this case researchers may be more interested in simultaneous inference onmultiple parameters. In the centralized setting, bootstrap is one of the solutions to the simultaneous inference problems (Zhang & Cheng, 2017). In a distributed framework where the dimensionality grows, Yu et al. (2020) proposed distributed bootstrap methods for simultaneous inference, which not only are efficient in terms of both communication and","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"102 - 103"},"PeriodicalIF":0.5,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48788970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-12. DOI: 10.1080/24754269.2021.2022998
Zheng-Chu Guo
Analysing and processing massive data is becoming ubiquitous in the era of big data. Distributed learning based on the divide-and-conquer approach has attracted increasing interest in recent years, since it not only reduces computational complexity and storage requirements, but also protects data privacy when data subsets are stored distributively on different local machines. This paper provides a comprehensive review of distributed learning with parametric models, nonparametric models and other popular models. As mentioned in the paper, nonparametric regression in reproducing kernel Hilbert spaces is popular in machine learning; however, theoretical analysis of distributed learning algorithms in reproducing kernel Hilbert spaces mainly focuses on the least-squares loss, and results for other loss functions are limited; it would be interesting to conduct error analysis for distributed regression with general loss functions and for distributed classification in reproducing kernel Hilbert spaces. In distributed learning, a standard assumption is that the data are independently and identically drawn from some unknown probability distribution; however, this assumption may not hold in practice, since data are usually collected asynchronously over time. It is therefore of great interest to study distributed learning algorithms with non-i.i.d. data. Recently, Sun and Lin (2020) considered distributed kernel ridge regression for strongly mixing sequences. Mixing conditions are very common assumptions on stochastic processes, and the mixing coefficients can be estimated in some cases, such as Gaussian and Markov processes. In the machine learning community, strong mixing conditions are used to quantify the dependence of samples. It is assumed in Sun and Lin (2020) that each D_k (1 ≤ k ≤ m) is a strongly mixing sequence with α-mixing coefficients α_j, and that there exists a suitable arrangement of D_1, D_2, …, D_m such that D = ∪_{k=1}^m D_k is also a strongly mixing sequence with α-mixing coefficients α_j; in addition, under some mild conditions on the regression function and the hypothesis spaces, it is shown in Sun and Lin (2020) that, as long as the number of local machines is not too large, an almost optimal convergence rate can be derived, which is comparable to the result under i.i.d. assumptions.
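Below is a minimal sketch of the divide-and-conquer kernel ridge regression estimator discussed here (local KRR fits whose predictions are averaged by a central machine), under an assumed Gaussian kernel and i.i.d. toy data; it does not implement the strong-mixing analysis of Sun and Lin (2020).

```python
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def local_krr(X, y, lam=1e-2, gamma=1.0):
    """Fit kernel ridge regression on one local subset and return a predictor."""
    K = gauss_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda Xnew: gauss_kernel(Xnew, X, gamma) @ alpha

def dac_krr_predict(X, y, Xnew, n_machines=10, **kw):
    """Divide-and-conquer KRR: split the sample, fit locally, average predictions."""
    preds = [local_krr(Xk, yk, **kw)(Xnew)
             for Xk, yk in zip(np.array_split(X, n_machines),
                               np.array_split(y, n_machines))]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
Xnew = np.linspace(-1, 1, 5)[:, None]
print(dac_krr_predict(X, y, Xnew, n_machines=10, lam=1e-3, gamma=10.0))
```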
{"title":"Discussion of: a review of distributed statistical inference","authors":"Zheng-Chu Guo","doi":"10.1080/24754269.2021.2022998","DOIUrl":"https://doi.org/10.1080/24754269.2021.2022998","url":null,"abstract":"Analysing and processing massive data is becoming ubiquitous in the era of big data. Distributed learning based on divide-and-conquer approach has attracted increasing interest in recent years, since it not only reduces computational complexity and storage requirements, but also protects the data privacy when data subsets are distributively stored on different local machines. This paper provides a comprehensive review for distributed learning with parametric models, nonparametric models and other popular models. As mentioned in this paper, nonparametric regression in reproducing kernel Hilbert spaces is popular in machine learning; however, theoretical analysis for distributed learning algorithms in reproducing kernel Hilbert spaces mainly focuses on the least-square loss functions, and results for some other loss functions are limited; it would be interesting to conduct error analysis for distributed regression with general loss functions and distributed classification in reproducing kernel Hilbert spaces. In distributed learning, a standard assumption is that the data are identically and independently drawn from some unknown probability distribution; however, this assumption may not hold in practice since data are usually collected asynchronously throughout time. It is of great interest to study distributed learning algorithms with non-i.i.d. data. Recently, Sun and Lin (2020) considered distributed kernel ridge regression for strong mixing sequences. The mixing conditions are very common assumptions in the stochastic processes and the mixing coefficients can be estimated in some cases such as Gaussian and Markov processes. In the community of machine learning, the strong mixing conditions are used to quantify the dependence of samples. It is assumed in Sun and Lin (2020) that Dk (1 ≤ k ≤ m) is a strong mixing sequence with α-mixing coefficient αj, and there exists a suitable arrangement of D1,D2, . . . ,Dm such that D = ⋃mk=1 Dk is also a strong mixing sequence with α-mixing coefficient αj; in addition, under some mild conditions on the regression function and the hypothesis spaces, it is shown in Sun and Lin (2020) that as long as the number of the local machines is not too large, an almost optimal convergence rate can be derived, which is comparable to the result under i.i.d. assumptions.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"104 - 104"},"PeriodicalIF":0.5,"publicationDate":"2022-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48277971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.2015868
Shaogao Lv, Xingcai Zhou
First of all, we would like to congratulate Dr Gao et al. on their excellent paper, which provides a comprehensive overview of the large body of existing work on distributed estimation (learning). Unlike related work (Gu et al., 2019; Liu et al., 2021; Verbraeken et al., 2020) that focuses on computing, storage and communication architecture, the current paper examines how to guarantee the statistical efficiency of a given distributed method from a statistical viewpoint. In the following, we divide our discussion into three parts:
{"title":"Discussion of: ‘A review of distributed statistical inference’","authors":"Shaogao Lv, Xingcai Zhou","doi":"10.1080/24754269.2021.2015868","DOIUrl":"https://doi.org/10.1080/24754269.2021.2015868","url":null,"abstract":"First of all, we would like to congratulate Dr Gao et al. for their excellent paper, which provides a comprehensive overview of amounts of existing work on distributed estimation (learning). Different from related work Gu et al. (2019); Liu et al. (2021); Verbraeken et al. (2020) that focus on computing, storage and communication architecture, the current paper leverages how to guarantee statistical efficiency of a given distributed method from a statistical viewpoint. In the following, we divide our discussion into three parts:","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"105 - 107"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49138265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.1984636
Pengcheng Ren, Guanfu Liu, X. Pu, Yan Li
In this paper, we propose generalized fiducial methods and construct four generalized p-values to test the existence of quantitative trait locus effects under phenotype distributions from a location-scale family. In simulation studies, compared with the likelihood ratio test, our methods perform better at controlling type I errors while retaining comparable power in cases with small or moderate sample sizes. The four generalized fiducial methods support varied scenarios: two of them are more aggressive and powerful, whereas the other two are more conservative and robust. A real data example involving mouse blood pressure is used to illustrate our proposed methods.
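The fiducial mechanics can be conveyed in a much simpler setting than QTL mapping: for a normal (location-scale) sample, the fiducial distribution of the mean is x̄ + (s/√n)·T with T ~ t_{n−1}, and a Monte Carlo fiducial p-value can be read off its tails. The sketch below is only this toy illustration (it essentially reproduces the one-sample t-test) and is not one of the four tests constructed in the paper.

```python
import numpy as np
from scipy import stats

def fiducial_pvalue_mean(x, mu0=0.0, n_draws=100_000, rng=None):
    """Monte Carlo fiducial p-value for H0: mu = mu0 in a normal sample.
    The fiducial distribution of mu is xbar + (s/sqrt(n)) * T with T ~ t_{n-1};
    the two-sided p-value is twice the smaller tail fiducial probability."""
    rng = np.random.default_rng(rng)
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    mu_fid = xbar + s / np.sqrt(n) * rng.standard_t(df=n - 1, size=n_draws)
    tail = min(np.mean(mu_fid <= mu0), np.mean(mu_fid >= mu0))
    return 2 * tail

x = np.random.default_rng(5).normal(loc=0.3, scale=1.0, size=40)
print("fiducial p-value:", fiducial_pvalue_mean(x))
print("classical t-test:", stats.ttest_1samp(x, 0.0).pvalue)   # should be close
```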
{"title":"Generalized fiducial methods for testing quantitative trait locus effects in genetic backcross studies","authors":"Pengcheng Ren, Guanfu Liu, X. Pu, Yan Li","doi":"10.1080/24754269.2021.1984636","DOIUrl":"https://doi.org/10.1080/24754269.2021.1984636","url":null,"abstract":"In this paper, we propose generalized fiducial methods and construct four generalized p-values to test the existence of quantitative trait locus effects under phenotype distributions from a location-scale family. Compared with the likelihood ratio test based on simulation studies, our methods perform better at controlling type I errors while retaining comparable power in cases with small or moderate sample sizes. The four generalized fiducial methods support varied scenarios: two of them are more aggressive and powerful, whereas the other two appear more conservative and robust. A real data example involving mouse blood pressure is used to illustrate our proposed methods.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"148 - 160"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49314125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.2017544
Heng Lian
The authors should be congratulated on their timely contribution to this emerging field with a comprehensive review, which will certainly attract more researchers into this area. In the simplest one-shot approach, the entire dataset is distributed across multiple machines, each machine computes a local estimate based on its local data only, and a central machine performs an aggregation calculation as a final processing step. In more complicated settings, multiple rounds of communication are carried out, typically also passing first-order information (gradients) and/or second-order information (Hessian matrices) between the local machines and the central machine. This review clearly separates the existing work in this area into several sections, covering parametric regression, nonparametric regression, and other models including principal component analysis and variable screening. In this discussion, I will consider some possible future directions that can be entertained in this area, based on my own personal experience. The first problem is the combination of divide-and-conquer estimation with efficient local algorithms not used in traditional statistical analysis. This is motivated by the fact that, due to the stringent constraint on the number of machines that can be used either in practice or in theory (for example, when using a one-shot approach, the number of machines that can be used is O(√N)), the sample size on each worker machine can still be large. In other words, even after partitioning, the local sample size may still be too large to be processed by traditional algorithms. In such a case, a more efficient algorithm (one that possibly approximates the exact solution) should be used on each local machine. The important question here is whether the optimal statistical properties can be retained when using such an algorithm. One such attempt with an affirmative answer is recently reported in Lian et al. (2021). In this work, we use random sketches (random projections) for kernel regression in an RKHS framework for nonparametric regression. The use of random sketches reduces the computational complexity on each worker machine while still retaining the optimal statistical convergence rate. We expect combinations along this direction to be useful in various settings, and for different settings different efficient algorithms for computing approximate solutions are called for. The second problem is to extend the studies beyond the worker-server model. Most existing methods in the statistics literature focus on a centralized system in which a single special machine communicates with all others and coordinates computation and communication. However, in many modern applications, such systems are rare and unreliable, since the failure of the central machine would be disastrous. Consideration of statistical inference in a decentralized system, synchronous or asynchronous, where there is no such specialized central machine, would be an interesting direction.
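As a concrete reference point for the 'simplest one-shot approach' described above, here is a sketch of averaging local least-squares estimates across machines, with the number of machines set to the O(√N) order mentioned in the discussion; the data and model are toy assumptions, and this is not the random-sketch method of Lian et al. (2021).

```python
import numpy as np

def one_shot_ols(X, y, n_machines=20):
    """Simplest one-shot divide-and-conquer: each machine computes OLS on its
    local shard, and the central machine averages the local estimates."""
    local = [np.linalg.lstsq(Xk, yk, rcond=None)[0]
             for Xk, yk in zip(np.array_split(X, n_machines),
                               np.array_split(y, n_machines))]
    return np.mean(local, axis=0)

rng = np.random.default_rng(6)
N, p = 100_000, 5
X = rng.standard_normal((N, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.standard_normal(N)
beta_dc = one_shot_ols(X, y, n_machines=int(np.sqrt(N)))   # m of order sqrt(N) machines
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta_dc, 3))
print(np.round(beta_full, 3))                              # the two should be very close
```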
{"title":"Discussion of the paper ‘A review of distributed statistical inference’","authors":"Heng Lian","doi":"10.1080/24754269.2021.2017544","DOIUrl":"https://doi.org/10.1080/24754269.2021.2017544","url":null,"abstract":"The authors should be congratulated on their timely contribution to this emerging field with a comprehensive review, which will certainly attract more researchers into this area. In the simplest one-shot approach, the entire dataset is distributed on multiple machines, and each machine computes a local estimate based on local data only, and a central machine performs an aggregation calculation as a final processing step. In more complicated settings, multiple communications are carried out, typically passing also first-order information (gradient) and/or second-order information (Hession matrix) between local machines and the central machine. This review clearly separates the existing works in this area into several sections, considering parameter regression, nonparametric regression, and other models including principal component analysis and variable screening. In this discussion, I will consider some possible future directions that can be entertained in this area, based on my own personal experience. The first problem is a combination of divide-and-conquer estimation with some efficient local algorithm not used in traditional statistical analysis. This is motivated by that, due to the stringent constraint on the number of machines that can be used either practically or in theory (for example, when using a one-shot approach, the number ofmachines that can be used isO( √ N)), the sample size on each worker machine can still be large. In other words, even after partitioning, the local sample sizemay still be too large to be processed by traditional algorithms. In such a case, a more efficient algorithm (one that possibly approximates the exact solution) should be used on each local machine. The important question here is whether the optimal statistical properties can be retained using such an algorithm. One such attempt with an affirmative answer is recently reported in Lian et al. (2021). In this work, we use random sketches (random projection) for kernel regression in anRKHS framework for nonparametric regression. Use of random sketches reduces the computational complexity on each worker machine, and at the same time still retains the optimal statistical convergence rate. We expect combinations along such a direction can be useful in various settings, and for different settings different efficient algorithms to compute some approximate solution are called for. The second problem is to extend the studies beyond the worker-server model. Most of the existing methods in the statistics literature are focused on the centralized system where there is a single special machine that communicates with all others and coordinates computation and communication. However, in many modern applications, such systems are rare and unreliable since the failure of the central machine would be disastrous. 
Consideration of statistical inference in a decentralized system, synchronous or asynchronous, where there is no such specialized central machine, would be an intere","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"100 - 101"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43053347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}