Pub Date: 2022-03-13. DOI: 10.1080/24754269.2022.2048445
Xueping Chen, Jianzhong Liu, Jiandong Chen
The orthogonal matching pursuit (OMP) algorithm is a classical greedy algorithm widely used in compressed sensing. In this paper, by exploiting the Wielandt inequality and some properties of orthogonal projection matrices, we obtain a new bound on the number of iterations required for the OMP algorithm to exactly recover sparse signals, which improves significantly upon the best results known to us.
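For readers unfamiliar with the algorithm itself, here is a minimal sketch of the standard OMP iteration (greedy column selection followed by an orthogonal projection onto the selected support). The dictionary, sparsity level and stopping rule are illustrative assumptions only; the sketch does not implement the iteration bound derived in the paper.

```python
import numpy as np

def omp(A, y, k):
    """Standard orthogonal matching pursuit: greedily select up to k columns of A
    and re-fit y on the selected columns by least squares at every step."""
    n_features = A.shape[1]
    residual = y.copy()
    support = []
    x_hat = np.zeros(n_features)
    for _ in range(k):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # orthogonal projection: least-squares fit on the selected support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat[support] = coef
    return x_hat, support

# tiny usage example with a synthetic 5-sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((128, 256))
A /= np.linalg.norm(A, axis=0)                    # unit-norm columns, as usual in CS
x_true = np.zeros(256)
x_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
y = A @ x_true
x_hat, S = omp(A, y, k=5)
print(np.allclose(x_hat, x_true, atol=1e-8))
```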
{"title":"A new result on recovery sparse signals using orthogonal matching pursuit","authors":"Xueping Chen, Jianzhong Liu, Jiandong Chen","doi":"10.1080/24754269.2022.2048445","DOIUrl":"https://doi.org/10.1080/24754269.2022.2048445","url":null,"abstract":"Orthogonal matching pursuit (OMP) algorithm is a classical greedy algorithm widely used in compressed sensing. In this paper, by exploiting the Wielandt inequality and some properties of orthogonal projection matrix, we obtained a new number of iterations required for the OMP algorithm to perform exact recovery of sparse signals, which improves significantly upon the latest results as we know.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"220 - 226"},"PeriodicalIF":0.5,"publicationDate":"2022-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43660484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-17. DOI: 10.1080/24754269.2022.2037201
J. Qin, Yukun Liu, Pengfei Li
In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or no efficiency loss compared with the corresponding methods based on full individual-level data. Finally, we review some recently developed big data analysis methods, including communication-efficient distributed approaches, renewal estimation, and incremental inference, as examples of the latest developments in methods using calibration information.
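As one concrete example of using only calibration (summary) information, the sketch below implements the classical fixed-effect, inverse-variance meta-analysis estimator that pools study-level estimates and standard errors; the study numbers are made up for illustration and the function name is ours.

```python
import numpy as np

def fixed_effect_meta(estimates, std_errors):
    """Fixed-effect meta-analysis: inverse-variance weighted average of
    study-level estimates, plus the standard error of the pooled estimate."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    pooled = np.sum(weights * estimates) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

# three hypothetical studies reporting only summary statistics
est, se = fixed_effect_meta([0.42, 0.35, 0.50], [0.10, 0.08, 0.15])
print(f"pooled estimate {est:.3f} (SE {se:.3f})")
```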
{"title":"A selective review of statistical methods using calibration information from similar studies","authors":"J. Qin, Yukun Liu, Pengfei Li","doi":"10.1080/24754269.2022.2037201","DOIUrl":"https://doi.org/10.1080/24754269.2022.2037201","url":null,"abstract":"In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or nearly no efficiency loss compared with the corresponding methods based on all-individual data. Finally, we review some recently developed big data analysis methods including communication-efficient distributed approaches, renewal estimation, and incremental inference as examples of the latest developments in methods using calibration information.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"175 - 190"},"PeriodicalIF":0.5,"publicationDate":"2022-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42114372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-17. DOI: 10.1080/24754269.2022.2037204
Rongjie Jiang, Liming Wang, Yang Bai
In this paper, we study optimal model averaging estimators of regression coefficients in multinomial logit models, which are commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine the averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criteria, model averaging methods, and some other related methods in terms of KL loss and mean squared forecast error. Finally, a website phishing dataset is used to illustrate the proposed method.
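A hedged sketch of the general model-averaging mechanics the abstract describes: candidate multinomial logit models are fitted, and simplex weights are chosen by minimizing an empirical KL-type (negative log-likelihood) loss of the averaged class probabilities. The candidate models, data and optimizer below are illustrative assumptions, not the exact criterion or theory of the paper.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

def averaging_weights(prob_list, y):
    """Choose simplex weights for candidate models by minimizing the negative
    log-likelihood (an empirical KL-type loss) of the averaged class probabilities."""
    M, n = len(prob_list), len(y)

    def neg_loglik(v):
        w = np.exp(v) / np.exp(v).sum()            # softmax keeps w on the simplex
        p = sum(wm * Pm for wm, Pm in zip(w, prob_list))
        return -np.log(p[np.arange(n), y] + 1e-12).mean()

    res = minimize(neg_loglik, np.zeros(M), method="BFGS")
    return np.exp(res.x) / np.exp(res.x).sum()

# toy illustration: two candidate multinomial logit models using different covariates
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4))
y = rng.integers(0, 3, size=300)
cands = [LogisticRegression(max_iter=1000).fit(X[:, :2], y),
         LogisticRegression(max_iter=1000).fit(X, y)]
probs = [cands[0].predict_proba(X[:, :2]), cands[1].predict_proba(X)]
w = averaging_weights(probs, y)
p_avg = w[0] * probs[0] + w[1] * probs[1]          # model-averaged probabilities
print("weights:", np.round(w, 3))
```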
{"title":"Optimal model averaging estimator for multinomial logit models","authors":"Rongjie Jiang, Liming Wang, Yang Bai","doi":"10.1080/24754269.2022.2037204","DOIUrl":"https://doi.org/10.1080/24754269.2022.2037204","url":null,"abstract":"In this paper, we study optimal model averaging estimators of regression coefficients in a multinomial logit model, which is commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criterions, model averaging methods, as well as some other related methods in terms of the KL loss and mean squared forecast error. Finally, the website phishing data is used to illustrate the proposed method.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"227 - 240"},"PeriodicalIF":0.5,"publicationDate":"2022-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41982683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuan Gao(a), Weidong Liu(b), Hansheng Wang(c), Xiaozhou Wang(a), Yibo Yan(a) and Riquan Zhang(a). (a) School of Statistics and Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, East China Normal University, Shanghai, People’s Republic of China; (b) School of Mathematical Sciences – School of Life Sciences and Biotechnology – MOE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; (c) Guanghua School of Management, Peking University, Beijing, People’s Republic of China
{"title":"Rejoinder on ‘A review of distributed statistical inference’","authors":"Yuan Gao, Weidong Liu, Hansheng Wang, Xiaozhou Wang, Yibo Yan, Riquan Zhang","doi":"10.1080/24754269.2022.2035304","DOIUrl":"https://doi.org/10.1080/24754269.2022.2035304","url":null,"abstract":"Yuan Gaoa, Weidong Liub, Hansheng Wangc, Xiaozhou Wanga, Yibo Yana and Riquan Zhanga aSchool of Statistics and Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, East China Normal University, Shanghai, People’s Republic of China; bSchool of Mathematical Sciences – School of Life Sciences and Biotechnology – MOE Key Lab of Artifcial Intelligence, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; cGuanghua School of Management, Peking University, Beijing, People’s Republic of China","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"111 - 113"},"PeriodicalIF":0.5,"publicationDate":"2022-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46555795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-04. DOI: 10.1080/24754269.2022.2105486
Chao-Qun Yuan, Yang Wu, Fang Fang
Fragmentary data are becoming more and more common in many areas, which brings big challenges to researchers and data analysts. Most existing methods for fragmentary data consider a continuous response, while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback–Leibler loss on the complete cases, and its asymptotic optimality is established. Empirical evidence from a simulation study and a real data analysis of Alzheimer's disease is presented.
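To illustrate the fragmentary-data setting, here is a small sketch, under assumed data and candidate models, of fitting candidate logistic (GLM) models on different covariate-availability patterns and choosing the averaging weight by minimizing the KL (log) loss on the complete cases; it is meant only to convey the structure, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical fragmentary data: covariate x2 is missing for part of the sample
rng = np.random.default_rng(2)
n = 500
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * x1 + 0.5 * x2))))
x2_obs = x2.copy()
x2_obs[rng.random(n) < 0.4] = np.nan               # 40% of x2 is unobserved
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2_obs})

# candidate 1: x1 only, fitted on all cases; candidate 2: x1 + x2, fitted on complete cases
m1 = LogisticRegression().fit(df[["x1"]], df["y"])
cc = df.dropna()
m2 = LogisticRegression().fit(cc[["x1", "x2"]], cc["y"])

# averaging weight chosen on the complete cases by minimizing the KL (log) loss over a grid
p1 = m1.predict_proba(cc[["x1"]])[:, 1]
p2 = m2.predict_proba(cc[["x1", "x2"]])[:, 1]
grid = np.linspace(0, 1, 101)
loss = [-np.mean(cc["y"] * np.log(w * p1 + (1 - w) * p2 + 1e-12)
                 + (1 - cc["y"]) * np.log(1 - (w * p1 + (1 - w) * p2) + 1e-12))
        for w in grid]
w_hat = grid[int(np.argmin(loss))]
print("weight on the x1-only candidate model:", w_hat)
```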
{"title":"Model averaging for generalized linear models in fragmentary data prediction","authors":"Chao-Qun Yuan, Yang Wu, Fang Fang","doi":"10.1080/24754269.2022.2105486","DOIUrl":"https://doi.org/10.1080/24754269.2022.2105486","url":null,"abstract":"ABSTRACT Fragmentary data is becoming more and more popular in many areas which brings big challenges to researchers and data analysts. Most existing methods dealing with fragmentary data consider a continuous response while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback–Leibler loss in the completed cases and its asymptotic optimality is established. Empirical evidences from a simulation study and a real data analysis about Alzheimer disease are presented.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"344 - 352"},"PeriodicalIF":0.5,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48024239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-04. DOI: 10.1080/24754269.2022.2030107
Yang Yu, Guang Cheng
We congratulate the authors on an impressive team effort to comprehensively review various statistical estimation and inference methods in distributed frameworks. This paper is an excellent resource for anyone wishing to understand why distributed inference is important in the era of big data, what the challenges of conducting distributed inference instead of centralized inference are, and how statisticians propose solutions to overcome these challenges. First, we notice that this paper focuses mainly on distributed estimation, and we would like to point out several other works on distributed inference. For smooth loss functions, Jordan et al. (2018) established asymptotic normality for their multi-round distributed estimator, which yields two communication-efficient approaches to constructing confidence regions using a sandwiched covariance matrix. For non-smooth loss functions, Chen et al. (2021) similarly proposed a sandwich-type confidence interval based on the asymptotic normality of their distributed estimator. More generic inference approaches, such as the bootstrap, have also been studied in the massive data setting, including the distributed framework. The authors reviewed the Bag of Little Bootstraps (BLB) method proposed by Kleiner et al. (2014), which repeatedly resamples and refits the model at each local machine and finally aggregates the bootstrap statistics. Considering the huge computational cost of BLB, Sengupta et al. (2016) proposed the Subsampled Double Bootstrap (SDB) method, which has higher computational efficiency but requires a large number of local machines to maintain statistical accuracy. In addition to distributed samples, the dimensionality can also become large in the big data era, and in this case researchers may be more interested in simultaneous inference on multiple parameters. In the centralized setting, the bootstrap is one of the solutions to simultaneous inference problems (Zhang & Cheng, 2017). In a distributed framework where the dimensionality grows, Yu et al. (2020) proposed distributed bootstrap methods for simultaneous inference, which not only are efficient in terms of both communication and
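For concreteness, here is a minimal sketch of the BLB idea mentioned above, written for the standard error of a sample mean: each 'local' subset emulates size-n resamples through multinomial weights, and the per-subset bootstrap results are aggregated at the end. The subset and resample counts are arbitrary choices for illustration.

```python
import numpy as np

def blb_mean_se(x, n_subsets=10, n_boot=100, rng=None):
    """Bag of Little Bootstraps sketch for the standard error of the mean:
    each subset resamples up to the full sample size n via multinomial weights,
    and the per-subset bootstrap SEs are averaged at the end."""
    rng = np.random.default_rng(rng)
    n = len(x)
    subsets = np.array_split(rng.permutation(x), n_subsets)
    subset_ses = []
    for subset in subsets:
        b = len(subset)
        boot_means = []
        for _ in range(n_boot):
            # multinomial weights emulate a size-n resample without materializing it
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            boot_means.append(np.dot(counts, subset) / n)
        subset_ses.append(np.std(boot_means, ddof=1))
    return float(np.mean(subset_ses))

x = np.random.default_rng(3).standard_normal(100_000)
print(blb_mean_se(x))      # should be close to 1/sqrt(100000) ≈ 0.0032
```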
{"title":"Discussion on ‘A review of distributed statistical inference’","authors":"Yang Yu, Guang Cheng","doi":"10.1080/24754269.2022.2030107","DOIUrl":"https://doi.org/10.1080/24754269.2022.2030107","url":null,"abstract":"We congratulate the authors on an impressive team effort to comprehensively review various statistical estimation and inference methods in distributed frameworks. This paper is an excellent resource for anyone wishing to understand why distributed inference is important in the era of big data, what the challenges of conducting distributed inference instead of centralized inference are, and how statisticians propose solutions to overcome these challenges. First, we notice that this paper focuses mainly on distributed estimation, and we would like to point out several other works on distributed inference. For smooth loss functions, Jordan et al. (2018) established asymptotic normality for their multi-round distributed estimator, which yields two communication-efficient approaches to constructing confidence regions using a sandwiched covariance matrix. For non-smooth loss functions, Chen et al. (2021) similarly proposed a sandwich-type confidence interval based on the asymptotic normality of their distributed estimator. More generic inference approaches, such as bootstrap, have also been studied in the massive data setting including the distributed framework. The authors reviewed the Bag of Little Bootstraps (BLB) method proposed by Kleiner et al. (2014), which is to repeatedly resample and refit the model at each local machine and finally aggregate the bootstrap statistics. Considering the huge computational cost of BLB, Sengupta et al. (2016) proposed the Subsampled Double Bootstrap (SDB) method, which has higher computational efficiency but requires a large number of local machines to maintain statistical accuracy. In addition to distributed samples, the dimensionality can also become large in the big data era, and in this case researchers may be more interested in simultaneous inference onmultiple parameters. In the centralized setting, bootstrap is one of the solutions to the simultaneous inference problems (Zhang & Cheng, 2017). In a distributed framework where the dimensionality grows, Yu et al. (2020) proposed distributed bootstrap methods for simultaneous inference, which not only are efficient in terms of both communication and","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"102 - 103"},"PeriodicalIF":0.5,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48788970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-12. DOI: 10.1080/24754269.2021.2022998
Zheng-Chu Guo
Analysing and processing massive data is becoming ubiquitous in the era of big data. Distributed learning based on the divide-and-conquer approach has attracted increasing interest in recent years, since it not only reduces computational complexity and storage requirements, but also protects data privacy when data subsets are stored distributively on different local machines. This paper provides a comprehensive review of distributed learning with parametric models, nonparametric models and other popular models. As mentioned in the paper, nonparametric regression in reproducing kernel Hilbert spaces is popular in machine learning; however, theoretical analysis of distributed learning algorithms in reproducing kernel Hilbert spaces mainly focuses on the least-squares loss, and results for other loss functions are limited; it would be interesting to conduct error analysis for distributed regression with general loss functions and for distributed classification in reproducing kernel Hilbert spaces. In distributed learning, a standard assumption is that the data are independently and identically drawn from some unknown probability distribution; however, this assumption may not hold in practice, since data are usually collected asynchronously over time. It is therefore of great interest to study distributed learning algorithms with non-i.i.d. data. Recently, Sun and Lin (2020) considered distributed kernel ridge regression for strongly mixing sequences. Mixing conditions are very common assumptions on stochastic processes, and the mixing coefficients can be estimated in some cases, such as Gaussian and Markov processes. In the machine learning community, strong mixing conditions are used to quantify the dependence of samples. It is assumed in Sun and Lin (2020) that each D_k (1 ≤ k ≤ m) is a strongly mixing sequence with α-mixing coefficients α_j, and that there exists a suitable arrangement of D_1, D_2, …, D_m such that D = ∪_{k=1}^m D_k is also a strongly mixing sequence with α-mixing coefficients α_j; in addition, under some mild conditions on the regression function and the hypothesis spaces, it is shown in Sun and Lin (2020) that, as long as the number of local machines is not too large, an almost optimal convergence rate can be derived, which is comparable to the result under i.i.d. assumptions.
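Below is a minimal sketch of the divide-and-conquer kernel ridge regression estimator discussed here (local KRR fits whose predictions are averaged by a central machine), under an assumed Gaussian kernel and i.i.d. toy data; it does not implement the strong-mixing analysis of Sun and Lin (2020).

```python
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def local_krr(X, y, lam=1e-2, gamma=1.0):
    """Fit kernel ridge regression on one local subset and return a predictor."""
    K = gauss_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda Xnew: gauss_kernel(Xnew, X, gamma) @ alpha

def dac_krr_predict(X, y, Xnew, n_machines=10, **kw):
    """Divide-and-conquer KRR: split the sample, fit locally, average predictions."""
    preds = [local_krr(Xk, yk, **kw)(Xnew)
             for Xk, yk in zip(np.array_split(X, n_machines),
                               np.array_split(y, n_machines))]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
Xnew = np.linspace(-1, 1, 5)[:, None]
print(dac_krr_predict(X, y, Xnew, n_machines=10, lam=1e-3, gamma=10.0))
```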
{"title":"Discussion of: a review of distributed statistical inference","authors":"Zheng-Chu Guo","doi":"10.1080/24754269.2021.2022998","DOIUrl":"https://doi.org/10.1080/24754269.2021.2022998","url":null,"abstract":"Analysing and processing massive data is becoming ubiquitous in the era of big data. Distributed learning based on divide-and-conquer approach has attracted increasing interest in recent years, since it not only reduces computational complexity and storage requirements, but also protects the data privacy when data subsets are distributively stored on different local machines. This paper provides a comprehensive review for distributed learning with parametric models, nonparametric models and other popular models. As mentioned in this paper, nonparametric regression in reproducing kernel Hilbert spaces is popular in machine learning; however, theoretical analysis for distributed learning algorithms in reproducing kernel Hilbert spaces mainly focuses on the least-square loss functions, and results for some other loss functions are limited; it would be interesting to conduct error analysis for distributed regression with general loss functions and distributed classification in reproducing kernel Hilbert spaces. In distributed learning, a standard assumption is that the data are identically and independently drawn from some unknown probability distribution; however, this assumption may not hold in practice since data are usually collected asynchronously throughout time. It is of great interest to study distributed learning algorithms with non-i.i.d. data. Recently, Sun and Lin (2020) considered distributed kernel ridge regression for strong mixing sequences. The mixing conditions are very common assumptions in the stochastic processes and the mixing coefficients can be estimated in some cases such as Gaussian and Markov processes. In the community of machine learning, the strong mixing conditions are used to quantify the dependence of samples. It is assumed in Sun and Lin (2020) that Dk (1 ≤ k ≤ m) is a strong mixing sequence with α-mixing coefficient αj, and there exists a suitable arrangement of D1,D2, . . . ,Dm such that D = ⋃mk=1 Dk is also a strong mixing sequence with α-mixing coefficient αj; in addition, under some mild conditions on the regression function and the hypothesis spaces, it is shown in Sun and Lin (2020) that as long as the number of the local machines is not too large, an almost optimal convergence rate can be derived, which is comparable to the result under i.i.d. assumptions.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"104 - 104"},"PeriodicalIF":0.5,"publicationDate":"2022-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48277971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.2015868
Shaogao Lv, Xingcai Zhou
First of all, we would like to congratulate Dr Gao et al. on their excellent paper, which provides a comprehensive overview of the large body of existing work on distributed estimation (learning). Unlike related work (Gu et al., 2019; Liu et al., 2021; Verbraeken et al., 2020) that focuses on computing, storage and communication architecture, the current paper examines how to guarantee the statistical efficiency of a given distributed method from a statistical viewpoint. In the following, we divide our discussion into three parts:
{"title":"Discussion of: ‘A review of distributed statistical inference’","authors":"Shaogao Lv, Xingcai Zhou","doi":"10.1080/24754269.2021.2015868","DOIUrl":"https://doi.org/10.1080/24754269.2021.2015868","url":null,"abstract":"First of all, we would like to congratulate Dr Gao et al. for their excellent paper, which provides a comprehensive overview of amounts of existing work on distributed estimation (learning). Different from related work Gu et al. (2019); Liu et al. (2021); Verbraeken et al. (2020) that focus on computing, storage and communication architecture, the current paper leverages how to guarantee statistical efficiency of a given distributed method from a statistical viewpoint. In the following, we divide our discussion into three parts:","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"105 - 107"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49138265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.1984636
Pengcheng Ren, Guanfu Liu, X. Pu, Yan Li
In this paper, we propose generalized fiducial methods and construct four generalized p-values to test the existence of quantitative trait locus effects under phenotype distributions from a location-scale family. In simulation studies, compared with the likelihood ratio test, our methods perform better at controlling type I errors while retaining comparable power in cases with small or moderate sample sizes. The four generalized fiducial methods support varied scenarios: two of them are more aggressive and powerful, whereas the other two are more conservative and robust. A real data example involving mouse blood pressure is used to illustrate our proposed methods.
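The fiducial mechanics can be conveyed in a much simpler setting than QTL mapping: for a normal (location-scale) sample, the fiducial distribution of the mean is x̄ + (s/√n)·T with T ~ t_{n−1}, and a Monte Carlo fiducial p-value can be read off its tails. The sketch below is only this toy illustration (it essentially reproduces the one-sample t-test) and is not one of the four tests constructed in the paper.

```python
import numpy as np
from scipy import stats

def fiducial_pvalue_mean(x, mu0=0.0, n_draws=100_000, rng=None):
    """Monte Carlo fiducial p-value for H0: mu = mu0 in a normal sample.
    The fiducial distribution of mu is xbar + (s/sqrt(n)) * T with T ~ t_{n-1};
    the two-sided p-value is twice the smaller tail fiducial probability."""
    rng = np.random.default_rng(rng)
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    mu_fid = xbar + s / np.sqrt(n) * rng.standard_t(df=n - 1, size=n_draws)
    tail = min(np.mean(mu_fid <= mu0), np.mean(mu_fid >= mu0))
    return 2 * tail

x = np.random.default_rng(5).normal(loc=0.3, scale=1.0, size=40)
print("fiducial p-value:", fiducial_pvalue_mean(x))
print("classical t-test:", stats.ttest_1samp(x, 0.0).pvalue)   # should be close
```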
{"title":"Generalized fiducial methods for testing quantitative trait locus effects in genetic backcross studies","authors":"Pengcheng Ren, Guanfu Liu, X. Pu, Yan Li","doi":"10.1080/24754269.2021.1984636","DOIUrl":"https://doi.org/10.1080/24754269.2021.1984636","url":null,"abstract":"In this paper, we propose generalized fiducial methods and construct four generalized p-values to test the existence of quantitative trait locus effects under phenotype distributions from a location-scale family. Compared with the likelihood ratio test based on simulation studies, our methods perform better at controlling type I errors while retaining comparable power in cases with small or moderate sample sizes. The four generalized fiducial methods support varied scenarios: two of them are more aggressive and powerful, whereas the other two appear more conservative and robust. A real data example involving mouse blood pressure is used to illustrate our proposed methods.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"148 - 160"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49314125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.2017544
Heng Lian
The authors should be congratulated on their timely contribution to this emerging field with a comprehensive review, which will certainly attract more researchers into this area. In the simplest one-shot approach, the entire dataset is distributed across multiple machines, each machine computes a local estimate based on its local data only, and a central machine performs an aggregation calculation as a final processing step. In more complicated settings, multiple rounds of communication are carried out, typically also passing first-order information (gradients) and/or second-order information (Hessian matrices) between the local machines and the central machine. This review clearly separates the existing work in this area into several sections, covering parametric regression, nonparametric regression, and other models including principal component analysis and variable screening. In this discussion, I will consider some possible future directions that can be entertained in this area, based on my own personal experience. The first problem is the combination of divide-and-conquer estimation with efficient local algorithms not used in traditional statistical analysis. This is motivated by the fact that, due to the stringent constraint on the number of machines that can be used either in practice or in theory (for example, when using a one-shot approach, the number of machines that can be used is O(√N)), the sample size on each worker machine can still be large. In other words, even after partitioning, the local sample size may still be too large to be processed by traditional algorithms. In such a case, a more efficient algorithm (one that possibly approximates the exact solution) should be used on each local machine. The important question here is whether the optimal statistical properties can be retained when using such an algorithm. One such attempt with an affirmative answer is recently reported in Lian et al. (2021). In this work, we use random sketches (random projections) for kernel regression in an RKHS framework for nonparametric regression. The use of random sketches reduces the computational complexity on each worker machine while still retaining the optimal statistical convergence rate. We expect combinations along this direction to be useful in various settings, and for different settings different efficient algorithms for computing approximate solutions are called for. The second problem is to extend the studies beyond the worker-server model. Most existing methods in the statistics literature focus on a centralized system in which a single special machine communicates with all others and coordinates computation and communication. However, in many modern applications, such systems are rare and unreliable, since the failure of the central machine would be disastrous. Consideration of statistical inference in a decentralized system, synchronous or asynchronous, where there is no such specialized central machine, would be an interesting direction.
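As a concrete reference point for the 'simplest one-shot approach' described above, here is a sketch of averaging local least-squares estimates across machines, with the number of machines set to the O(√N) order mentioned in the discussion; the data and model are toy assumptions, and this is not the random-sketch method of Lian et al. (2021).

```python
import numpy as np

def one_shot_ols(X, y, n_machines=20):
    """Simplest one-shot divide-and-conquer: each machine computes OLS on its
    local shard, and the central machine averages the local estimates."""
    local = [np.linalg.lstsq(Xk, yk, rcond=None)[0]
             for Xk, yk in zip(np.array_split(X, n_machines),
                               np.array_split(y, n_machines))]
    return np.mean(local, axis=0)

rng = np.random.default_rng(6)
N, p = 100_000, 5
X = rng.standard_normal((N, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.standard_normal(N)
beta_dc = one_shot_ols(X, y, n_machines=int(np.sqrt(N)))   # m of order sqrt(N) machines
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta_dc, 3))
print(np.round(beta_full, 3))                              # the two should be very close
```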
{"title":"Discussion of the paper ‘A review of distributed statistical inference’","authors":"Heng Lian","doi":"10.1080/24754269.2021.2017544","DOIUrl":"https://doi.org/10.1080/24754269.2021.2017544","url":null,"abstract":"The authors should be congratulated on their timely contribution to this emerging field with a comprehensive review, which will certainly attract more researchers into this area. In the simplest one-shot approach, the entire dataset is distributed on multiple machines, and each machine computes a local estimate based on local data only, and a central machine performs an aggregation calculation as a final processing step. In more complicated settings, multiple communications are carried out, typically passing also first-order information (gradient) and/or second-order information (Hession matrix) between local machines and the central machine. This review clearly separates the existing works in this area into several sections, considering parameter regression, nonparametric regression, and other models including principal component analysis and variable screening. In this discussion, I will consider some possible future directions that can be entertained in this area, based on my own personal experience. The first problem is a combination of divide-and-conquer estimation with some efficient local algorithm not used in traditional statistical analysis. This is motivated by that, due to the stringent constraint on the number of machines that can be used either practically or in theory (for example, when using a one-shot approach, the number ofmachines that can be used isO( √ N)), the sample size on each worker machine can still be large. In other words, even after partitioning, the local sample sizemay still be too large to be processed by traditional algorithms. In such a case, a more efficient algorithm (one that possibly approximates the exact solution) should be used on each local machine. The important question here is whether the optimal statistical properties can be retained using such an algorithm. One such attempt with an affirmative answer is recently reported in Lian et al. (2021). In this work, we use random sketches (random projection) for kernel regression in anRKHS framework for nonparametric regression. Use of random sketches reduces the computational complexity on each worker machine, and at the same time still retains the optimal statistical convergence rate. We expect combinations along such a direction can be useful in various settings, and for different settings different efficient algorithms to compute some approximate solution are called for. The second problem is to extend the studies beyond the worker-server model. Most of the existing methods in the statistics literature are focused on the centralized system where there is a single special machine that communicates with all others and coordinates computation and communication. However, in many modern applications, such systems are rare and unreliable since the failure of the central machine would be disastrous. 
Consideration of statistical inference in a decentralized system, synchronous or asynchronous, where there is no such specialized central machine, would be an intere","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"100 - 101"},"PeriodicalIF":0.5,"publicationDate":"2021-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43053347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}