Distributed subsampling for multiplicative regression

Statistics and Computing | IF 1.6 | Zone 2 (Mathematics) | JCR Q2: COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-01 | DOI: 10.1007/s11222-024-10477-7

Xiaoyan Li, Xiaochao Xia, Zhimin Zhang

Citations: 0

Abstract

Multiplicative regression is a useful alternative for modeling positive response data. This paper proposes two distributed estimators for the multiplicative error model on a distributed system where massive data are not randomly distributed across machines. We first present a Poisson subsampling procedure that yields a subsampling estimator based on the least product relative error (LPRE) loss and is well suited to a distributed system. Theoretically, we justify the subsampling estimator by establishing its convergence rate and asymptotic normality and by deriving the optimal subsampling probabilities under the L-optimality criterion. We then provide a distributed LPRE estimator based on Poisson subsampling (DLPRE-P). It is communication-efficient because each local machine needs to transmit only a very small subsample, together with the gradient of the loss, to the central site, which is empirically feasible. In practice, because Newton–Raphson iteration is used, the Hessian matrix can be computed more robustly from the subsampled data than from a single local dataset. We also show that the DLPRE-P estimator is as statistically efficient as the global estimator, which is obtained by pooling all the datasets. Furthermore, we propose a distributed regularized LPRE estimator (DRLPRE-P) to address variable selection in high dimensions. A distributed algorithm based on the alternating direction method of multipliers (ADMM) is developed to implement DRLPRE-P, for which the oracle property holds. Finally, simulation experiments and two real-world data analyses illustrate the performance of our methods.
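To make the subsampling idea concrete, here is a minimal single-machine sketch in Python, not the authors' implementation. It assumes the standard LPRE criterion, sum_i (y_i exp(-x_i'beta) + exp(x_i'beta)/y_i - 2), fits it by Newton–Raphson, and draws a Poisson subsample with inverse-probability weights. The gradient-norm sampling scores, the 10% uniform mixing, the pilot size `r0`, and all function names are illustrative assumptions; the paper's exact L-optimal probabilities are not reproduced here.

```python
import numpy as np

def lpre_grad_hess(beta, X, y, w=None):
    """Weighted gradient and Hessian of the LPRE loss
    sum_i w_i * (y_i * exp(-x_i'beta) + exp(x_i'beta) / y_i - 2)."""
    if w is None:
        w = np.ones(len(y))
    eta = X @ beta
    a = y * np.exp(-eta)          # y_i * exp(-x_i'beta)
    b = np.exp(eta) / y           # exp(x_i'beta) / y_i
    grad = X.T @ (w * (b - a))
    hess = (X * (w * (a + b))[:, None]).T @ X
    return grad, hess

def newton_lpre(X, y, w=None, n_iter=50, tol=1e-10):
    """Newton-Raphson for the (weighted) LPRE loss.
    Starting from OLS on log(y) is an assumption made for stability."""
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        grad, hess = lpre_grad_hess(beta, X, y, w)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

def poisson_subsample_lpre(X, y, r, r0=200, rng=None):
    """Poisson-subsampling LPRE estimate with expected subsample size <= r."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Pilot estimate from a uniform Poisson subsample of expected size r0.
    pilot = rng.random(n) < r0 / n
    beta0 = newton_lpre(X[pilot], y[pilot])
    # Assumed scores: per-observation gradient norm at the pilot estimate.
    eta = X @ beta0
    score = np.abs(np.exp(eta) / y - y * np.exp(-eta)) * np.linalg.norm(X, axis=1)
    # Mix with uniform probabilities (assumption) so no weight explodes.
    pi = np.minimum(1.0, r * (0.9 * score / score.sum() + 0.1 / n))
    # Poisson sampling: every point is kept independently of the others,
    # so each machine could make this decision locally given score.sum().
    keep = rng.random(n) < pi
    beta = newton_lpre(X[keep], y[keep], w=1.0 / pi[keep])
    return beta, int(keep.sum())

# Small simulated example: y = exp(x'beta) * eps with log-normal eps.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -0.3, 0.2])
y = np.exp(X @ beta_true + 0.5 * rng.normal(size=n))
beta_hat, n_kept = poisson_subsample_lpre(X, y, r=2000, rng=1)
print(beta_hat, n_kept)
```

Because each inclusion decision is an independent coin flip, the subsample size is random and only its expectation is bounded by `r`. In a multi-machine setting, the one global quantity this sketch needs is the sum of the scores, which each site could report as a single number.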

Source journal

Statistics and Computing (Mathematics; Computer Science: Theory & Methods)

CiteScore: 3.20

Self-citation rate: 4.50%

Review time: 6-12 weeks

About the journal: Statistics and Computing is a bi-monthly refereed journal which publishes papers covering the range of the interface between the statistical and computing sciences. In particular, it addresses the use of statistical concepts in computing science, for example in machine learning, computer vision and data analytics, as well as the use of computers in data modelling, prediction and analysis. Specific topics which are covered include: techniques for evaluating analytically intractable problems such as bootstrap resampling, Markov chain Monte Carlo, sequential Monte Carlo, approximate Bayesian computation, search and optimization methods, stochastic simulation and Monte Carlo, graphics, computer environments, statistical approaches to software errors, information retrieval, machine learning, statistics of databases and database technology, huge data sets and big data analytics, computer algebra, graphical models, image processing, tomography, inverse problems and uncertainty quantification. In addition, the journal contains original research reports, authoritative review papers, discussed papers, and occasional special issues on particular topics or carrying proceedings of relevant conferences. Statistics and Computing also publishes book review and software review sections.

Latest articles in this journal

Accelerated failure time models with error-prone response and nonlinear covariates

Sequential model identification with reversible jump ensemble data assimilation method

Hidden Markov models for multivariate panel data

Shrinkage for extreme partial least-squares

Nonconvex Dantzig selector and its parallel computing algorithm