Communication lower bounds for statistical estimation problems via a distributed data processing inequality

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. Pub Date: 2015-06-24. DOI: 10.1145/2897518.2897582

M. Braverman, A. Garg, Tengyu Ma, Huy L. Nguyen, David P. Woodruff


Citations: 154

Abstract



We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the m machines receives n data points from a d-dimensional Gaussian distribution with unknown mean θ, which is promised to be k-sparse. The machines communicate by message passing and aim to estimate the mean θ. We provide a tight (up to logarithmic factors) tradeoff between the estimation error and the number of bits communicated between the machines. This directly leads to a lower bound for the distributed sparse linear regression problem: to achieve the statistical minimax error, the total communication is at least Ω(min{n,d}m), where n is the number of observations that each machine receives and d is the ambient dimension. These lower bounds improve upon Shamir (NIPS'14) and Steinhardt-Duchi (COLT'15) by allowing a multi-round iterative communication model. We also give the first optimal simultaneous protocol in the dense case for mean estimation. As our main technique, we prove a distributed data processing inequality, a generalization of the usual data processing inequalities, which might be of independent interest and useful for other problems.
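To make the problem setting concrete, here is a minimal Python sketch of a naive one-round baseline for distributed sparse Gaussian mean estimation. It is not the protocol or the lower-bound argument from the paper. The function name, the uniform quantizer, the clipping range, the choice of θ (k coordinates equal to 1) and all default parameters are assumptions made for illustration only. Each machine sends its local mean quantized to a fixed number of bits per coordinate, and a coordinator averages the messages and keeps the k largest coordinates.

```python
import random


def naive_sparse_mean_protocol(m=20, n=50, d=100, k=5, bits=8, clip=4.0, seed=0):
    """Toy one-round protocol: every machine sends its quantized local mean."""
    rng = random.Random(seed)

    # Hypothetical k-sparse mean: k random coordinates equal to 1, the rest 0.
    theta = [0.0] * d
    for j in rng.sample(range(d), k):
        theta[j] = 1.0

    levels = 2 ** bits - 1      # number of quantization steps
    step = 2 * clip / levels    # grid spacing on [-clip, clip]

    def quantize(x):
        # Clip to [-clip, clip], then round to the nearest grid point.
        x = max(-clip, min(clip, x))
        return round((x + clip) / step) * step - clip

    received = [0.0] * d
    for _ in range(m):
        for j in range(d):
            # Local mean of n samples from N(theta_j, 1) held by this machine.
            local = sum(rng.gauss(theta[j], 1.0) for _ in range(n)) / n
            received[j] += quantize(local)

    average = [s / m for s in received]

    # Coordinator keeps the k largest-magnitude coordinates and zeroes the rest.
    order = sorted(range(d), key=lambda j: abs(average[j]), reverse=True)
    keep = set(order[:k])
    estimate = [average[j] if j in keep else 0.0 for j in range(d)]

    squared_error = sum((e - t) ** 2 for e, t in zip(estimate, theta))
    total_bits = m * d * bits   # every machine sends d coordinates of `bits` bits
    return estimate, squared_error, total_bits


if __name__ == "__main__":
    _, err, cost = naive_sparse_mean_protocol()
    print(f"squared error: {err:.4f}, bits communicated: {cost}")
```

This baseline always communicates m·d·bits bits, regardless of the sparsity k. The tradeoff studied in the abstract concerns how the achievable estimation error must change when the total number of communicated bits is restricted, including for protocols with multiple rounds.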


Latest papers from this venue

Exponential separation of communication and external information
Proceedings of the forty-eighth annual ACM symposium on Theory of Computing
Explicit two-source extractors and resilient functions
Constant-rate coding for multiparty interactive communication is impossible
Approximating connectivity domination in weighted bounded-genus graphs