Optimal principal component analysis in distributed and streaming models

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing Pub Date : 2015-04-25 DOI:10.1145/2897518.2897646

Christos Boutsidis, David P. Woodruff, Peilin Zhong

Citations: 111

Abstract

This paper studies the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in \mathbb{R}^{m \times n}$, a rank parameter $k < \mathrm{rank}(A)$, and an accuracy parameter $0 < \varepsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which

$$\|A - UU^{T}A\|_F^2 \le (1+\varepsilon)\,\|A - A_k\|_F^2,$$

where $A_k \in \mathbb{R}^{m \times n}$ is the best rank-$k$ approximation to $A$. Our contributions are summarized as follows:

1. In the arbitrary partition distributed model of Kannan et al. (COLT 2014), each of $s$ machines holds a matrix $A_i$ and $A = \sum_i A_i$. Each machine should output $U$. Kannan et al. achieve $O(skm/\varepsilon) + \mathrm{poly}(sk/\varepsilon)$ words (of $O(\log(nm))$ bits) of communication. We obtain the improved bound of $O(skm) + \mathrm{poly}(sk/\varepsilon)$ words, and show an optimal (up to low order terms) $\Omega(skm)$ lower bound. This resolves an open question in the literature. A $\mathrm{poly}(\varepsilon^{-1})$ dependence is known to be required, but we separate this dependence from $m$.

2. In a more specific distributed model where each server receives a subset of columns of $A$, we bypass the above lower bound when $A$ is $\varphi$-sparse in each column. Here we obtain an $O(sk\varphi/\varepsilon) + \mathrm{poly}(sk/\varepsilon)$ word protocol. Our communication is independent of the matrix dimensions, and achieves the guarantee that each server, in addition to outputting $U$, outputs a subset of $O(k/\varepsilon)$ columns of $A$ containing a $U$ in its span (that is, for the first time, we solve distributed column subset selection). Additionally, we show a matching $\Omega(sk\varphi/\varepsilon)$ lower bound for distributed column subset selection. Achieving our communication bound when $A$ is sparse in general but not sparse in each column is impossible.

3. In the streaming model of computation, in which the columns of the matrix $A$ arrive one at a time, an algorithm of Liberty (KDD, 2013) with an improved analysis by Ghashami and Phillips (SODA, 2014) achieves $O(km/\varepsilon)$ "real numbers" space complexity. We improve this result, since our one-pass streaming PCA algorithm achieves an $O(km/\varepsilon) + \mathrm{poly}(k/\varepsilon)$ word space upper bound. This almost matches a known $\Omega(km/\varepsilon)$ bit lower bound of Woodruff (NIPS, 2014). We show that with two passes over the columns of $A$ one can achieve an $O(km) + \mathrm{poly}(k/\varepsilon)$ word space upper bound; another lower bound of Woodruff (NIPS, 2014) shows that this is optimal for any constant number of passes (up to the $\mathrm{poly}(k/\varepsilon)$ term and the distinction between words versus bits).

4. Finally, in turnstile streams, in which we receive entries of $A$ one at a time in an arbitrary order, we describe an algorithm with $O((m+n)k\varepsilon^{-1})$ words of space. This improves the $O((m + n\varepsilon^{-2})k\varepsilon^{-2})$ upper bound of Clarkson and Woodruff (STOC 2009), and matches their $\Omega((m+n)k\varepsilon^{-1})$ word lower bound.

Notably, our results do not depend on the condition number or any singular value gaps of $A$.
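The PCA guarantee and the arbitrary partition model can be made concrete with a small numerical sketch. The NumPy code below is illustrative only and is not the protocol of this paper or of Kannan et al.: it checks the $(1+\varepsilon)$ condition for a candidate $U$, splits $A$ into $s$ summands, and runs a naive two-round "sketch and sum" baseline that relies only on linearity ($\sum_i A_i S = AS$). The helper names, the Gaussian sketch `S`, and the sketch width `t` are assumptions chosen for illustration.

```python
import numpy as np

def pca_error(A, U):
    """Squared Frobenius error ||A - U U^T A||_F^2 of projecting A onto span(U)."""
    return np.linalg.norm(A - U @ (U.T @ A), "fro") ** 2

def best_rank_k_error(A, k):
    """||A - A_k||_F^2: the sum of squared singular values beyond the k-th."""
    sv = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(sv[k:] ** 2))

def satisfies_guarantee(A, U, k, eps):
    """Check ||A - U U^T A||_F^2 <= (1 + eps) * ||A - A_k||_F^2."""
    return pca_error(A, U) <= (1 + eps) * best_rank_k_error(A, k)

rng = np.random.default_rng(0)
m, n, k, s, eps = 40, 300, 3, 4, 0.5

# A noisy rank-k matrix, so a rank-k subspace captures most of its mass.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 0.1 * rng.standard_normal((m, n))

# Arbitrary partition model: machine i holds A_i and A is the sum of the A_i.
parts = [rng.standard_normal((m, n)) for _ in range(s - 1)]
parts.append(A - sum(parts))

# Reference answer: the top-k left singular vectors of A are optimal.
U_exact = np.linalg.svd(A, full_matrices=False)[0][:, :k]

# Naive baseline (an assumption, not the paper's protocol).
t = int(np.ceil(k / eps)) + k           # hypothetical sketch width
S = rng.standard_normal((n, t))         # shared random sketch, e.g. from a common seed
AS = sum(Ai @ S for Ai in parts)        # round 1: each machine sends the m x t matrix A_i S
Q, _ = np.linalg.qr(AS)                 # orthonormal basis (m x t) for the range of A S
B = sum(Q.T @ Ai for Ai in parts)       # round 2: each machine sends the t x n matrix Q^T A_i
W = np.linalg.svd(B, full_matrices=False)[0][:, :k]
U_sketch = Q @ W                        # m x k matrix with orthonormal columns

opt = best_rank_k_error(A, k)
print("exact  ratio:", pca_error(A, U_exact) / opt)
print("sketch ratio:", pca_error(A, U_sketch) / opt)
print("sketch meets (1+eps) bound:", satisfies_guarantee(A, U_sketch, k, eps))
```

Note that round 2 of this baseline sends $t \times n$ words per machine, so its communication grows with $n$. It is meant only to show what the output $U$ must satisfy and how linearity is exploited when $A = \sum_i A_i$; it does not attain the bounds stated in the abstract.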


