Optimal principal component analysis in distributed and streaming models

Christos Boutsidis, David P. Woodruff, Peilin Zhong
DOI: 10.1145/2897518.2897646 (https://doi.org/10.1145/2897518.2897646)
Venue: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing
Publication date: 2015-04-25
Citation count: 111

Abstract

This paper studies the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in \mathbb{R}^{m \times n}$, a rank parameter $k < \mathrm{rank}(A)$, and an accuracy parameter $0 < \varepsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $\|A - UU^{T}A\|_F^2 \le (1+\varepsilon)\|A - A_k\|_F^2$, where $A_k \in \mathbb{R}^{m \times n}$ is the best rank-$k$ approximation to $A$. Our contributions are summarized as follows:

1. In the arbitrary partition distributed model of Kannan et al. (COLT 2014), each of $s$ machines holds a matrix $A_i$ and $A = \sum_{i=1}^{s} A_i$. Each machine should output $U$. Kannan et al. achieve $O(skm/\varepsilon) + \mathrm{poly}(sk/\varepsilon)$ words of communication, where each word has $O(\log(nm))$ bits. We obtain the improved bound of $O(skm) + \mathrm{poly}(sk/\varepsilon)$ words, and show an optimal (up to low order terms) $\Omega(skm)$ lower bound. This resolves an open question in the literature. A $\mathrm{poly}(\varepsilon^{-1})$ dependence is known to be required, but we separate this dependence from $m$.

2. In a more specific distributed model where each server receives a subset of columns of $A$, we bypass the above lower bound when $A$ is $\varphi$-sparse in each column. Here we obtain an $O(sk\varphi/\varepsilon) + \mathrm{poly}(sk/\varepsilon)$ word protocol. Our communication is independent of the matrix dimensions, and achieves the guarantee that each server, in addition to outputting $U$, outputs a subset of $O(k/\varepsilon)$ columns of $A$ whose span contains such a $U$ (that is, for the first time, we solve distributed column subset selection). Additionally, we show a matching $\Omega(sk\varphi/\varepsilon)$ lower bound for distributed column subset selection. Achieving our communication bound when $A$ is sparse in general but not sparse in each column is impossible.

3. In the streaming model of computation, in which the columns of the matrix $A$ arrive one at a time, an algorithm of Liberty (KDD, 2013) with an improved analysis by Ghashami and Phillips (SODA, 2014) achieves $O(km/\varepsilon)$ "real numbers" space complexity. We improve on this result: our one-pass streaming PCA algorithm achieves an $O(km/\varepsilon) + \mathrm{poly}(k/\varepsilon)$ word space upper bound. This almost matches a known $\Omega(km/\varepsilon)$ bit lower bound of Woodruff (NIPS, 2014). We show that with two passes over the columns of $A$ one can achieve an $O(km) + \mathrm{poly}(k/\varepsilon)$ word space upper bound; another lower bound of Woodruff (NIPS, 2014) shows that this is optimal for any constant number of passes (up to the $\mathrm{poly}(k/\varepsilon)$ term and the distinction between words versus bits).

4. Finally, in turnstile streams, in which we receive entries of $A$ one at a time in an arbitrary order, we describe an algorithm with $O((m+n)k\varepsilon^{-1})$ words of space. This improves the $O((m + n\varepsilon^{-2})k\varepsilon^{-2})$ upper bound of Clarkson and Woodruff (STOC 2009), and matches their $\Omega((m+n)k\varepsilon^{-1})$ word lower bound.

Notably, our results do not depend on the condition number or any singular value gaps of $A$.
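
The accuracy guarantee in the problem statement can be checked numerically for a candidate U. The following is a minimal NumPy sketch under the assumption that the whole matrix fits in memory on one machine; it is not the paper's distributed or streaming algorithm, and the function names (`best_rank_k_error`, `projection_error`, `satisfies_pca_guarantee`) are hypothetical names chosen for illustration.

```python
import numpy as np


def best_rank_k_error(A: np.ndarray, k: int) -> float:
    """Return ||A - A_k||_F^2, the squared error of the best rank-k approximation."""
    singular_values = np.linalg.svd(A, compute_uv=False)
    # Eckart-Young: the optimal squared error is the sum of the squared
    # singular values beyond the k largest ones.
    return float(np.sum(singular_values[k:] ** 2))


def projection_error(A: np.ndarray, U: np.ndarray) -> float:
    """Return ||A - U U^T A||_F^2 for an m x k matrix U with orthonormal columns."""
    residual = A - U @ (U.T @ A)
    return float(np.linalg.norm(residual, "fro") ** 2)


def satisfies_pca_guarantee(A: np.ndarray, U: np.ndarray, k: int, eps: float) -> bool:
    """Check ||A - U U^T A||_F^2 <= (1 + eps) * ||A - A_k||_F^2."""
    return projection_error(A, U) <= (1.0 + eps) * best_rank_k_error(A, k)


# Illustrative data (assumption: a small random dense matrix).
rng = np.random.default_rng(0)
m, n, k, eps = 30, 50, 3, 0.1
A = rng.standard_normal((m, n))

# Exact solution: the top-k left singular vectors of A.
U_exact = np.linalg.svd(A, full_matrices=False)[0][:, :k]

# Baseline: an orthonormal basis of a random k-dimensional subspace.
U_random = np.linalg.qr(rng.standard_normal((m, k)))[0]

print("exact U meets guarantee: ", satisfies_pca_guarantee(A, U_exact, k, eps))
print("random U meets guarantee:", satisfies_pca_guarantee(A, U_random, k, eps))
```

The top-k left singular vectors always meet the bound, since they attain the optimal error itself. A random subspace can never beat that optimum and will typically miss the (1+ε) bound on unstructured data. The algorithms in the paper aim to produce a U that passes this check while using far less communication or space than a full SVD requires.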