Heavy Hitters via Cluster-Preserving Clustering

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). Pub Date: 2016-04-05. DOI: 10.1145/3339185

Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, M. Thorup


Citations: 77

Abstract

In the turnstile ℓ_p heavy hitters problem with parameter ε, one must maintain a high-dimensional vector x ∈ ℝⁿ subject to updates of the form update(i, Δ) causing the change x_i ← x_i + Δ, where i ∈ [n], Δ ∈ ℝ. Upon receiving a query, the goal is to report every "heavy hitter" i ∈ [n] with |x_i| ≥ ε ∥x∥_p as part of a list L ⊆ [n] of size O(1/ε^p), i.e. proportional to the maximum possible number of heavy hitters. For any p ∈ (0,2] the COUNTSKETCH of [CCFC04] solves ℓ_p heavy hitters using O(ε^{-p} lg n) words of space with O(lg n) update time, O(n lg n) query time to output L, and whose output after any query is correct with high probability (whp) 1 - 1/poly(n) [JST11, Section 4.4]. This space bound is optimal even in the strict turnstile model [JST11] in which it is promised that x_i ≥ 0 for all i ∈ [n] at all points in the stream, but unfortunately the query time is very slow. To remedy this, the work [CM05] proposed the "dyadic trick" for the COUNTMIN sketch for p = 1 in the strict turnstile model, which to maintain whp correctness achieves suboptimal space O(ε^{-1} lg² n), worse update time O(lg² n), but much better query time O(ε^{-1} poly(lg n)). An extension to all p ∈ (0,2] appears in [KNPW11, Theorem 1], and can be obtained from [Pag13]. We show that this tradeoff between space and update time versus query time is unnecessary. We provide a new algorithm, EXPANDERSKETCH, which in the most general turnstile model achieves optimal O(ε^{-p} log n) space, O(log n) update time, and fast O(ε^{-p} poly(log n)) query time, providing correctness whp. In fact, a simpler version of our algorithm for p = 1 in the strict turnstile model answers queries even faster than the "dyadic trick" by roughly a log n factor, dominating it in all regards.

Our main innovation is an efficient reduction from the heavy hitters to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a much bigger graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm, that partitions the graph into clusters with the promise of not destroying any original cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel combinatorial techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our cluster-preserving clustering may be of broader interest much beyond heavy hitters.
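
To make the problem statement concrete, the following is a minimal Python sketch, not the paper's EXPANDERSKETCH. It shows turnstile updates, a brute-force check of the heavy hitter condition |x_i| ≥ ε ∥x∥_p, and a textbook-style COUNTSKETCH whose query scans all n indices, which is the slow query the abstract refers to. The names `lp_norm`, `exact_heavy_hitters`, `CountSketch`, and `slow_query` are hypothetical. The explicitly stored random hash tables and the "return the k largest estimates" query rule are simplifying assumptions made for illustration only.

```python
import random


def lp_norm(x, p):
    """Return the l_p norm (a quasi-norm when p < 1) of the vector x."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def exact_heavy_hitters(x, eps, p):
    """Brute force: every index i with x_i != 0 and |x_i| >= eps * ||x||_p."""
    threshold = eps * lp_norm(x, p)
    return [i for i, v in enumerate(x) if v != 0 and abs(v) >= threshold]


class CountSketch:
    """Illustrative CountSketch: `rows` rows of `width` signed counters.

    Assumption: hash values are stored in explicit tables of size n for
    clarity. A space-efficient implementation would use compact
    limited-independence hash functions instead.
    """

    def __init__(self, n, rows, width, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.rows = rows
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]
        self.table = [[0.0] * width for _ in range(rows)]

    def update(self, i, delta):
        """Turnstile update x_i <- x_i + delta (delta may be negative)."""
        for r in range(self.rows):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        """Point query: median over rows of the signed counter for i."""
        vals = sorted(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )
        return vals[len(vals) // 2]

    def slow_query(self, k):
        """Scan all n indices and return the k with the largest |estimate|.

        This linear scan is what makes the query slow; ties are broken
        by smaller index.
        """
        order = sorted(range(self.n), key=lambda i: (-abs(self.estimate(i)), i))
        return order[:k]


if __name__ == "__main__":
    x = [10, 0, 0, 1, -8, 1]
    print("exact heavy hitters:", exact_heavy_hitters(x, eps=0.3, p=1))

    sketch = CountSketch(n=len(x), rows=5, width=8, seed=1)
    for i, value in enumerate(x):
        sketch.update(i, value)
    print("sketch candidates:", sketch.slow_query(k=2))
```

The list size O(1/ε^p) follows from a counting argument: each heavy hitter contributes |x_i|^p ≥ ε^p ∥x∥_p^p to the sum of all |x_j|^p, which equals ∥x∥_p^p, so at most ε^{-p} indices can qualify. Because the sketch is linear in x, deletions cancel earlier insertions exactly, which is what lets it operate in the general turnstile model.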



