Streaming Lower Bounds and Asymmetric Set-Disjointness
Shachar Lovett, Jiapeng Zhang
Electron. Colloquium Comput. Complex., 2023-01-13
DOI: 10.48550/arXiv.2301.05658
Citations: 1
Abstract
Frequency estimation in data streams is one of the classical problems in streaming algorithms. Following much research, there are now almost matching upper and lower bounds for the trade-off between the number of samples and the space complexity of the algorithm when the data streams are adversarial. However, when the data stream is given in a random order, or is stochastic, only weaker lower bounds exist. In this work we close this gap, up to logarithmic factors. To do so we consider the needle problem, a natural hard problem for frequency estimation studied by Andoni et al. (2008) and Crouch et al. (2016). Here, the goal is to distinguish between two distributions over data streams with $t$ samples. The first is uniform over a large enough domain. The second is a planted model: a secret "needle" is chosen uniformly, and then each element in the stream equals the needle with probability $p$, and otherwise is chosen uniformly from the domain. It is simple to design streaming algorithms that distinguish the distributions using space $s \approx 1/(p^2 t)$. It was unclear whether this is tight, as the existing lower bounds are weaker. We show that this trade-off is near optimal, up to a logarithmic factor. Our proof builds on and extends classical connections between streaming algorithms and communication complexity, concretely multi-party unique set-disjointness. We introduce two new ingredients that allow us to prove sharp bounds. The first is a lower bound for an asymmetric version of multi-party unique set-disjointness, where players receive input sets of different sizes, and where the communication of each player is normalized relative to their input length. The second is a combinatorial technique that allows us to sample needles in the planted model by first sampling intervals, and then sampling a uniform needle in each interval.
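To make the two distributions and the space bound $s \approx 1/(p^2 t)$ concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the construction from the paper: the function names (`sample_uniform`, `sample_planted`, `has_repeat_within_block`) and the block-based distinguisher are hypothetical choices made here. The distinguisher stores one block of $s$ consecutive elements at a time and reports whether any value repeats inside a block. Heuristically, a block contains about $(ps)^2/2$ pairs of needle copies in expectation, and there are $t/s$ blocks, so roughly $p^2 s t / 2$ repeats are expected in the planted model; this becomes constant once $s$ is on the order of $1/(p^2 t)$. In the uniform model repeats inside a block are rare as long as the domain size $n$ is much larger than $s t$.

```python
import random


def sample_uniform(t, n, rng):
    """Stream of t elements, each uniform over the domain {0, ..., n-1}."""
    return [rng.randrange(n) for _ in range(t)]


def sample_planted(t, n, p, rng):
    """Planted model: a secret needle is chosen uniformly; each element
    equals the needle with probability p, otherwise it is uniform."""
    needle = rng.randrange(n)
    return [needle if rng.random() < p else rng.randrange(n) for _ in range(t)]


def has_repeat_within_block(stream, s):
    """Hypothetical distinguisher (an assumption of this sketch).

    Processes the stream in consecutive blocks of s elements, keeping
    only the current block in memory (about s domain elements), and
    returns True if some value occurs twice inside a single block.
    """
    block = set()
    for i, x in enumerate(stream):
        if i % s == 0:
            block.clear()  # start a new block; forget the previous one
        if x in block:
            return True
        block.add(x)
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    t, n, p = 10_000, 10**12, 0.05
    s = int(4 / (p * p * t)) + 2  # a small constant times 1/(p^2 t)
    trials = 200
    planted_hits = sum(
        has_repeat_within_block(sample_planted(t, n, p, rng), s)
        for _ in range(trials)
    )
    uniform_hits = sum(
        has_repeat_within_block(sample_uniform(t, n, rng), s)
        for _ in range(trials)
    )
    print(f"block size s = {s}")
    print(f"planted streams flagged: {planted_hits}/{trials}")
    print(f"uniform streams flagged: {uniform_hits}/{trials}")
```

Running the script compares how often the block test fires on planted versus uniform streams; the constant 4 in the choice of `s` is arbitrary and only serves to make the gap visible in a small experiment.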