Truly Perfect Samplers for Data Streams and Sliding Windows
Rajesh Jayaram, David P. Woodruff, Samson Zhou
Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Published: 2021-08-26
DOI: https://doi.org/10.1145/3517804.3524139
Citations: 6
Abstract
In the $G$-sampling problem, the goal is to output an index $i$ of a vector $f \in \mathbb{R}^n$, such that for all coordinates $j \in [n]$,
\[ \Pr[i = j] = (1 \pm \varepsilon)\,\frac{G(f_j)}{\sum_{k \in [n]} G(f_k)} + \gamma, \]
where $G: \mathbb{R} \to \mathbb{R}_{\ge 0}$ is some non-negative function. If $\varepsilon = 0$ and $\gamma = 1/\mathrm{poly}(n)$, the sampler is called perfect. In the data stream model, $f$ is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect $L_p$ samplers in turnstile streams, where $G(x) = |x|^p$, using $\mathrm{polylog}(n)$ space for $p \in (0, 2]$. However, to date all known sampling algorithms are not truly perfect, since their output distribution is only point-wise $\gamma = 1/\mathrm{poly}(n)$ close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and leak potentially sensitive information about the data stream. In this work, we initiate the study of truly perfect samplers, with $\varepsilon = \gamma = 0$, and comprehensively investigate their complexity in the data stream and sliding window models. We begin by showing that sublinear space truly perfect sampling is impossible in the turnstile model, by proving a lower bound of $\Omega(\min(n, \log 1/\gamma))$ for any $G$-sampler with point-wise error $\gamma$ from the true distribution. We then give a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models. As specific applications, our framework addresses $L_p$ sampling for all $p > 0$, e.g., $\tilde{O}(n^{1-1/p})$ space for $p \ge 1$, concave functions, and a large number of measure functions, including the $L_1$-$L_2$, Fair, Huber, and Tukey estimators. The update time of our truly perfect $L_p$-samplers is $O(1)$, which is an exponential improvement over the running time of previous perfect $L_p$-samplers.
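To make the definition concrete, the following Python sketch shows one way a sampler with $\varepsilon = \gamma = 0$ can be obtained in an insertion-only stream of unit updates. The abstract does not describe the framework's internals, so the construction here is an assumption chosen for illustration: reservoir-sample a uniformly random update, count how often that item appears from the sampled position onward, and accept with probability proportional to the increment $G(c) - G(c-1)$. The function names, the parameter `zeta` (an upper bound on any single increment of $G$), and the restriction to integer-valued, nondecreasing $G$ with $G(0) = 0$ are all hypothetical choices. Summing the acceptance probabilities for one item telescopes to $G(f_j)/(m \cdot \zeta)$ for a stream of length $m$, so conditioned on not failing, the output follows the target distribution exactly.

```python
import random
from collections import Counter
from fractions import Fraction


def target_distribution(stream, G):
    """Exact G-sampling distribution (epsilon = gamma = 0) over the stream's items."""
    freq = Counter(stream)
    total = sum(Fraction(G(c)) for c in freq.values())
    return {item: Fraction(G(c)) / total for item, c in freq.items()}


def sample_once(stream, G, zeta, rng):
    """One pass over an insertion-only stream; returns an item or None on failure.

    Assumes G is integer-valued and nondecreasing with G(0) = 0, and that
    zeta >= G(c) - G(c - 1) for every count c that can occur (illustrative assumptions).
    """
    sample, count = None, 0
    for t, item in enumerate(stream, start=1):
        if rng.randrange(t) == 0:      # replace the reservoir with probability exactly 1/t
            sample, count = item, 1
        elif item == sample:           # count repeats after the sampled position
            count += 1
    if sample is None:                 # empty stream
        return None
    # Rejection step: accept with probability (G(count) - G(count - 1)) / zeta.
    if rng.randrange(zeta) < G(count) - G(count - 1):
        return sample
    return None


def exact_output_probs(stream, G, zeta):
    """Enumerate every reservoir position to get the sampler's exact output law."""
    m = len(stream)
    probs = {}
    for t, item in enumerate(stream):
        c = stream[t:].count(item)     # occurrences from position t onward
        accept = Fraction(G(c) - G(c - 1), zeta)
        probs[item] = probs.get(item, Fraction(0)) + Fraction(1, m) * accept
    return probs
```

For $L_2$ sampling, for example, one can take `G = lambda x: x * x` and `zeta = 2 * len(stream) - 1`, since no frequency exceeds the stream length. A single run may fail, so in practice several independent copies would be run and the first success reported. This sketch stores one item and one counter per copy and does not attempt to reproduce the space or update-time bounds stated in the abstract.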