Random Sampling and Size Estimation Over Cyclic Joins
Authors: Y. Chen, K. Yi
Venue: International Conference on Database Theory (ICDT), volume 81 1, pages 7:1-7:18
Published: 2020-01-01
DOI: 10.4230/LIPIcs.ICDT.2020.7 (https://doi.org/10.4230/LIPIcs.ICDT.2020.7)
Citations: 16
Abstract
Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [7] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN/OUT), where IN is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.

2012 ACM Subject Classification: Theory of computation → Database theory
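To make the link between uniform sampling and size estimation concrete, here is a minimal Python sketch on the triangle join R(A,B) ⋈ S(B,C) ⋈ T(A,C). It is not the algorithm of the paper and does not achieve the O(IN/OUT) bound. It is a plain rejection sampler whose per-trial upper bound is |R| · max_deg, where max_deg is the largest number of S-tuples sharing one B value. The relations and the names `preprocess`, `trial`, `sample_join`, and `estimate_size` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy instance of the triangle join R(A,B) ⋈ S(B,C) ⋈ T(A,C).
R = [(1, 1), (1, 2), (2, 1), (2, 2)]
S = [(1, 1), (1, 2), (2, 1)]
T = [(1, 1), (1, 2), (2, 2)]


def preprocess(S, T):
    """Linear-time preprocessing: index S by B and hash T for membership tests."""
    s_index = defaultdict(list)
    for b, c in S:
        s_index[b].append(c)
    max_deg = max((len(cs) for cs in s_index.values()), default=0)
    return dict(s_index), max_deg, set(T)


def trial(R, index, rng):
    """One attempt: return a join result (a, b, c), or None on rejection."""
    s_index, max_deg, t_set = index
    a, b = rng.choice(R)                      # probability 1/|R|
    cs = s_index.get(b, [])
    # Keep b with probability deg(b)/max_deg, so that after picking c
    # uniformly among deg(b) candidates every (a, b, c) has the same
    # probability 1/(|R| * max_deg) of being proposed.
    if not cs or rng.random() >= len(cs) / max_deg:
        return None
    c = rng.choice(cs)                        # probability 1/deg(b)
    return (a, b, c) if (a, c) in t_set else None


def sample_join(R, index, rng, max_trials=10_000):
    """Repeat trials until one succeeds; the result is uniform over the join."""
    for _ in range(max_trials):
        result = trial(R, index, rng)
        if result is not None:
            return result
    return None  # join is empty or extremely sparse relative to the bound


def estimate_size(R, index, rng, trials):
    """Estimate OUT as (acceptance rate) * (upper bound |R| * max_deg)."""
    _, max_deg, _ = index
    hits = sum(trial(R, index, rng) is not None for _ in range(trials))
    return hits / trials * len(R) * max_deg


rng = random.Random(0)
index = preprocess(S, T)
print(sample_join(R, index, rng))                   # one uniform join result
print(estimate_size(R, index, rng, trials=20_000))  # estimate of OUT
```

Each trial succeeds with probability OUT divided by the upper bound, so the expected number of trials per sample is bound/OUT, and the observed acceptance rate times the bound is an unbiased estimate of OUT. This mirrors the shape of the O(IN/OUT) statement in the abstract, with the looser bound |R| · max_deg standing in for the AGM bound.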