Random Sampling and Size Estimation Over Cyclic Joins

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory

Pub Date: 2020-01-01

DOI: 10.4230/LIPIcs.ICDT.2020.7

Y. Chen, K. Yi

Citations: 16

Abstract

Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [7] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN/OUT), where IN is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.

2012 ACM Subject Classification: Theory of computation → Database theory
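The connection between uniform sampling and size estimation mentioned in the abstract can be illustrated with a naive rejection sampler. The sketch below is not the paper's algorithm: it handles only the triangle join R(a,b), S(b,c), T(a,c), and it proposes candidates from R × S, so its expected number of trials is |R||S|/OUT rather than the O(IN/OUT) obtained with the AGM bound (for the triangle join, IN = sqrt(|R||S||T|)). The function names `sample_triangle` and `estimate_size` are hypothetical, and the relations are assumed to be duplicate-free lists of pairs.

```python
import random

def sample_triangle(R, S, T, rng, max_trials=1_000_000):
    """Naive rejection sampler for the triangle join R(a,b), S(b,c), T(a,c).

    Each trial picks one tuple of R and one of S uniformly at random and
    accepts if they join and the closing edge is in T. Every triangle is
    accepted with probability 1 / (|R| * |S|) per trial, so the accepted
    result is uniform over the join output.
    Returns (triangle, trials_used), or (None, max_trials) on failure.
    """
    T_set = set(T)
    for trial in range(1, max_trials + 1):
        a, b = rng.choice(R)
        b2, c = rng.choice(S)
        if b == b2 and (a, c) in T_set:
            return (a, b, c), trial
    return None, max_trials

def estimate_size(R, S, T, rng, trials=20_000):
    """Estimate OUT as (proposal space size) * (observed acceptance rate)."""
    T_set = set(T)
    hits = 0
    for _ in range(trials):
        a, b = rng.choice(R)
        b2, c = rng.choice(S)
        if b == b2 and (a, c) in T_set:
            hits += 1
    return len(R) * len(S) * hits / trials
```

The same trial loop serves both purposes: the first accepted trial is a uniform sample, and the acceptance rate scaled by the size of the proposal space is an unbiased estimate of OUT. A tighter proposal space (such as one of size IN) would reduce the expected number of trials without changing this reasoning.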



