端点偏置样本的连接基数估计

Cristian Estan, J. Naughton
{"title":"端点偏置样本的连接基数估计","authors":"Cristian Estan, J. Naughton","doi":"10.1109/ICDE.2006.61","DOIUrl":null,"url":null,"abstract":"We present a new technique for using samples to estimate join cardinalities. This technique, which we term \"end-biased samples,\" is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting ― while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed \"skimmed sketches\" and show that neither dominates the other, that each has different and compelling strengths and weaknesses. These results suggest that endbiased samples may be a useful addition to the repertoire of techniques used for data summarization.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"1 1","pages":"20-20"},"PeriodicalIF":0.0000,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":"{\"title\":\"End-biased Samples for Join Cardinality Estimation\",\"authors\":\"Cristian Estan, J. Naughton\",\"doi\":\"10.1109/ICDE.2006.61\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a new technique for using samples to estimate join cardinalities. This technique, which we term \\\"end-biased samples,\\\" is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting ― while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed \\\"skimmed sketches\\\" and show that neither dominates the other, that each has different and compelling strengths and weaknesses. These results suggest that endbiased samples may be a useful addition to the repertoire of techniques used for data summarization.\",\"PeriodicalId\":6819,\"journal\":{\"name\":\"22nd International Conference on Data Engineering (ICDE'06)\",\"volume\":\"1 1\",\"pages\":\"20-20\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"60\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"22nd International Conference on Data Engineering (ICDE'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2006.61\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"22nd International Conference on Data Engineering (ICDE'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2006.61","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 60

摘要

我们提出了一种使用样本估计连接基数的新技术。这种技术,我们称之为“端偏样本”,是受到最近网络流量测量工作的启发。它通过使用协调伪随机样本并按频率比例保留采样值来改进随机样本。我们表明,端偏样本总是比具有相同样本量的随机样本提供更准确的估计。与直方图的比较更有趣——虽然对于不相关的数据集,端偏直方图比端偏样本要好一些,但当数据相关时,端偏样本占主导地位。最后,我们将端偏样本与最近提出的“略读草图”进行了比较,并表明两者都不占主导地位,两者都有不同且引人注目的优势和劣势。这些结果表明,端偏样本可能是一个有用的补充,用于数据汇总技术的曲目。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
End-biased Samples for Join Cardinality Estimation
We present a new technique for using samples to estimate join cardinalities. This technique, which we term "end-biased samples," is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting ― while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed "skimmed sketches" and show that neither dominates the other, that each has different and compelling strengths and weaknesses. These results suggest that endbiased samples may be a useful addition to the repertoire of techniques used for data summarization.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
An Approach to Adaptive Memory Management in Data Stream Systems Revision Processing in a Stream Processing Engine: A High-Level Design SUBSKY: Efficient Computation of Skylines in Subspaces How to Determine a Good Multi-Programming Level for External Scheduling Warehousing and Analyzing Massive RFID Data Sets
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1