Approximating a data stream for querying and estimation: algorithms and performance evaluation

Proceedings 18th International Conference on Data Engineering Pub Date : 2002-08-07 DOI:10.1109/ICDE.2002.994775

S. Guha, Nick Koudas

{"title":"Approximating a data stream for querying and estimation: algorithms and performance evaluation","authors":"S. Guha, Nick Koudas","doi":"10.1109/ICDE.2002.994775","DOIUrl":null,"url":null,"abstract":"Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"109","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 18th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2002.994775","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 109

Abstract

Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

用于查询和估计的近似数据流:算法和性能评估

获取快速、高质量的数据分布近似值是数据库管理的核心问题。各种流行的数据库应用程序，包括大多数应用领域中的近似查询、相似度搜索和数据挖掘，都依赖于这种高质量的近似。在数据库理论和实践中，基于直方图的近似是一种非常流行的方法，它以一种节省空间的方式简洁地表示数据分布。在本文中，我们把直方图构造问题的角度，我们提出了一个有限的数据集和/或已知的数据集大小的要求，我们推广它。我们考虑一个无限数据集的情况，其中数据连续到达，形成无限数据流。在这种情况下，我们提出了能够构建可证明的高质量直方图的单遍算法。我们提出了用于基本直方图构建问题的固定窗口变体的算法，支持直方图的增量维护。所提出的算法以精度换取速度，并允许基于应用程序需求在两者之间进行适当的权衡。在无限数据流近似查询的情况下，我们提出了一个详细的实验评估，将我们的算法与使用真实数据集的其他适用技术进行比较，证明了我们建议的优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings 18th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Out from under the trees [linear file template] Declarative composition and peer-to-peer provisioning of dynamic Web services Multivariate time series prediction via temporal classification Integrating workflow management systems with business-to-business interaction standards YFilter: efficient and scalable filtering of XML documents