有效计算数据流上的偏分位数

21st International Conference on Data Engineering (ICDE'05) Pub Date : 2005-04-05 DOI:10.1109/ICDE.2005.55

Graham Cormode, Flip Korn, S. Muthukrishnan, D. Srivastava

{"title":"有效计算数据流上的偏分位数","authors":"Graham Cormode, Flip Korn, S. Muthukrishnan, D. Srivastava","doi":"10.1109/ICDE.2005.55","DOIUrl":null,"url":null,"abstract":"Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the \"high-biased\" and the \"targeted\" quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over highspeed data streams.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"195 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":"{\"title\":\"Effective computation of biased quantiles over data streams\",\"authors\":\"Graham Cormode, Flip Korn, S. Muthukrishnan, D. Srivastava\",\"doi\":\"10.1109/ICDE.2005.55\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the \\\"high-biased\\\" and the \\\"targeted\\\" quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over highspeed data streams.\",\"PeriodicalId\":297231,\"journal\":{\"name\":\"21st International Conference on Data Engineering (ICDE'05)\",\"volume\":\"195 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-04-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"50\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"21st International Conference on Data Engineering (ICDE'05)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2005.55\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"21st International Conference on Data Engineering (ICDE'05)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2005.55","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 50

摘要

在许多数据源(如IP流量流)中普遍存在偏差。为了不断地总结这些数据的分布，高偏差的分位数(例如，第50、90和99百分位数)在更高的等级(例如，分别为5.1%和0.1%的误差)上具有更精细的误差保证，比具有统一误差保证的均匀分布的分位数(例如，第25、50和75百分位数)更有用。在本文中，我们解决以下两个问题。首先，与以最优误差统一计算所有分位数相比，我们是否可以使用更少的空间和计算时间，有效地为数据分布的较高级别计算具有更精细误差保证的分位数?其次，如果预先请求特定的分位数及其误差范围，是否可以减少必要的空间使用和计算时间?我们通过将它们分别形式化为“高偏差”和“目标”分位数问题来肯定地回答这两个问题，并提出具有可证明保证的算法，这些算法的性能明显优于这些问题的先前已知解决方案。我们在Gigascope数据流管理系统中实现了我们的算法，并评估了维护相关摘要结构的替代方法。我们在真实和合成IP数据流上的实验结果补充了我们的理论分析，并强调了在高速数据流上维护摘要结构时轻量级、非阻塞实现的重要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Effective computation of biased quantiles over data streams

Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the "high-biased" and the "targeted" quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over highspeed data streams.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

21st International Conference on Data Engineering (ICDE'05)

自引率

0.00%

发文量

期刊最新文献

Proactive caching for spatial queries in mobile environments MoDB: database system for synthesizing human motion Integrating data from disparate sources: a mass collaboration approach ViteX: a streaming XPath processing system Efficient data management on lightweight computing devices