Mergeable summaries
P. Agarwal, Graham Cormode, Zengfeng Huang, J. M. Phillips, Zhewei Wei, K. Yi
ACM Transactions on Database Systems, published 2013-11-01. DOI: 10.1145/2500128 (https://doi.org/10.1145/2500128)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ϵ-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ϵ); for ϵ-approximate quantiles, there is a deterministic summary of size O((1/ϵ) log(ϵn)) that has a restricted form of mergeability, and a randomized one of size O((1/ϵ) log^{3/2}(1/ϵ)) with full mergeability. We also extend our results to geometric summaries such as ϵ-approximations, which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.
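To make the notion of mergeability for heavy hitters concrete, the following is a minimal Python sketch of one way to merge two Misra-Gries summaries while keeping the size at k counters. It is an illustration under stated assumptions, not a transcription of the article's algorithm: the function names (mg_update, mg_build, mg_merge) are hypothetical, and the pruning rule (add matching counters, subtract the (k+1)-th largest value, drop non-positive counters) is assumed here as the merge step.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Process one item with a Misra-Gries summary holding at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # Summary is full: decrement every counter and drop those reaching zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_build(stream, k):
    """Build a summary of at most k counters from an iterable of items."""
    summary = {}
    for item in stream:
        mg_update(summary, item, k)
    return summary


def mg_merge(s1, s2, k):
    """Merge two summaries into one with at most k counters (assumed rule)."""
    combined = Counter(s1)
    combined.update(s2)  # adds the counts of items present in both summaries
    if len(combined) <= k:
        return dict(combined)
    # Value of the (k+1)-th largest counter; subtract it from every counter.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {x: c - threshold for x, c in combined.items() if c > threshold}


if __name__ == "__main__":
    left = ["a"] * 40 + ["b"] * 25 + [f"x{i}" for i in range(30)]
    right = ["b"] * 35 + ["c"] * 20 + [f"y{i}" for i in range(30)]
    k = 4
    merged = mg_merge(mg_build(left, k), mg_build(right, k), k)
    print(merged)
```

Each counter in the merged summary underestimates the true frequency. In this sketch the underestimate is at most (n - n')/(k + 1), where n is the total number of items and n' is the sum of the merged counters, so choosing k on the order of 1/ϵ gives an ϵn error with O(1/ϵ) space.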
We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ϵ-approximate quantiles that depends only on ϵ, of size O((1/ϵ) log^{3/2}(1/ϵ)), and (2) we demonstrate that the MG (Misra-Gries) and the SpaceSaving summaries for heavy hitters are isomorphic.
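The isomorphism between the two heavy-hitter summaries can be checked empirically. The sketch below assumes one particular form of the correspondence, which the abstract does not spell out: a SpaceSaving summary with k + 1 counters, after subtracting its minimum counter (taken as 0 while the summary is not yet full) and dropping zeros, equals the MG summary with k counters on the same stream. All function names are hypothetical.

```python
def mg_update(summary, item, k):
    """Misra-Gries update with at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # Full: decrement all counters, removing any that reach zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def ss_update(summary, item, size):
    """SpaceSaving update with at most `size` counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < size:
        summary[item] = 1
    else:
        # Full: evict an item with the minimum count; the new item inherits it.
        victim = min(summary, key=summary.get)
        summary[item] = summary.pop(victim) + 1


def ss_to_mg(summary, size):
    """Subtract the minimum counter (0 if not full) and drop zero entries."""
    low = min(summary.values()) if len(summary) == size else 0
    return {x: c - low for x, c in summary.items() if c > low}


def check_isomorphism(stream, k):
    """Return True if the assumed mapping holds after every prefix of the stream."""
    mg, ss = {}, {}
    for item in stream:
        mg_update(mg, item, k)
        ss_update(ss, item, k + 1)
        if ss_to_mg(ss, k + 1) != mg:
            return False
    return True


if __name__ == "__main__":
    print(check_isomorphism("abacabadabacabaeffgabcc", k=3))
```

Ties among minimum SpaceSaving counters are broken arbitrarily here; under the assumed mapping such items map to a count of zero, so the choice of eviction victim does not affect the comparison.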
About the journal:
Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.