Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. Pub Date: 2016-02-08. DOI: 10.1145/2835776.2835841

Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref


Citations: 14

Abstract

Despite the importance and widespread use of range data (e.g., time intervals and spatial ranges), little attention has been devoted to studying how range data is processed and queried in the context of big data. The main challenge lies in the nature of traditional index structures such as the B-Tree and R-Tree: they are centralized by design and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes query execution time. Kangaroo is query-workload-aware, i.e., it produces partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.
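
To make the idea of workload-aware, non-overlapping partitioning concrete, the following is a minimal Python sketch. It is not one of Kangaroo's four partitioning schemes, which the abstract does not describe. It assumes a one-dimensional setting in which each interval is assigned to exactly one partition by its start point, each partition records its smallest start and largest end, and I/O cost is approximated by the number of records in every partition a query must scan. All names (`build_partitions`, `scan_cost`, `pick_layout`, and so on) are hypothetical.

```python
from bisect import bisect_right


def build_partitions(intervals, boundaries):
    """Assign each (start, end) interval to exactly one partition by start point.

    `boundaries` is a sorted list of split points (an assumption of this sketch).
    Every interval lands in one partition, so partitions share no records.
    """
    parts = [{"records": [], "min_start": None, "max_end": None}
             for _ in range(len(boundaries) + 1)]
    for s, e in intervals:
        p = parts[bisect_right(boundaries, s)]
        p["records"].append((s, e))
        p["min_start"] = s if p["min_start"] is None else min(p["min_start"], s)
        p["max_end"] = e if p["max_end"] is None else max(p["max_end"], e)
    return [p for p in parts if p["records"]]  # drop empty partitions


def must_scan(part, query):
    """A partition can be skipped only if no interval in it can overlap the query."""
    qs, qe = query
    return part["min_start"] <= qe and part["max_end"] >= qs


def range_query(parts, query):
    """Return all intervals overlapping the query range, scanning only needed partitions."""
    qs, qe = query
    return [(s, e) for p in parts if must_scan(p, query)
            for s, e in p["records"] if s <= qe and e >= qs]


def scan_cost(parts, query):
    """Approximate I/O as the number of records in the partitions that are scanned."""
    return sum(len(p["records"]) for p in parts if must_scan(p, query))


def workload_cost(parts, workload):
    """Total approximate I/O over a whole query workload."""
    return sum(scan_cost(parts, q) for q in workload)


def pick_layout(intervals, candidate_boundaries, workload):
    """Choose the candidate layout with the lowest approximate I/O for the workload."""
    return min(candidate_boundaries,
               key=lambda b: workload_cost(build_partitions(intervals, b), workload))


if __name__ == "__main__":
    intervals = [(0, 2), (1, 3), (4, 5), (6, 9), (7, 8), (10, 12)]
    workload = [(6, 8), (7, 9)]           # queries concentrated around 6..9
    candidates = [[5], [5, 9]]            # two hypothetical layouts
    best = pick_layout(intervals, candidates, workload)
    print(best, workload_cost(build_partitions(intervals, best), workload))
```

In this toy setting the layout with a split at 9 isolates the interval starting at 10, so the queries around 6..9 scan fewer records. A real system would also have to account for partition sizes relative to HDFS block sizes and for how candidate layouts are generated, which this sketch leaves out.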
