Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop

Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref
{"title":"Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop","authors":"Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref","doi":"10.1145/2835776.2835841","DOIUrl":null,"url":null,"abstract":"Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2835776.2835841","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
袋鼠:Hadoop中范围数据和范围查询的工作负载感知处理
尽管距离数据(如时间间隔、空间范围等)的重要性和广泛应用,但在大数据背景下对距离数据的处理和查询的研究却很少受到关注。主要的挑战在于传统索引结构的本质,例如B-Tree和R-Tree,本质上是集中的,因此在分布式环境中部署时几乎是瘫痪的。为了解决这个问题,本文提出了Kangaroo,这是一个建立在Hadoop之上的系统,用于优化对范围数据的范围查询的执行。Kangaroo背后的主要思想是以最小化查询执行时间的方式将数据分割为不重叠的分区。袋鼠是查询工作负载敏感的,也就是说,它产生的分区布局可以最大限度地减少给定查询模式的查询处理时间。在本文中,我们研究了袋鼠在分布式文件系统(即HDFS)上部署时所面临的设计挑战。我们还研究了袋鼠可以支持的四种不同的分区方案。通过使用超过10亿条记录的实际范围数据和超过30,000条查询的实际查询工作负载进行大量实验,我们表明Kangaroo的分区方案可以显着减少范围数据上的范围查询I/O。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Beyond-Accuracy Goals, Again WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022 A Semantic Layer Querying Tool Multilingual and Multimodal Hate Speech Analysis in Twitter Designing the Cogno-Web Observatory: To Characterize the Dynamics of Online Social Cognition
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1