DHive: Query Execution Performance Analysis via Dataflow in Apache Hive

IF 2.6 3区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Proceedings of the Vldb Endowment Pub Date : 2023-08-01 DOI:10.14778/3611540.3611605
Chaozu Zhang, Qiaomu Shen, Bo Tang
{"title":"DHive: Query Execution Performance Analysis via Dataflow in Apache Hive","authors":"Chaozu Zhang, Qiaomu Shen, Bo Tang","doi":"10.14778/3611540.3611605","DOIUrl":null,"url":null,"abstract":"Nowadays, Apache Hive has been widely used for large-scale data analysis applications in many organizations. Various visual analytical tools are developed to help Hive users quickly analyze the query execution process and identify the performance bottleneck of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) but fail to provide enough evidence to analyze the root reasons for the slow execution progress. To tackle this problem, we develop a visual analytical system DHive to visualize and analyze the query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels: query level, job level and task level, which enable users to identify the key jobs/tasks and explain their time usage by linking them to the auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive by two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":"222 1","pages":"0"},"PeriodicalIF":2.6000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Vldb Endowment","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14778/3611540.3611605","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Nowadays, Apache Hive has been widely used for large-scale data analysis applications in many organizations. Various visual analytical tools are developed to help Hive users quickly analyze the query execution process and identify the performance bottleneck of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) but fail to provide enough evidence to analyze the root reasons for the slow execution progress. To tackle this problem, we develop a visual analytical system DHive to visualize and analyze the query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels: query level, job level and task level, which enable users to identify the key jobs/tasks and explain their time usage by linking them to the auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive by two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Hive: Apache Hive中基于数据流的查询执行性能分析
如今,Apache Hive已被广泛用于许多组织的大规模数据分析应用程序。Hive开发了各种可视化分析工具,帮助用户快速分析查询执行过程,识别执行查询的性能瓶颈。但是,现有的工具主要侧重于显示查询子组件(作业和操作符)的时间使用情况,但无法提供足够的证据来分析执行进度缓慢的根本原因。为了解决这个问题,我们开发了一个可视化分析系统hive,通过数据流分析对查询执行过程进行可视化分析。hive在多个级别显示查询执行过程中的数据流:查询级别、作业级别和任务级别,用户可以通过将关键的作业/任务与系统配置和硬件状态等辅助信息联系起来,从而识别关键的作业/任务并解释其时间使用情况。我们通过一个生产集群中的两个案例来演示hive的有效性。hive是开源的,网址是https://github.com/DBGroup-SUSTech/DHive.git。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Proceedings of the Vldb Endowment
Proceedings of the Vldb Endowment Computer Science-General Computer Science
CiteScore
7.70
自引率
0.00%
发文量
95
期刊介绍: The Proceedings of the VLDB (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.
期刊最新文献
Auditory Brainstem Response in a Child with Mitochondrial Disorder-Leigh Syndrome. Breathing New Life into an Old Tree: Resolving Logging Dilemma of B + -tree on Modern Computational Storage Drives QO-Insight: Inspecting Steered Query Optimizers A Learned Query Rewrite System Demonstrating ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Joins via Reinforcement Learning
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1