ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others

Marcel Blöcher, Emilio Coppa, Pascal Kleber, P. Eugster, W. Culhane, Masoud Saeida Ardekani
{"title":"ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others","authors":"Marcel Blöcher, Emilio Coppa, Pascal Kleber, P. Eugster, W. Culhane, Masoud Saeida Ardekani","doi":"10.1145/3516430","DOIUrl":null,"url":null,"abstract":"Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current “online” approaches that store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top-k). The user can either provide minimal information that allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real-world data to experimentally demonstrate speedups up to 3× over single-level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computer Systems (TOCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3516430","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the potential for optimization in such phases. The lack of optimization is particularly notable in current “online” approaches that store data in main memory across nodes, shifting the bottleneck away from disk I/O toward network and compute resources, thus increasing the relative performance impact of distributed aggregation phases. We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics based primarily on basic knowledge of aggregation functions combined with deployment constraints to efficiently aggregate results from computations performed on individual data subsets across nodes (e.g., merging sorted lists resulting from top-k). The user can either provide minimal information that allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. We use real-world data to experimentally demonstrate speedups up to 3× over single-level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
罗马:所有的叠加都会导致聚合,但有些会比其他更快
聚合在数据分析中很常见,对于从大型数据集中提取信息至关重要,但当前的数据分析框架并没有充分利用这些阶段的优化潜力。当前的“在线”方法将数据跨节点存储在主存中,将瓶颈从磁盘I/O转移到网络和计算资源,从而增加了分布式聚合阶段的相对性能影响,这种方法尤其缺乏优化。我们提出了ROME,一个用于数据分析框架或单独使用的聚合系统。ROME使用一组新颖的启发式方法,主要基于聚合函数的基本知识,结合部署约束,有效地聚合跨节点对单个数据子集执行的计算结果(例如,合并由top-k产生的排序列表)。用户可以提供最少的信息,以便直接应用我们的启发式算法,或者ROME可以以很少的成本自动检测相关信息。我们将ROME作为一个子系统集成到Spark和Flink数据分析框架中。我们使用真实世界的数据来实验证明,与单级聚合叠加相比,加速高达3倍,与其他多级叠加相比,加速高达21%,对于像梯度下降这样的迭代算法,在100次迭代中加速高达50%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Boosting Inter-process Communication with Architectural Support H-Container: Enabling Heterogeneous-ISA Container Migration in Edge Computing ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others The Role of Compute in Autonomous Micro Aerial Vehicles: Optimizing for Mission Time and Energy Efficiency An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1