Incremental MapReduce Computations

Pramod Bhatotia, Alexander Wieder, Umut A. Acar, R. Rodrigues
{"title":"Incremental MapReduce Computations","authors":"Pramod Bhatotia, Alexander Wieder, Umut A. Acar, R. Rodrigues","doi":"10.1201/b17112-5","DOIUrl":null,"url":null,"abstract":"Abstract Distributed processing of large data sets is an area that received much attention from researchers and practitioners over the last few years. In this context, several proposals exist that leverage the observation that data sets evolve over time, and as such there is often a substantial overlap between the input to consecutive runs of a data processing job. This allows the programmers of these systems to devise an e ffi cient logic to update the output upon an input change. However, most of these systems lack compatibility existing models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity. In this chapter, we describe our previous work on building a platform called Incoop, which allows for running MapReduce computations incrementally and transparently. Incoop detects changes between two files that are used as inputs to consecutive MapReduce jobs, and e ffi ciently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks, and reusing these results whenever possible. Doing this e ffi ciently introduces several technical challenges that are overcome with novel concepts, such as a large-scale storage system that e ffi ciently computes deltas between two inputs, a Contraction phase to break up the work of the Reduce phase, and an a ffi nity-based scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. 
Our results show significant performance improvements without changing a single line of application code.","PeriodicalId":448182,"journal":{"name":"Large Scale and Big Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Large Scale and Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1201/b17112-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Distributed processing of large data sets is an area that has received much attention from researchers and practitioners over the last few years. In this context, several proposals leverage the observation that data sets evolve over time, so there is often substantial overlap between the inputs to consecutive runs of a data-processing job. This overlap allows the programmers of these systems to devise efficient logic for updating the output when the input changes. However, most of these systems lack compatibility with existing programming models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity. In this chapter, we describe our previous work on building a platform called Incoop, which allows MapReduce computations to run incrementally and transparently. Incoop detects changes between two files used as inputs to consecutive MapReduce jobs, and efficiently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks and reusing these results whenever possible. Doing this efficiently introduces several technical challenges, which are overcome with novel concepts such as a large-scale storage system that efficiently computes deltas between two inputs, a Contraction phase that breaks up the work of the Reduce phase, and an affinity-based scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. Our results show significant performance improvements without changing a single line of application code.
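To make the memoization idea concrete, the following is a minimal illustrative sketch, not Incoop's actual implementation: map-task outputs are cached keyed by a content hash of each input chunk, so a second run over a mostly unchanged input recomputes only the chunks that differ. All names here (`IncrementalMapReduce`, `chunk_key`, the word-count functions) are hypothetical; a real system such as Incoop additionally memoizes sub-reductions via its Contraction phase and schedules tasks for cache affinity.

```python
import hashlib

def chunk_key(chunk: bytes) -> str:
    # Content-based key: identical chunks map to the same cache entry.
    return hashlib.sha256(chunk).hexdigest()

class IncrementalMapReduce:
    """Toy incremental MapReduce driver with task-level memoization."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self.memo = {}    # chunk hash -> memoized map output
        self.reused = 0   # how many chunks were served from the cache

    def run(self, chunks):
        # Map phase: recompute only chunks whose content hash is unseen.
        intermediate = []
        for chunk in chunks:
            key = chunk_key(chunk)
            if key in self.memo:
                self.reused += 1
            else:
                self.memo[key] = list(self.map_fn(chunk))
            intermediate.extend(self.memo[key])
        # Shuffle + reduce: group intermediate pairs by key, then reduce.
        groups = {}
        for k, v in intermediate:
            groups.setdefault(k, []).append(v)
        return {k: self.reduce_fn(k, vs) for k, vs in groups.items()}

# Word count, the classic MapReduce example.
def wc_map(chunk):
    for word in chunk.decode().split():
        yield (word, 1)

def wc_reduce(word, counts):
    return sum(counts)

job = IncrementalMapReduce(wc_map, wc_reduce)
out1 = job.run([b"a b a", b"c c"])       # cold run: every chunk computed
out2 = job.run([b"a b a", b"c c d"])     # only the changed chunk is remapped
```

After the second run, the unchanged first chunk is served from the cache, which is the same reuse that lets Incoop skip work when consecutive job inputs overlap.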