Incremental MapReduce Computations

Pramod Bhatotia, Alexander Wieder, Umut A. Acar, R. Rodrigues
{"title":"Incremental MapReduce Computations","authors":"Pramod Bhatotia, Alexander Wieder, Umut A. Acar, R. Rodrigues","doi":"10.1201/b17112-5","DOIUrl":null,"url":null,"abstract":"Abstract Distributed processing of large data sets is an area that received much attention from researchers and practitioners over the last few years. In this context, several proposals exist that leverage the observation that data sets evolve over time, and as such there is often a substantial overlap between the input to consecutive runs of a data processing job. This allows the programmers of these systems to devise an e ffi cient logic to update the output upon an input change. However, most of these systems lack compatibility existing models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity. In this chapter, we describe our previous work on building a platform called Incoop, which allows for running MapReduce computations incrementally and transparently. Incoop detects changes between two files that are used as inputs to consecutive MapReduce jobs, and e ffi ciently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks, and reusing these results whenever possible. Doing this e ffi ciently introduces several technical challenges that are overcome with novel concepts, such as a large-scale storage system that e ffi ciently computes deltas between two inputs, a Contraction phase to break up the work of the Reduce phase, and an a ffi nity-based scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. 
Our results show significant performance improvements without changing a single line of application code.","PeriodicalId":448182,"journal":{"name":"Large Scale and Big Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Large Scale and Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1201/b17112-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Distributed processing of large data sets is an area that has received much attention from researchers and practitioners over the last few years. In this context, several proposals leverage the observation that data sets evolve over time, so there is often substantial overlap between the inputs to consecutive runs of a data-processing job. This overlap allows the programmers of these systems to devise efficient logic for updating the output when the input changes. However, most of these systems lack compatibility with existing programming models and require the programmer to implement an application-specific dynamic algorithm, which increases algorithm and code complexity. In this chapter, we describe our previous work on building a platform called Incoop, which allows MapReduce computations to run incrementally and transparently. Incoop detects changes between two files used as inputs to consecutive MapReduce jobs, and efficiently propagates those changes until the new output is produced. The design of Incoop is based on memoizing the results of previously run tasks and reusing these results whenever possible. Doing this efficiently introduces several technical challenges, which are overcome with novel concepts such as a large-scale storage system that efficiently computes deltas between two inputs, a Contraction phase that breaks up the work of the Reduce phase, and an affinity-based scheduling algorithm. This chapter presents the motivation and design of Incoop, as well as a complete evaluation using several application benchmarks. Our results show significant performance improvements without changing a single line of application code.
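To make the memoization idea concrete, the following is a minimal illustrative sketch, not Incoop's actual implementation: map-task outputs are cached keyed by a content hash of each input chunk, so a second run over a mostly unchanged input recomputes only the chunks that differ. All names here (`IncrementalMapReduce`, `chunk_key`, the word-count functions) are hypothetical; a real system such as Incoop additionally memoizes sub-reductions via its Contraction phase and schedules tasks for cache affinity.

```python
import hashlib

def chunk_key(chunk: bytes) -> str:
    # Content-based key: identical chunks map to the same cache entry.
    return hashlib.sha256(chunk).hexdigest()

class IncrementalMapReduce:
    """Toy incremental MapReduce driver with task-level memoization."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self.memo = {}    # chunk hash -> memoized map output
        self.reused = 0   # how many chunks were served from the cache

    def run(self, chunks):
        # Map phase: recompute only chunks whose content hash is unseen.
        intermediate = []
        for chunk in chunks:
            key = chunk_key(chunk)
            if key in self.memo:
                self.reused += 1
            else:
                self.memo[key] = list(self.map_fn(chunk))
            intermediate.extend(self.memo[key])
        # Shuffle + reduce: group intermediate pairs by key, then reduce.
        groups = {}
        for k, v in intermediate:
            groups.setdefault(k, []).append(v)
        return {k: self.reduce_fn(k, vs) for k, vs in groups.items()}

# Word count, the classic MapReduce example.
def wc_map(chunk):
    for word in chunk.decode().split():
        yield (word, 1)

def wc_reduce(word, counts):
    return sum(counts)

job = IncrementalMapReduce(wc_map, wc_reduce)
out1 = job.run([b"a b a", b"c c"])       # cold run: every chunk computed
out2 = job.run([b"a b a", b"c c d"])     # only the changed chunk is remapped
```

After the second run, the unchanged first chunk is served from the cache, which is the same reuse that lets Incoop skip work when consecutive job inputs overlap.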