Data Pipeline in MapReduce

2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.21

Jiaan Zeng, Beth Plale


Citations: 4

Abstract



MapReduce is an effective programming model for large-scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, require that the entire input dataset be loaded into the cluster before any analysis can take place. This can introduce sizable latency when the data set is large and when it is not possible to load the data once and process it many times, a situation that arises for log files, health records, and protected texts, for instance. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two scheduling challenges: scheduling with a fixed number of maps and scheduling with a dynamic number of maps, the latter of which allows better handling of input data sets of unknown size. We also employ a delay scheduler to achieve data locality in the data pipeline. An evaluation of the solution on different applications with real-world data sets shows that our approach yields performance gains.
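The overlap of upload and execution described in the abstract can be illustrated with a small sketch. This is not the authors' Hadoop implementation: it is a single-process stand-in that uses Python's in-memory `queue.Queue` in place of the distributed concurrency queue, and the names `upload_blocks`, `map_worker`, `run_pipeline`, and `END_OF_INPUT` are hypothetical. Map workers start before any data has arrived and consume blocks as the uploader publishes them, so the total input size does not need to be known in advance.

```python
import queue
import threading

# Sentinel the uploader publishes to signal that no more blocks will arrive.
END_OF_INPUT = object()


def upload_blocks(blocks, block_queue, num_workers):
    """Hypothetical uploader: publish each block as soon as it is available."""
    for block in blocks:
        block_queue.put(block)
    # One sentinel per worker so that every map worker terminates.
    for _ in range(num_workers):
        block_queue.put(END_OF_INPUT)


def map_worker(block_queue, results, lock):
    """Hypothetical map task: count words in each block as it arrives."""
    while True:
        block = block_queue.get()  # blocks until the uploader publishes data
        if block is END_OF_INPUT:
            break
        count = len(block.split())
        with lock:
            results.append(count)


def run_pipeline(blocks, num_workers=2):
    """Start the map workers first, then upload, so the two phases overlap."""
    block_queue = queue.Queue()  # in-memory stand-in for the distributed queue
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=map_worker, args=(block_queue, results, lock))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    uploader = threading.Thread(
        target=upload_blocks, args=(blocks, block_queue, num_workers)
    )
    uploader.start()
    uploader.join()
    for worker in workers:
        worker.join()
    return sum(results)  # trivial "reduce": total word count
```

Calling `run_pipeline(iter(["a b c", "d e", "f"]))` accepts an iterator whose length is not known up front, which loosely mirrors the unknown-input-size case; data locality and delay scheduling are outside the scope of this sketch.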


Source journal

2013 IEEE 9th International Conference on e-Science

Self-citation rate

0.00%

Latest papers from this venue

Policy Derived Access Rights in the Social Cloud

Accelerating In-memory Cross Match of Astronomical Catalogs

Scientific Analysis by Queries in Extended SPARQL over a Scalable e-Science Data Store

Malleable Access Rights to Establish and Enable Scientific Collaboration

An Autonomous Security Storage Solution for Data-Intensive Cooperative Cloud Computing