Communication optimisation for intermediate data of MapReduce computing model

Int. J. Comput. Sci. Eng. Pub Date: 2020-03-06 DOI: 10.1504/ijcse.2020.10027428

Yunpeng Cao, Haifeng Wang


Citations: 3

Abstract

MapReduce is a typical computing model for processing and analysing big data. A MapReduce job produces a large amount of intermediate data after the map phase. During the Shuffle process, this massive intermediate data causes heavy communication across rack switches, which degrades the performance of heterogeneous cluster computing. To optimise the intermediate data communication performance of map-intensive jobs, features are extracted from the pre-running scheduling information of MapReduce jobs, and the jobs are classified by machine learning. Jobs with active intermediate data communication are mapped into a single rack to preserve the communication locality of intermediate data. Jobs with inactive communication are deployed to nodes sorted by computing performance. The experimental results show that the proposed communication optimisation scheme is effective for Shuffle-intensive jobs, with gains of 4%–5%. With larger amounts of input data, the scheme remains robust and adapts to heterogeneous clusters. In a multi-user application scenario, intermediate data communication is reduced by 4.1%.
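The placement rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` fields, the `place_job` function, the slot model, and the rack tie-break are all hypothetical, and the machine-learning classifier is replaced by a plain boolean flag supplied by the caller. It only shows the two branches: communication-active jobs are kept inside one rack when a rack can hold them, and other jobs go to the fastest nodes first.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    perf: float      # hypothetical relative computing-performance score
    free_slots: int  # task slots currently available on the node


def place_job(nodes: List[Node], shuffle_intensive: bool, tasks: int) -> List[str]:
    """Return one node name per task for a single job (illustrative only)."""
    candidates = list(nodes)
    if shuffle_intensive:
        # Group nodes by rack and look for racks that can hold the whole job.
        racks: Dict[str, List[Node]] = {}
        for node in nodes:
            racks.setdefault(node.rack, []).append(node)
        fitting = [members for members in racks.values()
                   if sum(n.free_slots for n in members) >= tasks]
        if fitting:
            # Assumed tie-break: prefer the rack with the most free slots.
            candidates = max(fitting,
                             key=lambda members: sum(n.free_slots for n in members))
        # If no single rack fits, fall back to all nodes (cross-rack traffic).

    if sum(n.free_slots for n in candidates) < tasks:
        raise ValueError("not enough free slots for this job")

    # Fill the fastest candidate nodes first.
    placement: List[str] = []
    for node in sorted(candidates, key=lambda n: n.perf, reverse=True):
        take = min(node.free_slots, tasks - len(placement))
        placement.extend([node.name] * take)
    return placement


cluster = [
    Node("a1", "A", 1.0, 2),
    Node("a2", "A", 2.0, 2),
    Node("b1", "B", 3.0, 2),
    Node("b2", "B", 0.5, 1),
]
print(place_job(cluster, shuffle_intensive=True, tasks=4))   # ['a2', 'a2', 'a1', 'a1']
print(place_job(cluster, shuffle_intensive=False, tasks=4))  # ['b1', 'b1', 'a2', 'a2']
```

In the first call the job stays entirely in rack A, the only rack with four free slots, so its shuffle traffic would not cross a rack switch. In the second call the job is spread over the two fastest nodes regardless of rack.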
