Towards Performance Optimization for Hadoop MapReduce Applications

2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)
Pub Date: 2020-06-01
DOI: 10.1109/ECTI-CON49241.2020.9158095

Thandar Htay, S. Phyu


Citations: 0

Abstract

Apache Hadoop is a widely used open-source distributed platform for big data processing and provides a YARN-based distributed parallel processing framework on low-cost commodity machines. However, YARN adopts static resource management: the number of containers available per node and the size of each container are fixed by pre-configured default resource units called containers. This leads to poor performance across the various kinds of MapReduce applications. In addition, during the last wave of a job, many available resources are frequently left idle because YARN does not consider the wave behavior of tasks in MapReduce applications. To take advantage of these idle resources and improve performance, an important parameter, the number of map tasks, which is governed by the split size, needs to be optimized based on the available resources. Therefore, this parameter is optimized by tuning the split size based on the available resources. To address the drawback of static resource management in YARN, the number of concurrent containers per machine is tuned to optimize node performance for running each MapReduce application. According to the experimental results, the proposed system, which optimizes the selected parameter on top of the optimized concurrent containers, achieves performance gains for MapReduce applications while reducing optimization overhead.
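The relationship between split size, map task count, and waves described in the abstract can be illustrated with a little arithmetic. The sketch below is not the authors' algorithm; it is a minimal model under stated assumptions: every function name is hypothetical, containers are sized by memory and vcores only, the input is treated as one contiguous byte range, and the cluster figures in the usage lines are invented for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating-point rounding on byte counts)."""
    return -(-a // b)


def containers_per_node(node_mem_mb: int, node_vcores: int,
                        task_mem_mb: int, task_vcores: int = 1) -> int:
    """Concurrent map containers one node can host (simplified model:
    the tighter of the memory limit and the vcore limit)."""
    return min(node_mem_mb // task_mem_mb, node_vcores // task_vcores)


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Approximate map task count: one task per input split."""
    return ceil_div(input_bytes, split_bytes)


def num_waves(tasks: int, slots: int) -> int:
    """Number of rounds needed to run `tasks` on `slots` concurrent containers."""
    return ceil_div(tasks, slots)


def last_wave_idle(tasks: int, slots: int) -> int:
    """Containers left idle while the last wave runs."""
    remainder = tasks % slots
    return 0 if remainder == 0 else slots - remainder


def tuned_split_bytes(input_bytes: int, slots: int, base_split_bytes: int) -> int:
    """Hypothetical tuning rule: keep the wave count implied by the base
    split size, but spread the input over waves * slots tasks so that the
    last wave is full. The resulting task count is at most waves * slots."""
    waves = num_waves(num_map_tasks(input_bytes, base_split_bytes), slots)
    return ceil_div(input_bytes, waves * slots)


MIB = 1024 * 1024

# Illustrative cluster (assumed values): 4 nodes, 8192 MB and 8 vcores each,
# map containers of 2048 MB and 1 vcore.
slots = 4 * containers_per_node(8192, 8, 2048)

input_bytes = 9216 * MIB        # 9 GiB of input
base_split = 128 * MIB          # a common HDFS block size

base_tasks = num_map_tasks(input_bytes, base_split)
tuned_split = tuned_split_bytes(input_bytes, slots, base_split)
tuned_tasks = num_map_tasks(input_bytes, tuned_split)

print(slots, base_tasks, last_wave_idle(base_tasks, slots))
print(tuned_split, tuned_tasks, last_wave_idle(tuned_tasks, slots))
```

With these assumed figures, the base split size leaves half of the 16 containers idle in the fifth wave, while the tuned split size fills all five waves. In a real job the split size is computed per file by the input format, so the task count can differ from this estimate; a value like `tuned_split` would typically be applied through the job's maximum split size setting, which should be checked against the Hadoop version in use.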

