Towards Performance Optimization for Hadoop MapReduce Applications
Authors: Thandar Htay, S. Phyu
Venue: 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)
Publication date: 2020-06-01
DOI: 10.1109/ECTI-CON49241.2020.9158095 (https://doi.org/10.1109/ECTI-CON49241.2020.9158095)
Citation count: 0
Abstract
Apache Hadoop is a widely used open-source distributed platform for big data processing, and it provides a YARN-based distributed parallel processing framework on low-cost commodity machines. However, YARN adopts static resource management: the number of containers available per node and the size of each container, the pre-configured default resource unit, are fixed. This leads to poor performance across the various kinds of MapReduce applications. In addition, during the last wave of a job, many available resources are frequently idle because YARN does not consider the wave behavior of MapReduce tasks. To exploit these idle resources and improve performance, an important parameter, the number of map tasks, which is governed by the split size, needs to be optimized based on the available resources. This parameter is therefore optimized by tuning the split size according to the available resources. To address the drawback of YARN's static resource management in Hadoop, the number of concurrent containers per machine is tuned to optimize node performance for each MapReduce application. According to the experimental results, the proposed system, which optimizes the selected parameter on top of the optimized concurrent containers, achieves performance gains for MapReduce applications while reducing the optimization overhead.
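The relationship between split size, map task count, and idle containers in the last wave can be illustrated with a little arithmetic. The sketch below is not the authors' algorithm, which the abstract does not spell out. It assumes one map task per input split, a fixed number of concurrently running containers across the cluster, and a simple heuristic that keeps the wave count of the default split size and spreads the input evenly over full waves. All function names and the example figures (10 GiB input, 128 MiB default split, 24 containers) are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Approximate map task count, assuming one task per input split."""
    return ceil_div(input_bytes, split_bytes)


def num_waves(tasks: int, containers: int) -> int:
    """Rounds of execution needed when `containers` tasks run at once."""
    return ceil_div(tasks, containers)


def idle_in_last_wave(tasks: int, containers: int) -> int:
    """Containers left without a task during the final wave."""
    remainder = tasks % containers
    return 0 if remainder == 0 else containers - remainder


def tune_split_size(input_bytes: int, default_split_bytes: int,
                    containers: int) -> int:
    """Hypothetical heuristic: keep the default wave count, but choose a
    split size that yields exactly waves * containers map tasks."""
    waves = num_waves(num_map_tasks(input_bytes, default_split_bytes),
                      containers)
    target_tasks = waves * containers
    return ceil_div(input_bytes, target_tasks)


if __name__ == "__main__":
    MIB = 1024 * 1024
    input_bytes = 10 * 1024 * MIB   # assumed 10 GiB input
    default_split = 128 * MIB       # assumed default split size
    containers = 24                 # assumed concurrent containers

    for label, split in [
        ("default", default_split),
        ("tuned", tune_split_size(input_bytes, default_split, containers)),
    ]:
        tasks = num_map_tasks(input_bytes, split)
        print(label, split, tasks,
              num_waves(tasks, containers),
              idle_in_last_wave(tasks, containers))
```

With these assumed figures, the default split produces 80 map tasks in 4 waves and leaves 16 containers idle in the last wave, while the tuned split produces 96 tasks that fill all 4 waves. A real Hadoop job also depends on file boundaries, block size, and the input format's split logic, so this is only a back-of-the-envelope model.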