Idempotent Task Cache System for Handling Intermediate Data Skew in MapReduce on Cloud Computing
Tzu-Chi Huang, Kuo-Chih Chu, Jia-Hui Lin, C. Shieh
2016 International Computer Symposium (ICS), published 2016-12-01
DOI: 10.1109/ICS.2016.0111 (https://doi.org/10.1109/ICS.2016.0111)
Citations: 2
Abstract
MapReduce is the de facto standard programming model for cloud applications, and MapReduce systems are becoming a popular platform for developing them. However, a MapReduce system may suffer from intermediate data skew, which degrades performance: input data is unpredictable, and the application's Map function may generate different quantities of intermediate data depending on the application algorithm. This paper proposes the Idempotent Task Cache System (ITCS) to handle intermediate data skew. With ITCS, a MapReduce system avoids the negative performance impact of skew by using caches to skip the heavy workload of processing skewed intermediate data in certain Reduce tasks. In experiments with several popular applications, a MapReduce system using ITCS not only suffers a smaller performance penalty when intermediate data skew occurs, but also greatly outperforms native MapReduce systems that run without ITCS.
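The abstract does not describe how ITCS identifies or stores cached work, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes that a Reduce task is idempotent (the same key and values always yield the same result) and that a cached result can be looked up by a fingerprint of that input. The names `fingerprint`, `CachedReducer`, and `map_and_shuffle` are hypothetical and exist only for this illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(key, values):
    """Hypothetical cache key: a digest of the Reduce key and its sorted values."""
    h = hashlib.sha256(repr(key).encode())
    for v in sorted(values):
        h.update(b"|" + repr(v).encode())
    return h.hexdigest()


class CachedReducer:
    """Wraps an idempotent Reduce function and reuses results for repeated input."""

    def __init__(self, reduce_fn):
        self.reduce_fn = reduce_fn
        self.cache = {}   # fingerprint -> previously computed Reduce result
        self.hits = 0     # number of Reduce calls answered from the cache

    def run(self, key, values):
        fp = fingerprint(key, values)
        if fp in self.cache:
            # Same input seen before: skip the Reduce work entirely.
            self.hits += 1
            return self.cache[fp]
        result = self.reduce_fn(key, values)
        self.cache[fp] = result
        return result


def map_and_shuffle(records, map_fn):
    """Apply the Map function and group intermediate values by key."""
    groups = defaultdict(list)
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    return groups


def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    return sum(values)


# "the" dominates the intermediate data, so its Reduce task is the skewed one.
lines = ["the the the the cat", "the dog"]
reducer = CachedReducer(word_count_reduce)

first = {k: reducer.run(k, v)
         for k, v in map_and_shuffle(lines, word_count_map).items()}
second = {k: reducer.run(k, v)
          for k, v in map_and_shuffle(lines, word_count_map).items()}
```

On the second run every Reduce call is served from the cache, including the one for the skewed key. Note that this sketch still reads every value in order to compute the fingerprint; a real system would need a cheaper way to recognize repeated input, and the abstract does not say how ITCS achieves that.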