Improving MapReduce Performance by Streaming Input Data from Multiple Replicas
Authors: Jiadong Wu, Bo Hong
Venue: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science
Published: 2013-12-02
DOI: 10.1109/CloudCom.2013.88 (https://doi.org/10.1109/CloudCom.2013.88)
Citations: 3
Abstract
The MapReduce programming model, along with its open-source implementation Hadoop, has provided a cost-effective solution for many data-intensive applications. Hadoop stores data in a distributed fashion and exploits data locality by assigning tasks to the nodes where the data is stored. In many cases, however, accessing remote data (rack-local and off-rack) is inevitable. In this paper we evaluate the possibility of improving remote data access performance by streaming data from multiple available replicas. The proposed design consists of a circular buffer, a slice reader, and an enhanced Data Node. Such a system is capable of adapting to both the static performance variance caused by network topology and the dynamic variance caused by congestion. Extensive experiments show that multi-source streaming can significantly improve the throughput of remote data access and accelerate the related map tasks by 10% to 20%. In some imbalanced environments, the proposed system can achieve as much as a 4x speedup.
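The abstract does not specify how slices are assigned to replicas, so the following is only a minimal sketch of one plausible policy, not the authors' implementation. It assumes a block is cut into fixed-size slices and that whichever replica becomes idle first fetches the next slice, so faster replicas naturally serve more of the block. The function name `simulate_multi_source`, the parameters, and the transfer rates are all hypothetical, and the circular buffer and Data Node changes are not modeled.

```python
import heapq


def simulate_multi_source(block_size, slice_size, rates):
    """Simulate greedy slice fetching from several replicas (illustrative only).

    block_size and slice_size are in arbitrary units; rates holds a
    hypothetical transfer rate (units per second) for each replica.
    Returns (finish_time, slices_fetched_per_replica).
    """
    n_slices = -(-block_size // slice_size)  # ceiling division
    # Min-heap of (time_when_idle, replica_index); all replicas are idle at t=0.
    idle_at = [(0.0, i) for i in range(len(rates))]
    heapq.heapify(idle_at)
    counts = [0] * len(rates)
    finish = 0.0
    for s in range(n_slices):
        # The last slice may be shorter than slice_size.
        size = min(slice_size, block_size - s * slice_size)
        t, i = heapq.heappop(idle_at)  # replica that is free earliest
        done = t + size / rates[i]
        counts[i] += 1
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, i))
    return finish, counts


single = simulate_multi_source(8, 1, [1.0])       # one slow replica
multi = simulate_multi_source(8, 1, [2.0, 1.0])   # add a faster replica
```

Under these made-up rates, the 8-unit block finishes in 3.0 seconds with two replicas versus 8.0 seconds from the slow replica alone, and the faster replica serves 5 of the 8 slices. Because assignment happens slice by slice, a replica that slows down partway through would simply pick up fewer of the remaining slices, which is one way a reader could adapt to congestion.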