A performance comparison of Apache Tez and MapReduce with data compression on Hadoop cluster
Kritwara Rattanaopas
2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 1-5
Published: 2017-07-01
DOI: https://doi.org/10.1109/JCSSE.2017.8025950
Citations: 4
Abstract
Big data is a popular topic in cloud computing research. Its main characteristics are volume, velocity, and variety, which are difficult to handle with traditional software and methods. Hadoop is an open-source software framework developed to address big data problems in several domains. For big data analytics, the MapReduce framework is a primary engine of the Hadoop cluster and is widely used today. It uses batch-oriented processing. Apache also developed an alternative engine called “Tez”, which supports interactive queries and does not write temporary data to HDFS. In this paper, we compare the performance of MapReduce and Tez. We also investigate how the two engines perform when input files and map output files are compressed: Bzip is used for input files and Snappy for map output files. The word-count and terasort benchmarks are used in our experiments. For the word-count benchmark, the results show that the Tez engine always has a lower execution time than the MapReduce engine, for both compressed and uncompressed data; it reduces execution time by up to 39% compared with MapReduce. In contrast, for the terasort benchmark, the Tez engine usually has a higher execution time than the MapReduce engine, by up to 13%. The results also show that compressing map output files with Snappy improves execution time for both benchmarks.
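The setup described in the abstract (choice of engine, Snappy compression of map output, and relative execution-time figures) can be illustrated with a minimal Python sketch. It is not the paper's tooling: the helper names `job_options` and `relative_change` are hypothetical, and the mapping of "tez" to `mapreduce.framework.name=yarn-tez` is an assumption about how the jobs might be launched. The property names themselves are standard Hadoop 2.x settings. Any timings passed in are placeholders, not results from the paper.

```python
from typing import Dict, List

# Assumption: the engine is selected through mapreduce.framework.name,
# with "yarn" for classic MapReduce and "yarn-tez" for Tez.
FRAMEWORKS: Dict[str, str] = {"mapreduce": "yarn", "tez": "yarn-tez"}

SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def job_options(engine: str, compress_map_output: bool) -> List[str]:
    """Build generic -D options for one benchmark run (hypothetical helper).

    No option is added for input compression: Hadoop usually picks the
    input codec from the file extension (for example .bz2), which is an
    assumption about the setup rather than something the abstract states.
    """
    if engine not in FRAMEWORKS:
        raise ValueError(f"unknown engine: {engine!r}")
    options = {
        "mapreduce.framework.name": FRAMEWORKS[engine],
        "mapreduce.map.output.compress": str(compress_map_output).lower(),
    }
    if compress_map_output:
        options["mapreduce.map.output.compress.codec"] = SNAPPY_CODEC
    return [f"-D{key}={value}" for key, value in options.items()]


def relative_change(baseline_s: float, candidate_s: float) -> float:
    """Percent change of candidate vs. baseline; negative means faster."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (candidate_s - baseline_s) / baseline_s


if __name__ == "__main__":
    # Print the options for a Tez run with Snappy-compressed map output.
    print(" ".join(job_options("tez", compress_map_output=True)))
```

With this convention, a reduction "by up to 39%" corresponds to `relative_change` returning about -39, and an increase "by up to 13%" to about +13.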