Keddah: Capturing Hadoop Network Behaviour
Jie Deng, Gareth Tyson, F. Cuadrado, S. Uhlig
2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017-06-05
DOI: 10.1109/ICDCS.2017.211 (https://doi.org/10.1109/ICDCS.2017.211)
Citations: 2
Abstract
As a distributed system, Hadoop relies heavily on the network to complete data processing jobs. Although Hadoop traffic is perceived to be critical for job execution performance, its actual behaviour is still poorly understood. This lack of understanding greatly complicates research that relies on Hadoop workloads. In this paper, we explore Hadoop traffic through experimentation. We analyse the traffic generated by multiple types of MapReduce jobs under varying input sizes and cluster configuration parameters. Building on this analysis, we present Keddah, a toolchain for capturing, modelling and reproducing Hadoop traffic for use with network simulators. Keddah can be used to create empirical Hadoop traffic models, enabling reproducible Hadoop research in more realistic scenarios.