Makoto Nakagami, J. Kon, Gil Jae Lee, J. Fortes, Saneyasu Yamaguchi
File Placing Location Optimization on Hadoop SWIM
2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), 2018-11-01
DOI: 10.1109/CANDARW.2018.00100 (https://doi.org/10.1109/CANDARW.2018.00100)
Citation count: 1
Abstract
Hadoop is a popular platform based on the MapReduce model for processing big data. To improve I/O performance in Hadoop, this paper uses realistic workloads to conduct an in-depth evaluation of a method that optimizes where files are placed in storage. The method places files in the outer zones of hard disk drives because sequential access in the outer zones is generally faster than in the inner zones. The research reported here goes beyond a single I/O-intensive example job (e.g., TeraSort) and uses realistic workloads generated by the Statistical Workload Injector for MapReduce (SWIM). First, the CPU and I/O resource usage of SWIM jobs is explored in various settings, and it is shown that a shuffle-heavy workload is I/O bound. Second, the I/O patterns of some SWIM jobs are investigated, and it is shown that their accesses are performed sequentially. Third, the proposed method is applied to a shuffle-heavy SWIM job and evaluated; the results demonstrate that the method can improve performance by 14%.
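
The claim that outer zones give faster sequential access can be illustrated with an idealized model. The sketch below is not taken from the paper: it assumes a constant rotation speed and a constant linear recording density, so the amount of data passing under the head per revolution grows with the track radius. The function name, the radii, the RPM, and the density value are all hypothetical and chosen only for illustration; the sketch does not attempt to reproduce the 14% result reported above.

```python
import math


def sequential_throughput_mb_s(radius_mm, rpm=7200, density_kb_per_mm=1.5):
    """Idealized sequential transfer rate (MB/s) of a track at a given radius.

    Assumptions (hypothetical, for illustration only):
    - the platter spins at a constant angular velocity (rpm);
    - linear recording density is the same on every track.
    """
    track_length_mm = 2 * math.pi * radius_mm          # circumference of the track
    track_capacity_kb = track_length_mm * density_kb_per_mm
    revolutions_per_s = rpm / 60
    return track_capacity_kb * revolutions_per_s / 1000  # KB/s -> MB/s


# Hypothetical outermost and innermost track radii of a 3.5-inch platter.
outer = sequential_throughput_mb_s(radius_mm=46.0)
inner = sequential_throughput_mb_s(radius_mm=20.0)
ratio = outer / inner

print(f"outer: {outer:.1f} MB/s, inner: {inner:.1f} MB/s, ratio: {ratio:.2f}")
```

In this model the throughput ratio equals the ratio of the radii, whatever RPM and density are chosen. Real drives group tracks into discrete zones, so the measured rate falls in steps from the outer edge to the inner edge, and the end-to-end gain for a Hadoop job also depends on how much of its run time is spent on disk I/O.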