File Placing Location Optimization on Hadoop SWIM

Makoto Nakagami, J. Kon, Gil Jae Lee, J. Fortes, Saneyasu Yamaguchi

2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), November 2018
DOI: 10.1109/CANDARW.2018.00100
Hadoop is a popular platform based on the MapReduce model for processing big data. To improve I/O performance in Hadoop, this paper uses realistic workloads to conduct in-depth evaluations of a method that optimally places files in storage. This method places files in the outer zones of hard disk drives because sequential access in the outer zones is generally faster than in the inner zones. The research reported in this paper goes beyond using an I/O-intensive job example (e.g., TeraSort) to use realistic workloads generated by the Statistical Workload Injector for MapReduce (SWIM). First, the CPU and I/O resource usage of SWIM jobs is explored in various settings, and it is shown that a shuffle-heavy workload is I/O bound. Second, the I/O patterns of some SWIM jobs are investigated, and it is shown that their accesses are performed sequentially. Third, the proposed method is applied to a shuffle-heavy SWIM job and evaluated; the results demonstrate that the method can improve performance by 14%.
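The outer-zone effect the paper exploits comes from zoned bit recording: outer tracks hold more sectors per revolution, so sequential throughput is highest at the outer edge of the platter (low logical block addresses) and falls toward the spindle. A minimal sketch of how placement position affects sequential transfer time, using hypothetical throughput numbers (a linear 200 MB/s to 100 MB/s falloff, not measurements from the paper):

```python
# Toy model of zoned-bit-recording throughput (hypothetical numbers,
# not measurements from the paper). Position 0.0 is the outermost
# zone (lowest LBAs); position 1.0 is the innermost zone.

def zone_throughput_mb_s(position: float) -> float:
    """Sequential throughput at a radial position on the disk.
    Assumes a linear falloff from 200 MB/s (outer) to 100 MB/s (inner),
    a rough but common shape for consumer HDDs."""
    return 200.0 - 100.0 * position

def transfer_time_s(file_mb: float, position: float) -> float:
    """Time to sequentially read a file placed at the given position."""
    return file_mb / zone_throughput_mb_s(position)

if __name__ == "__main__":
    outer = transfer_time_s(1024, 0.05)  # file placed near the outer edge
    inner = transfer_time_s(1024, 0.95)  # file placed near the spindle
    print(f"outer: {outer:.1f} s, inner: {inner:.1f} s")
    print(f"inner/outer slowdown: {inner / outer:.2f}x")
```

Under these assumed numbers, placing a file in the innermost zone nearly doubles its sequential read time relative to the outermost zone, which is the gap the paper's placement method targets for shuffle-heavy, sequentially accessed Hadoop data.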