DYRS: Bandwidth-Aware Disk-to-Memory Migration of Cold Data in Big-Data File Systems
Simbarashe Dzinamarira, Florin Dinu, T. Ng
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2019-05-01
DOI: https://doi.org/10.1109/IPDPS.2019.00069
Citations: 1
Abstract
Migrating data into memory can significantly accelerate big-data applications by hiding low disk throughput. While prior work has mostly targeted caching frequently used data, the techniques employed do not benefit jobs that read cold data. For these jobs, the file system has to proactively migrate the inputs into memory. Successfully migrating cold inputs can result in a large speedup for many jobs, especially those that spend a significant part of their execution reading inputs. In this paper, we use data from the Google cluster trace to make the case that the conditions in production workloads are favorable for migration. We then design and implement DYRS, a framework for migrating cold data in big-data file systems. DYRS can adapt to match the available bandwidth on storage nodes, ensuring all nodes are fully utilized throughout the migration. In addition to balancing the load, DYRS optimizes the placement of each migration to maximize the number of successful migrations and eliminate stragglers at the end of a job. We evaluate DYRS using several Hive queries, a trace-based workload from Facebook, and the Sort application. Our results show that DYRS successfully adapts to bandwidth heterogeneity and effectively migrates data. DYRS accelerates Hive queries by up to 48%, and by 36% on average. Jobs in the trace-based workload experience a speedup of 33% on average. The mapper tasks in this workload have an even greater speedup of 46%. DYRS accelerates Sort jobs by up to 20%.
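The abstract does not spell out how DYRS places migrations, so the following is only a minimal sketch of the general idea of bandwidth-aware placement, not the authors' algorithm. It assumes each block has replicas on several storage nodes and that an estimate of each node's currently available disk bandwidth is known. The names `Node`, `bandwidth_mbps`, `queued_mb`, and `place_migrations` are hypothetical and chosen for illustration. Each block is assigned greedily to the replica holder that would finish reading it into memory earliest.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a storage node; not taken from the paper.
    name: str
    bandwidth_mbps: float  # assumed estimate of available disk bandwidth (MB/s)
    queued_mb: float = 0.0  # data already scheduled for migration on this node

    def finish_time(self, extra_mb: float = 0.0) -> float:
        """Estimated seconds to drain the queue plus `extra_mb` more data."""
        return (self.queued_mb + extra_mb) / self.bandwidth_mbps


def place_migrations(blocks, nodes):
    """Greedily pick, for each block, the replica node that finishes earliest.

    blocks: list of (block_id, size_mb, [names of nodes holding a replica])
    nodes:  dict mapping node name -> Node
    Returns a dict mapping block_id -> chosen node name.
    """
    plan = {}
    # Place larger blocks first so small ones can fill the remaining gaps.
    for block_id, size_mb, replicas in sorted(blocks, key=lambda b: -b[1]):
        best = min((nodes[r] for r in replicas),
                   key=lambda n: n.finish_time(size_mb))
        best.queued_mb += size_mb
        plan[block_id] = best.name
    return plan


if __name__ == "__main__":
    nodes = {"fast": Node("fast", 200.0), "slow": Node("slow", 60.0)}
    blocks = [(f"b{i}", 128.0, ["fast", "slow"]) for i in range(4)]
    for block_id, node_name in place_migrations(blocks, nodes).items():
        print(block_id, "->", node_name)
```

In this toy setup the faster node receives most of the blocks, and the slower node receives work only once the faster node's queue has grown long enough that the slower node would finish sooner. This is the kind of load skew toward available bandwidth that helps avoid stragglers at the end of a migration. A real system would also need to refresh the bandwidth estimates as conditions change, which this sketch omits.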