Hideo Inagaki, Tomoyuki Fujii, Ryota Kawashima, H. Matsuo
{"title":"基于工作负载特征的Apache Spark数据缓存机制自适应控制","authors":"Hideo Inagaki, Tomoyuki Fujii, Ryota Kawashima, H. Matsuo","doi":"10.1109/W-FICLOUD.2018.00016","DOIUrl":null,"url":null,"abstract":"Apache Spark caches reusable data into memory/disk. From our preliminary evaluation, we have found that a memory-and-disk caching is ineffective compared to disk-only caching when memory usage has reached its limit. This is because a thrashing state involving frequent data move between the memory and the disk occurs for a memory-and-disk caching. Spark has introduced a thrashing avoidance method for a single RDD (Resilient Distributed Dataset), but it cannot be applied to workloads using multiple RDDs because prior detection of the dependencies between the RDDs is difficult due to unpredictable access pattern. In this paper, we propose a thrashing avoidance method for such workloads. Our method adaptively modifies the cache I/O behavior depending on characteristics of the workload. In particular, caching data are directly written to the disk instead of the memory if cached data are frequently moved from the memory to the disk. Further, cached data are directly returned to the execution-memory instead of the storage-memory if cached data in the disk are required. Our method can adaptively select the optimal cache I/O behavior by observing workload characteristics at runtime instead of analyzing the dependence among RDDs. Evaluation results showed that execution time was reduced by 33% for KMeans using the modified Spark memory-and-disk caching rather than the original.","PeriodicalId":218683,"journal":{"name":"2018 6th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Adaptive Control of Apache Spark's Data Caching Mechanism Based on Workload Characteristics\",\"authors\":\"Hideo Inagaki, Tomoyuki Fujii, Ryota Kawashima, H. Matsuo\",\"doi\":\"10.1109/W-FICLOUD.2018.00016\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Apache Spark caches reusable data into memory/disk. From our preliminary evaluation, we have found that a memory-and-disk caching is ineffective compared to disk-only caching when memory usage has reached its limit. This is because a thrashing state involving frequent data move between the memory and the disk occurs for a memory-and-disk caching. Spark has introduced a thrashing avoidance method for a single RDD (Resilient Distributed Dataset), but it cannot be applied to workloads using multiple RDDs because prior detection of the dependencies between the RDDs is difficult due to unpredictable access pattern. In this paper, we propose a thrashing avoidance method for such workloads. Our method adaptively modifies the cache I/O behavior depending on characteristics of the workload. In particular, caching data are directly written to the disk instead of the memory if cached data are frequently moved from the memory to the disk. Further, cached data are directly returned to the execution-memory instead of the storage-memory if cached data in the disk are required. Our method can adaptively select the optimal cache I/O behavior by observing workload characteristics at runtime instead of analyzing the dependence among RDDs. 
Evaluation results showed that execution time was reduced by 33% for KMeans using the modified Spark memory-and-disk caching rather than the original.\",\"PeriodicalId\":218683,\"journal\":{\"name\":\"2018 6th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 6th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/W-FICLOUD.2018.00016\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 6th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/W-FICLOUD.2018.00016","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Adaptive Control of Apache Spark's Data Caching Mechanism Based on Workload Characteristics
Apache Spark caches reusable data in memory and on disk. Our preliminary evaluation found that memory-and-disk caching is less effective than disk-only caching once memory usage reaches its limit, because memory-and-disk caching enters a thrashing state in which data are frequently moved between memory and disk. Spark provides a thrashing-avoidance method for a single RDD (Resilient Distributed Dataset), but it cannot be applied to workloads that use multiple RDDs, because the dependencies between the RDDs are difficult to detect in advance owing to unpredictable access patterns. In this paper, we propose a thrashing-avoidance method for such workloads. Our method adaptively modifies the cache I/O behavior according to the characteristics of the workload. In particular, cached data are written directly to the disk instead of the memory when cached data are frequently moved from the memory to the disk, and cached data on the disk are returned directly to the execution memory instead of the storage memory when they are required. Our method selects the optimal cache I/O behavior by observing workload characteristics at runtime rather than by analyzing the dependencies among RDDs. Evaluation results showed that the modified memory-and-disk caching reduced the execution time of KMeans by 33% compared with the original Spark.
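For context, the two baseline caching modes compared in the abstract correspond to Spark's standard storage levels. The sketch below is not the paper's adaptive mechanism (which modifies Spark's internals); it only illustrates, under assumed input data and a hypothetical file path, how an iterative workload such as KMeans would select MEMORY_AND_DISK versus DISK_ONLY caching for a reused RDD.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingLevelsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-levels-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // An iterative workload (e.g. KMeans) reuses the same RDD across passes,
    // so the parsed points are cached after the first materialization.
    // The input path is hypothetical.
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(_.split(" ").map(_.toDouble))

    // Default behavior studied in the paper: cache in memory and spill to
    // disk once storage memory is full; this can thrash under memory pressure.
    points.persist(StorageLevel.MEMORY_AND_DISK)

    // Disk-only caching, which the preliminary evaluation found more
    // effective once memory usage reaches its limit:
    // points.persist(StorageLevel.DISK_ONLY)

    // First action materializes the cache; later actions reuse it.
    println(points.count())
    println(points.count())

    spark.stop()
  }
}
```

The paper's contribution, by contrast, is to switch between these behaviors at runtime based on observed workload characteristics rather than requiring the user to fix a storage level in advance.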