{"title":"Improving Apache Spark's Cache Mechanism with LRC-Based Method Using Bloom Filter","authors":"Hideo Inagaki, Ryota Kawashima, H. Matsuo","doi":"10.1109/CANDARW.2018.00096","DOIUrl":null,"url":null,"abstract":"Memory-and-Disk caching is a common caching mechanism for temporal output in Apache Spark. However, it causes performance degradation when memory usage has reached its limit because of the Spark's LRU (Least Recently Used) based cache management. Existing studies have reported that replacement of LRU-based cache mechanism to LRC (Least Reference Count) based one that is a more accurate indicator of the likelihood of future data access. However, frequently used partitions cannot be determined because Spark accesses all of partitions for user-driven RDD operations, even if partitions do not include necessary data. In this paper, we propose a cache management method that enables allocating necessary partitions to the memory by introducing the bloom filter into existing methods. The bloom filter prevents unnecessary partitions from being processed because partitions are checked whether required data is contained. Furthermore, frequently used partitions can be properly determined by measuring the reference count of partitions. We implemented two architecture types, the driver-side bloom filter and the executor-side bloom filter, to consider the optimal place of the bloom filter. Evaluation results showed that the execution time of the driver-side implementation was reduced by 89% in a filter-test benchmark based on the LRC-based method.","PeriodicalId":329439,"journal":{"name":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","volume":"243 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CANDARW.2018.00096","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Memory-and-disk caching is a common mechanism for caching temporary output in Apache Spark. However, it causes performance degradation when memory usage reaches its limit because of Spark's LRU (Least Recently Used) based cache management. Existing studies have reported that replacing the LRU-based cache mechanism with an LRC (Least Reference Count) based one improves performance, because the reference count is a more accurate indicator of the likelihood of future data access. However, frequently used partitions cannot be determined correctly, because Spark accesses all partitions during user-driven RDD operations even if a partition does not contain the necessary data. In this paper, we propose a cache management method that keeps only the necessary partitions in memory by introducing a Bloom filter into the existing methods. The Bloom filter prevents unnecessary partitions from being processed, because each partition is first checked for whether it contains the required data. Furthermore, frequently used partitions can be properly identified by measuring the reference counts of partitions. We implemented two architecture types, a driver-side Bloom filter and an executor-side Bloom filter, to examine the optimal placement of the Bloom filter. Evaluation results show that the driver-side implementation reduced execution time by 89% relative to the LRC-based method in a filter-test benchmark.
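As a rough illustration of the partition-skipping idea described in the abstract (not the authors' implementation), the Scala sketch below builds one Bloom filter per cached partition, collects the filters to the driver, and consults them before scheduling a lookup, so partitions that cannot contain the key are never scanned. It uses Spark's built-in org.apache.spark.util.sketch.BloomFilter; the object name, data sizes, and false-positive rate are illustrative assumptions.

```scala
// Minimal driver-side Bloom filter sketch (illustrative only).
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object DriverSideBloomSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("driver-side-bloom-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Example cached RDD: one million keys spread over 8 partitions.
    val data = sc.parallelize(0 until 1000000, numSlices = 8).map(i => (i, s"value-$i"))
    data.cache()

    // Build one Bloom filter per partition and collect them to the driver.
    val perPartitionFilters: Array[BloomFilter] =
      data.mapPartitions { iter =>
        val bf = BloomFilter.create(200000, 0.01) // expected items, false-positive rate
        iter.foreach { case (k, _) => bf.putLong(k.toLong) }
        Iterator(bf)
      }.collect()

    // Driver-side check: keep only partitions whose filter might contain the key.
    val key = 123456
    val candidateParts = perPartitionFilters.zipWithIndex
      .collect { case (bf, idx) if bf.mightContainLong(key.toLong) => idx }

    // Run the lookup only on candidate partitions; all others are skipped.
    val results = sc.runJob(
      data,
      (iter: Iterator[(Int, String)]) => iter.filter(_._1 == key).toArray,
      candidateParts
    ).flatten

    println(s"Scanned ${candidateParts.length} of 8 partitions, found: ${results.mkString(", ")}")
    spark.stop()
  }
}
```

An executor-side variant, in this sketch's terms, would keep each filter local to its partition and perform the mightContainLong check inside mapPartitions, trading the driver-side scheduling benefit for lower filter-transfer overhead.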