Improving cache performance using read-write partitioning

Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, Daniel A. Jiménez
{"title":"Improving cache performance using read-write partitioning","authors":"S. Khan, Alaa R. Alameldeen, C. Wilkerson, O. Mutlu, Daniel A. Jiménez","doi":"10.1109/HPCA.2014.6835954","DOIUrl":null,"url":null,"abstract":"Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"68","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2014.6835954","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 68

Abstract

Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not account for this disparity in read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where a partition grows in size if it is more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance than prior cache management mechanisms. For a single-core system, RWP provides a 5% average speedup across the entire SPEC CPU2006 suite, and a 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP performs within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), which bypasses cache lines that are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
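The core mechanism described in the abstract, per-set clean and dirty partitions whose target sizes adapt toward the partition more likely to serve future reads, can be illustrated with a small replacement-policy sketch. The Python below is a minimal sketch under stated assumptions, not the paper's implementation: the class `RWPSet`, its `dirty_target` field, and the read-hit-based adjustment heuristic are hypothetical simplifications of the paper's partition-size predictor.

```python
from collections import OrderedDict

class RWPSet:
    """One cache set managed with a simplified read-write partitioning policy.

    Lines are kept in LRU order and tagged clean/dirty. On an eviction, the
    victim comes from whichever partition currently exceeds its target size,
    so the partition predicted to serve more reads keeps more ways. The
    target-adjustment rule is a hypothetical stand-in for the paper's
    predictor: grow the partition whose lines received more read hits.
    """

    def __init__(self, num_ways=8):
        self.num_ways = num_ways
        self.lines = OrderedDict()        # tag -> dirty flag, LRU -> MRU order
        self.dirty_target = num_ways // 2  # target size of the dirty partition
        self.read_hits = {"clean": 0, "dirty": 0}

    def _victim(self):
        """Pick the LRU line from the partition that is over its target."""
        dirty_count = sum(self.lines.values())
        want_dirty_victim = dirty_count > self.dirty_target
        for tag, dirty in self.lines.items():   # iterates LRU -> MRU
            if dirty == want_dirty_victim:
                return tag
        return next(iter(self.lines))           # fall back to global LRU

    def access(self, tag, is_write):
        """Handle one reference; returns True on hit, False on miss."""
        if tag in self.lines:
            dirty = self.lines.pop(tag)          # remove to re-insert at MRU
            if not is_write:
                self.read_hits["dirty" if dirty else "clean"] += 1
            self.lines[tag] = dirty or is_write  # a write marks the line dirty
            return True
        if len(self.lines) >= self.num_ways:
            del self.lines[self._victim()]
        self.lines[tag] = is_write
        return False

    def adjust_target(self):
        """Hypothetical adaptation: shift one way toward the partition
        that served more read hits in the last interval."""
        if self.read_hits["dirty"] > self.read_hits["clean"]:
            self.dirty_target = min(self.num_ways - 1, self.dirty_target + 1)
        elif self.read_hits["clean"] > self.read_hits["dirty"]:
            self.dirty_target = max(1, self.dirty_target - 1)
        self.read_hits = {"clean": 0, "dirty": 0}
```

A driver would call `access(tag, is_write)` for each memory reference and `adjust_target()` at fixed intervals. Evictions are steered toward whichever partition exceeds its target, so lines in the partition serving more read hits survive longer, which is the intuition behind prioritizing read-critical lines to reduce read misses.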