Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

Djordje Jevdjic, Gabriel H. Loh, Cansu Kaynak, Babak Falsafi
Published in: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 25-37
Publication date: 2014-12-13
DOI: 10.1109/MICRO.2014.51
Citations: 159

Abstract

Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic.
Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
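The abstract's key mechanism combines two ideas: tags stored alongside data in the stacked DRAM (as in Alloy Cache) and page-sized allocation with per-block footprint prediction (as in Footprint Cache). The toy model below sketches how those ideas fit together; it is purely illustrative, and all names, parameters, and the trivial neighbor-based predictor are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative toy model of a Unison-style DRAM cache (a sketch, not the
# paper's design): the page tag lives with the data, and a per-block
# "footprint" bitmap records which 64B blocks of a 4KB page are present.

PAGE_SIZE = 4096                              # allocation unit (page)
BLOCK_SIZE = 64                               # transfer unit (block)
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE     # 64 blocks per page

class UnisonSet:
    """One direct-mapped set: tag + per-block valid bits (the footprint)."""
    def __init__(self):
        self.tag = None
        self.valid = [False] * BLOCKS_PER_PAGE

class ToyUnisonCache:
    def __init__(self, num_sets, predictor):
        self.sets = [UnisonSet() for _ in range(num_sets)]
        self.num_sets = num_sets
        self.predictor = predictor  # guesses which blocks a page will use

    def access(self, addr):
        page = addr // PAGE_SIZE
        block = (addr % PAGE_SIZE) // BLOCK_SIZE
        entry = self.sets[page % self.num_sets]
        if entry.tag == page and entry.valid[block]:
            return "hit"     # tag and data read together, one DRAM access
        # Miss: allocate the page, but fetch only the predicted footprint
        # (plus the demanded block) to conserve off-chip bandwidth.
        footprint = self.predictor(page, block)
        footprint.add(block)
        entry.tag = page
        entry.valid = [i in footprint for i in range(BLOCKS_PER_PAGE)]
        return "miss"

# A deliberately naive predictor: fetch the demanded block and its neighbor.
def neighbor_predictor(page, block):
    return {b for b in (block, block + 1) if b < BLOCKS_PER_PAGE}

cache = ToyUnisonCache(num_sets=4, predictor=neighbor_predictor)
print(cache.access(0x0000))   # miss: page allocated, blocks 0-1 fetched
print(cache.access(0x0040))   # hit: block 1 arrived with the footprint
```

A real footprint predictor correlates footprints with the instruction and offset that first touched a page, and the full design must also handle replacement, dirty-block writeback, and way prediction; the sketch only shows why page-granularity allocation plus selective block fetch can deliver high hit rates without off-chip traffic blowup.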