Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

Djordje Jevdjic, Gabriel H. Loh, Cansu Kaynak, Babak Falsafi
Published in: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 25-37
Publication date: 2014-12-13
DOI: 10.1109/MICRO.2014.51
Citations: 159

Abstract

Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic.
Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
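The abstract's key mechanism combines two ideas: tags stored alongside data in the stacked DRAM (as in Alloy Cache) and page-sized allocation with per-block footprint prediction (as in Footprint Cache). The toy model below sketches how those ideas fit together; it is purely illustrative, and all names, parameters, and the trivial neighbor-based predictor are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative toy model of a Unison-style DRAM cache (a sketch, not the
# paper's design): the page tag lives with the data, and a per-block
# "footprint" bitmap records which 64B blocks of a 4KB page are present.

PAGE_SIZE = 4096                              # allocation unit (page)
BLOCK_SIZE = 64                               # transfer unit (block)
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE     # 64 blocks per page

class UnisonSet:
    """One direct-mapped set: tag + per-block valid bits (the footprint)."""
    def __init__(self):
        self.tag = None
        self.valid = [False] * BLOCKS_PER_PAGE

class ToyUnisonCache:
    def __init__(self, num_sets, predictor):
        self.sets = [UnisonSet() for _ in range(num_sets)]
        self.num_sets = num_sets
        self.predictor = predictor  # guesses which blocks a page will use

    def access(self, addr):
        page = addr // PAGE_SIZE
        block = (addr % PAGE_SIZE) // BLOCK_SIZE
        entry = self.sets[page % self.num_sets]
        if entry.tag == page and entry.valid[block]:
            return "hit"     # tag and data read together, one DRAM access
        # Miss: allocate the page, but fetch only the predicted footprint
        # (plus the demanded block) to conserve off-chip bandwidth.
        footprint = self.predictor(page, block)
        footprint.add(block)
        entry.tag = page
        entry.valid = [i in footprint for i in range(BLOCKS_PER_PAGE)]
        return "miss"

# A deliberately naive predictor: fetch the demanded block and its neighbor.
def neighbor_predictor(page, block):
    return {b for b in (block, block + 1) if b < BLOCKS_PER_PAGE}

cache = ToyUnisonCache(num_sets=4, predictor=neighbor_predictor)
print(cache.access(0x0000))   # miss: page allocated, blocks 0-1 fetched
print(cache.access(0x0040))   # hit: block 1 arrived with the footprint
```

A real footprint predictor correlates footprints with the instruction and offset that first touched a page, and the full design must also handle replacement, dirty-block writeback, and way prediction; the sketch only shows why page-granularity allocation plus selective block fetch can deliver high hit rates without off-chip traffic blowup.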