A Software-Managed Approach to Die-Stacked DRAM

M. Oskin, G. Loh
{"title":"A Software-Managed Approach to Die-Stacked DRAM","authors":"M. Oskin, G. Loh","doi":"10.1109/PACT.2015.30","DOIUrl":null,"url":null,"abstract":"Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First we consider exposing die-stacked DRAM directly to applications, relying on the static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches, and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, which is a more general mechanism that is valuable beyond stacked DRAM, and enables OS-managed page caches to achieve a 27% speedup, the second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal and unrealistic system where all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching schemes -- whether hardware or software -- and therefore we recommend that the system still provides APIs to the application layers to explicitly control die-stacked DRAM allocations.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"57","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 57

Abstract

Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First, we consider exposing die-stacked DRAM directly to applications, relying on static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, a general mechanism that is valuable beyond stacked DRAM and enables OS-managed page caches to achieve a 27% speedup; the second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to a 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal (and unrealistic) system in which all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching scheme, whether hardware or software, and therefore we recommend that the system still provide APIs that let the application layer explicitly control die-stacked DRAM allocations.
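To make the first option concrete, the explicit-control API the paper recommends could take the shape of NUMA-style placement, with the stacked DRAM exposed as its own memory node. The sketch below uses the real libnuma interface; treating node 1 as the stacked-DRAM node, and the split into "hot" and "cold" allocations, are illustrative assumptions, not the authors' implementation.

```c
/* A minimal sketch of explicit die-stacked DRAM placement via libnuma.
 * Assumption: the stacked DRAM is exposed as NUMA node 1 and off-chip
 * DRAM as node 0. Build with: cc app.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy API not available\n");
        return 1;
    }

    const int stacked_node = 1;   /* assumed: fast on-package DRAM */
    size_t bytes = 64 << 20;      /* a 64 MiB working set */

    /* Statically place the hot structure in fast stacked DRAM... */
    double *hot = numa_alloc_onnode(bytes, stacked_node);
    /* ...and leave cold data in conventional off-chip DRAM (node 0). */
    double *cold = numa_alloc_onnode(bytes, 0);
    if (!hot || !cold)
        return 1;

    /* ... compute kernel touching 'hot' far more often than 'cold' ... */

    numa_free(hot, bytes);
    numa_free(cold, bytes);
    return 0;
}
```

Statically partitioning allocations this way corresponds to the scheme the paper measures at a 9% speedup: the placement decision is made once, at allocation time, and never revisited as the application's working set shifts.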
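The second proposed technique, the software prefetcher, lifts classic stride detection from cache-line to page granularity. Below is a minimal sketch of that idea, assuming hypothetical hooks on_page_miss and migrate_page_to_stacked in the OS page-cache miss path; these names and the two-confirmation threshold are illustrative, not the paper's implementation.

```c
/* Sketch of a page-level stride prefetcher: classic stride-prefetching
 * logic applied to page numbers observed at page-cache misses. */
#include <stdint.h>

#define PREFETCH_DEGREE 4   /* pages fetched ahead on a confirmed stride */

struct stride_state {
    uintptr_t last_page;    /* page number of the previous miss */
    intptr_t  stride;       /* last observed inter-miss stride, in pages */
    int       confidence;   /* consecutive confirmations of that stride */
};

/* Hypothetical OS hook: migrate one page into die-stacked DRAM. */
extern void migrate_page_to_stacked(uintptr_t page);

/* Called from the page-cache miss handler with the faulting page number. */
void on_page_miss(struct stride_state *st, uintptr_t page)
{
    intptr_t d = (intptr_t)(page - st->last_page);
    if (d != 0 && d == st->stride) {
        if (st->confidence < 2)
            st->confidence++;   /* stride repeated: grow confidence */
    } else {
        st->stride = d;         /* new stride candidate: start over */
        st->confidence = 0;
    }
    st->last_page = page;

    if (st->confidence >= 2) {  /* stride confirmed twice: fetch ahead */
        for (int i = 1; i <= PREFETCH_DEGREE; i++)
            migrate_page_to_stacked(page + (uintptr_t)(i * st->stride));
    }
}
```

Because the detector runs in software on miss events, the prefetch degree and confirmation threshold are tunable policy knobs rather than fixed hardware parameters.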