A Software-Managed Approach to Die-Stacked DRAM

M. Oskin, G. Loh
{"title":"A Software-Managed Approach to Die-Stacked DRAM","authors":"M. Oskin, G. Loh","doi":"10.1109/PACT.2015.30","DOIUrl":null,"url":null,"abstract":"Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First we consider exposing die-stacked DRAM directly to applications, relying on the static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches, and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, which is a more general mechanism that is valuable beyond stacked DRAM, and enables OS-managed page caches to achieve a 27% speedup, the second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal and unrealistic system where all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching schemes -- whether hardware or software -- and therefore we recommend that the system still provides APIs to the application layers to explicitly control die-stacked DRAM allocations.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"57","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 57

Abstract

Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First, we consider exposing die-stacked DRAM directly to applications, relying on static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, a general mechanism that is valuable beyond stacked DRAM and enables OS-managed page caches to achieve a 27% speedup; the second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to a 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal (and unrealistic) system in which all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching scheme, whether hardware or software, and therefore we recommend that the system still provide APIs that let the application layer explicitly control die-stacked DRAM allocations.
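To make the first option concrete, the explicit-control API the paper recommends could take the shape of NUMA-style placement, with the stacked DRAM exposed as its own memory node. The sketch below uses the real libnuma interface; treating node 1 as the stacked-DRAM node, and the split into "hot" and "cold" allocations, are illustrative assumptions, not the authors' implementation.

```c
/* A minimal sketch of explicit die-stacked DRAM placement via libnuma.
 * Assumption: the stacked DRAM is exposed as NUMA node 1 and off-chip
 * DRAM as node 0. Build with: cc app.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy API not available\n");
        return 1;
    }

    const int stacked_node = 1;   /* assumed: fast on-package DRAM */
    size_t bytes = 64 << 20;      /* a 64 MiB working set */

    /* Statically place the hot structure in fast stacked DRAM... */
    double *hot = numa_alloc_onnode(bytes, stacked_node);
    /* ...and leave cold data in conventional off-chip DRAM (node 0). */
    double *cold = numa_alloc_onnode(bytes, 0);
    if (!hot || !cold)
        return 1;

    /* ... compute kernel touching 'hot' far more often than 'cold' ... */

    numa_free(hot, bytes);
    numa_free(cold, bytes);
    return 0;
}
```

Statically partitioning allocations this way corresponds to the scheme the paper measures at a 9% speedup: the placement decision is made once, at allocation time, and never revisited as the application's working set shifts.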
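The second proposed technique, the software prefetcher, lifts classic stride detection from cache-line to page granularity. Below is a minimal sketch of that idea, assuming hypothetical hooks on_page_miss and migrate_page_to_stacked in the OS page-cache miss path; these names and the two-confirmation threshold are illustrative, not the paper's implementation.

```c
/* Sketch of a page-level stride prefetcher: classic stride-prefetching
 * logic applied to page numbers observed at page-cache misses. */
#include <stdint.h>

#define PREFETCH_DEGREE 4   /* pages fetched ahead on a confirmed stride */

struct stride_state {
    uintptr_t last_page;    /* page number of the previous miss */
    intptr_t  stride;       /* last observed inter-miss stride, in pages */
    int       confidence;   /* consecutive confirmations of that stride */
};

/* Hypothetical OS hook: migrate one page into die-stacked DRAM. */
extern void migrate_page_to_stacked(uintptr_t page);

/* Called from the page-cache miss handler with the faulting page number. */
void on_page_miss(struct stride_state *st, uintptr_t page)
{
    intptr_t d = (intptr_t)(page - st->last_page);
    if (d != 0 && d == st->stride) {
        if (st->confidence < 2)
            st->confidence++;   /* stride repeated: grow confidence */
    } else {
        st->stride = d;         /* new stride candidate: start over */
        st->confidence = 0;
    }
    st->last_page = page;

    if (st->confidence >= 2) {  /* stride confirmed twice: fetch ahead */
        for (int i = 1; i <= PREFETCH_DEGREE; i++)
            migrate_page_to_stacked(page + (uintptr_t)(i * st->stride));
    }
}
```

Because the detector runs in software on miss events, the prefetch degree and confirmation threshold are tunable policy knobs rather than fixed hardware parameters.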