PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2015-06-13. DOI: 10.1145/2749469.2750385

Junwhan Ahn, S. Yoo, O. Mutlu, Kiyoung Choi

Pages: 336-348

Citations: 443

Abstract

Processing-in-memory (PIM) is rapidly rising as a viable solution for the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s due to practicality concerns, which are alleviated with recent advances in 3D stacking technologies. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and lack of ability to utilize large on-chip caches. In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or in processors depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host processor instructions. We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to the data locality of applications.
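The locality-aware dispatch described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware structure proposed in the paper: the `LocalityMonitor` class, the LRU policy, the 64-byte block size, and the `dispatch` and `run_trace` helpers are all hypothetical names and choices introduced here for illustration. The model tracks recently accessed blocks and sends an instruction to the host when its target block was seen recently (likely cached), and to memory otherwise.

```python
from collections import OrderedDict

BLOCK_BYTES = 64  # assumed cache-block size, not taken from the paper


class LocalityMonitor:
    """Toy LRU set of recently touched block addresses (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def touch(self, addr):
        """Record an access; return True if the block was already tracked."""
        block = addr // BLOCK_BYTES
        hit = block in self.blocks
        if hit:
            # Refresh recency so this block is evicted last.
            self.blocks.move_to_end(block)
        else:
            self.blocks[block] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return hit


def dispatch(monitor, addr):
    """Choose where one PIM-enabled instruction runs, based on locality."""
    return "host" if monitor.touch(addr) else "memory"


def run_trace(addrs, capacity=2):
    """Dispatch a sequence of target addresses through a fresh monitor."""
    monitor = LocalityMonitor(capacity)
    return [dispatch(monitor, a) for a in addrs]


if __name__ == "__main__":
    print(run_trace([0, 8, 4096, 0, 8192, 4096]))
```

With a two-entry monitor, the second access to block 0 (address 8) and the later access to address 0 go to the host, while first-time and evicted blocks go to memory. The program itself is unchanged in either case, which is the sense in which the choice of execution location stays invisible to the programmer.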

