藏匿:准备好你的便签本，并把它藏起来

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2015-06-13 DOI:10.1145/2749469.2750374

Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, S. Adve, Vikram S. Adve

{"title":"藏匿:准备好你的便签本，并把它藏起来","authors":"Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, S. Adve, Vikram S. Adve","doi":"10.1145/2749469.2750374","DOIUrl":null,"url":null,"abstract":"Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"45 1","pages":"707-719"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"71","resultStr":"{\"title\":\"Stash: Have your scratchpad and cache it too\",\"authors\":\"Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, S. Adve, Vikram S. Adve\",\"doi\":\"10.1145/2749469.2750374\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).\",\"PeriodicalId\":6878,\"journal\":{\"name\":\"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"45 1\",\"pages\":\"707-719\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"71\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2749469.2750374\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2749469.2750374","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 71

摘要

异构系统采用能源效率专业化。由于数据移动预计将成为主要的能源消耗者，这些系统采用专门的存储器(例如，刮擦板和fifo)来提高目标数据的效率。然而，这些内存结构往往存在于本地地址空间中，由于在全局空间和私有空间之间的数据移动效率低下，导致显著的性能和能量损失。我们提出了一种高效的异构存储系统，其中专用存储组件在统一和一致的地址空间中紧密耦合。本文将这些思想应用到一个带有刮擦板和缓存的cpu和gpu系统中。我们引入了一种新的内存组织，stash，它结合了缓存和刮擦板的优点，而不会产生它们的缺点。与刮擦板一样，存储库是直接寻址的(没有标签和TLB访问)，并提供紧凑的存储。与缓存一样，存储库是全局可寻址和可见的，提供隐式数据移动和增加的数据重用。我们展示了存储库提供了比缓存和便签簿更好的性能和能量，同时支持异构系统的新用例。对于利用新用例(例如，跨GPU计算内核的重用)的4个微基准测试，与scratchpad和缓存相比，stash分别平均减少了27%和13%的执行周期，平均减少了53%和35%的能源。对于目前的7个GPU应用程序，这些应用程序没有设计利用stash的新功能，与刮擦板和缓存相比，stash平均减少了10%和12%的周期(最大22%和31%)，平均减少了16%和32%的能量(最大30%和51%)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Stash: Have your scratchpad and cache it too

Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

自引率

0.00%

发文量

期刊最新文献

Redundant Memory Mappings for fast access to large memories Multiple Clone Row DRAM: A low latency and area optimized DRAM Manycore Network Interfaces for in-memory rack-scale computing Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures ShiDianNao: Shifting vision processing closer to the sensor