Data Convection

Soheil Khadirsharbiyani, Jagadish B. Kotra, Karthik Rao, M. Kandemir
{"title":"Data Convection","authors":"Soheil Khadirsharbiyani, Jagadish B. Kotra, Karthik Rao, M. Kandemir","doi":"10.1145/3508027","DOIUrl":null,"url":null,"abstract":"Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer make them an attractive choice, particularly, in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner which in tandem with the 2.5D stacked DRAM increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation that the sections of the 3D stacked DRAM that are closer to the GPU have lower retention-times compared to the farther layers of stacked DRAM. This thermal-induced variable retention-times causes certain sections of 3D stacked DRAM to be refreshed more frequently compared to the others, thereby resulting in thermal-induced NUMA paradigms. To alleviate such thermal-induced NUMA behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most-frequently accessed data in a thermal-induced retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity they offer. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which, in tandem with the 2.5D stacked DRAM, increases both capacity and bandwidth without increasing the package size. This integration of 3D stacked DRAMs helps satisfy the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation: the sections of the 3D stacked DRAM that are closer to the GPU have lower retention times than the farther layers. This thermal-induced variation in retention times causes certain sections of the 3D stacked DRAM to be refreshed more frequently than others, resulting in thermal-induced NUMA behavior. To alleviate this behavior, we propose and experimentally evaluate three incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, which place the most frequently accessed data in a thermal-induced, retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer, and Intra + Inter-layer algorithms improve overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.
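
To make the placement idea concrete, below is a minimal sketch of how a retention-aware, inter-layer placement policy could look. This is not the paper's implementation: the layer count, retention times, page granularity, and the greedy policy are all illustrative assumptions, and the bank-level and channel-level interleaving that the paper also exploits is omitted for brevity.

```python
"""Illustrative sketch in the spirit of the paper's Inter-layer Data
Convection. All names, data structures, and numbers are hypothetical."""

from dataclasses import dataclass, field

@dataclass
class Layer:
    """One 3D-stacked DRAM layer. Layers closer to the GPU run hotter
    and therefore have shorter retention times (assumed values)."""
    index: int                 # 0 = closest to the GPU (hottest)
    retention_ms: float        # thermally limited retention time
    capacity_pages: int
    pages: list = field(default_factory=list)

def refresh_rate(layer: Layer) -> float:
    """Refreshes per second: every cell must be refreshed at least once
    per retention window, so shorter retention means more refreshes and
    less bandwidth left over for demand accesses."""
    return 1000.0 / layer.retention_ms

def place_pages(access_counts: dict, layers: list) -> None:
    """Greedy inter-layer placement: the hottest pages go to the layers
    with the longest retention (fewest refresh interruptions)."""
    # Coolest layers (farthest from the GPU) first.
    by_retention = sorted(layers, key=lambda l: l.retention_ms, reverse=True)
    # Most frequently accessed pages first.
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    it = iter(by_retention)
    layer = next(it)
    for page in hot_first:
        while len(layer.pages) >= layer.capacity_pages:
            layer = next(it)   # current layer is full; spill to the next
        layer.pages.append(page)

# Hypothetical 4-layer stack: layer 0 sits on the GPU and is hottest.
stack = [Layer(0, 16.0, 2), Layer(1, 32.0, 2),
         Layer(2, 48.0, 2), Layer(3, 64.0, 2)]
counts = {"pA": 900, "pB": 700, "pC": 50, "pD": 40, "pE": 10, "pF": 5}
place_pages(counts, stack)
for l in stack:
    print(f"layer {l.index}: retention {l.retention_ms} ms, "
          f"~{refresh_rate(l):.1f} refreshes/s, pages {l.pages}")
```

The key intuition the sketch captures is that a layer with half the retention time must be refreshed twice as often, stealing bandwidth from demand accesses; steering the hottest pages toward long-retention layers therefore keeps the most bandwidth-critical data on the least refresh-interrupted layers.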