Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs

IEEE Journal on Emerging and Selected Topics in Circuits and Systems. IF 3.7; Zone 2 (Engineering & Technology); JCR Q2 (Engineering, Electrical & Electronic). Pub Date: 2024-08-05. DOI: 10.1109/JETCAS.2024.3439193

Vahid Geraeinejad;Qiran Qian;Masoumeh Ebrahimi

Full text: https://ieeexplore.ieee.org/document/10623472/

Citations: 0

Abstract

GPUs are widely used as the primary devices for running a broad spectrum of applications, from general-purpose computing to Artificial Intelligence (AI). The register file, the largest SRAM on the GPU die, accounts for over 20% of total GPU energy consumption. The register cache was introduced to reduce register file traffic and thus lower total energy consumption when CUDA cores are used. However, its use has not been thoroughly investigated for Tensor Cores, which are integrated into recent GPU architectures to meet AI workload demands. In this paper, we study the use of the register cache with both CUDA cores and Tensor Cores and examine its pros and cons in each case. We have developed an open-source analytical simulator, called RFC-sim, to model and measure the energy consumption of both the register file and the register cache. Our results show that while the register cache can reduce energy consumption by up to 40% on CUDA cores, it increases energy consumption by up to 23% on Tensor Cores. The main reason is the limited capacity of the register cache, which is too small to capture the locality of Tensor Core workloads.
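The capacity argument in the abstract can be illustrated with a toy analytical model. The sketch below is not RFC-sim and does not reproduce the paper's numbers: the per-access energies, the LRU replacement policy, the 6-entry capacity, and the two register traces are all hypothetical values chosen only to show how a cache that helps a small working set can cost energy when the working set exceeds its capacity.

```python
from collections import OrderedDict

# Hypothetical per-access energies in arbitrary units (not taken from the paper).
E_RF = 1.0    # one register file access
E_RFC = 0.2   # one register cache access (lookup, read, or fill)


def energy_without_cache(trace):
    """Every operand access goes straight to the register file."""
    return len(trace) * E_RF


def energy_with_cache(trace, capacity):
    """Assumed LRU register cache in front of the register file.

    A hit costs one cache access. A miss costs a cache lookup, a register
    file access, and a cache fill.
    """
    cache = OrderedDict()
    energy = 0.0
    for reg in trace:
        if reg in cache:
            cache.move_to_end(reg)         # refresh LRU position
            energy += E_RFC
        else:
            energy += E_RFC + E_RF + E_RFC
            cache[reg] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used entry
    return energy


# Hypothetical traces: 3 registers reused repeatedly versus 16 registers
# streamed in a loop, both in front of a 6-entry cache.
small_ws = [0, 1, 2] * 20
large_ws = list(range(16)) * 4

saving_small = 1 - energy_with_cache(small_ws, 6) / energy_without_cache(small_ws)
saving_large = 1 - energy_with_cache(large_ws, 6) / energy_without_cache(large_ws)
print(f"small working set: {saving_small:+.0%} energy saved")
print(f"large working set: {saving_large:+.0%} energy saved")
```

In this toy setup the second trace never hits in the cache, so every access pays the lookup and fill overhead on top of the register file access and the saving turns negative. Real Tensor Core access patterns are more complex than a cyclic stream; the large trace only stands in for a working set that does not fit.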

Source journal

IEEE Journal on Emerging and Selected Topics in Circuits and Systems (Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 2.20%

About the journal: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.
