Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs
Authors: Vahid Geraeinejad, Qiran Qian, Masoumeh Ebrahimi
Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems
Published: 2024-08-05
DOI: 10.1109/JETCAS.2024.3439193
URL: https://ieeexplore.ieee.org/document/10623472/
Citation count: 0
Abstract
GPUs are widely used as the primary devices for running a broad spectrum of applications, covering general-purpose applications as well as Artificial Intelligence (AI) applications. The register file, the largest SRAM on the GPU die, accounts for over 20% of total GPU energy consumption. A register cache has been introduced to reduce register file traffic and thus decrease total energy consumption when CUDA cores are used. However, register cache utilization has not been thoroughly investigated for Tensor Cores, which are integrated into recent GPU architectures to meet the demands of AI workloads. In this paper, we study the use of a register cache with both CUDA cores and Tensor Cores and thoroughly examine its pros and cons. We have developed an open-source analytical simulator, called RFC-sim, to model and measure the energy consumption of both the register file and the register cache. Our results show that while the register cache can reduce energy consumption by up to 40% in CUDA cores, it increases energy consumption by up to 23% in Tensor Cores. The main reason is the limited capacity of the register cache, which is insufficient for Tensor Cores to capture locality.
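The trade-off described in the last sentence can be illustrated with a toy analytical model. The sketch below is not RFC-sim and does not reproduce the paper's numbers. It assumes a fully associative LRU register cache that is probed on every operand read and filled on every miss, and it uses hypothetical per-access energies (`E_RF_READ`, `E_RFC_ACCESS`, `E_RFC_FILL`) in arbitrary units. The two register traces are synthetic: one reuses a small set of registers (loosely CUDA-core-like), and the other cycles through more registers than the cache holds (loosely Tensor-Core-like, since matrix fragments occupy many registers per thread).

```python
from collections import OrderedDict

# Hypothetical per-access energies in arbitrary units (assumptions, NOT values from the paper).
E_RF_READ = 1.0     # one register file read
E_RFC_ACCESS = 0.2  # one register cache probe/read
E_RFC_FILL = 0.2    # writing a missed operand into the register cache


def operand_energy(trace, capacity, use_cache=True):
    """Energy to read every register ID in `trace` under a simplified model.

    Without a cache, each read costs one register file access.
    With a fully associative LRU cache of `capacity` entries (an assumed
    read-allocate policy), a hit costs one cache probe, while a miss costs
    the probe plus a register file read plus a cache fill.
    """
    if not use_cache:
        return len(trace) * E_RF_READ
    cache = OrderedDict()  # keys kept in LRU order: oldest first
    energy = 0.0
    for reg in trace:
        energy += E_RFC_ACCESS  # every read probes the cache first
        if reg in cache:
            cache.move_to_end(reg)  # hit: mark as most recently used
        else:
            energy += E_RF_READ + E_RFC_FILL  # miss: fetch from RF, then fill
            cache[reg] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return energy


# Synthetic traces (hypothetical, not taken from real kernels).
small_working_set = [0, 1, 2] * 100       # 3 registers reused repeatedly
large_working_set = list(range(16)) * 20  # 16 registers cycled in order

for name, trace in [("small working set", small_working_set),
                    ("large working set", large_working_set)]:
    baseline = operand_energy(trace, capacity=6, use_cache=False)
    cached = operand_energy(trace, capacity=6)
    print(f"{name}: {cached / baseline:.2f}x baseline energy")
```

With these assumed numbers, the small-working-set trace costs about 0.21x the baseline energy, while the cyclic trace thrashes the 6-entry LRU cache, misses on every read, and costs 1.40x, because each read pays for the probe and fill on top of the register file access. Raising `capacity` to 16 or more turns the second trace back into a saving, which mirrors the capacity argument above in miniature.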
Journal description:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.