GPU Register Packing: Dynamically Exploiting Narrow-Width Operands to Improve Performance

2017 IEEE Trustcom/BigDataSE/ICESS. Pub Date: 2017-08-01. DOI: 10.1109/Trustcom/BigDataSE/ICESS.2017.308

Xin Wang, Wei Zhang


Citations: 7

Abstract

Graphics processing units (GPUs) are increasingly used to accelerate general-purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput and hide memory latency. As a result, a very large register file (RF) is typically required to enable fast, low-cost context switching among tens of thousands of active threads. However, RF resources are still insufficient to sustain all of the available thread-level parallelism, and the shortage can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity cannot meet the requirements of a thread block, the GPU must fetch some variables from local memory, which may lead to long memory access latencies. Observing that, for many GPGPU applications, a large percentage of computed results have fewer significant bits than the full width of a 32-bit register, we propose a GPU register packing scheme that dynamically exploits narrow-width operands and packs multiple operands into a single full-width register. With dynamic register packing, more RF space becomes available, which allows the GPU to enable more TLP by assigning additional thread blocks to SMs (Streaming Multiprocessors) and thus improves performance. The experimental results show that our GPU register packing scheme achieves a speedup of up to 1.96X, and 1.18X on average.
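The two ideas in the abstract, packing narrow-width operands into one register and gaining occupancy from the freed RF space, can be illustrated in software. The Python sketch below is only an illustration under stated assumptions, not the authors' hardware mechanism, which the abstract does not describe. The 16-bit packing granularity, the 65536-register RF per SM, the per-thread register counts, and all function names are hypothetical choices made for the example.

```python
MASK32 = 0xFFFFFFFF


def significant_bits(value: int) -> int:
    """Bits needed to hold a 32-bit two's-complement value, sign bit included."""
    value &= MASK32
    if value & 0x80000000:            # negative: measure the inverted bit pattern
        value = ~value & MASK32
    return value.bit_length() + 1     # +1 accounts for the sign bit


def is_narrow(value: int, width: int = 16) -> bool:
    """True if the value fits in `width` bits (assumed packing granularity)."""
    return significant_bits(value) <= width


def pack_pair(lo: int, hi: int) -> int:
    """Pack two 16-bit-narrow operands into one 32-bit register image."""
    if not (is_narrow(lo) and is_narrow(hi)):
        raise ValueError("operand needs more than 16 bits")
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack_pair(reg: int) -> tuple:
    """Recover both operands, sign-extending each 16-bit half."""
    def sign_extend16(half: int) -> int:
        return half - 0x10000 if half & 0x8000 else half
    return sign_extend16(reg & 0xFFFF), sign_extend16((reg >> 16) & 0xFFFF)


def blocks_per_sm(regs_per_thread: int, threads_per_block: int,
                  rf_regs_per_sm: int = 65536) -> int:
    """Thread blocks that fit in the RF, ignoring every other occupancy limit.

    rf_regs_per_sm is an assumed RF size, not a figure from the paper.
    """
    return rf_regs_per_sm // (regs_per_thread * threads_per_block)


reg = pack_pair(-3, 1000)
print(hex(reg), unpack_pair(reg))     # 0x3e8fffd (-3, 1000)
print(blocks_per_sm(40, 256))         # 6 blocks with 40 registers per thread
print(blocks_per_sm(32, 256))         # 8 blocks if packing frees 8 registers per thread
```

In this hypothetical configuration, reducing the effective register demand from 40 to 32 per thread raises the RF-limited block count from 6 to 8. Real occupancy also depends on shared memory, warp slots, and register allocation granularity, which the sketch leaves out.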
