GPU寄存器打包:动态利用窄宽度操作数来提高性能

Xin Wang, Wei Zhang
{"title":"GPU寄存器打包:动态利用窄宽度操作数来提高性能","authors":"Xin Wang, Wei Zhang","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.308","DOIUrl":null,"url":null,"abstract":"Graphics processing units(GPUs) have been increasingly used to accelerate general purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput as well as memory latency hiding. As a result, a very large register file (RF) is typically required to enable fast and low-cost context switching between tens of thousands of active threads. However, RF resource is still insufficient to enable all thread level parallelism and the lack of RF resources can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity can not fit the requirement of a thread block, GPU needs to fetch some variables from local memory which may lead to long memory access latencies. By observing that a large percentage of computed results actually have fewer significant bits compared to the full width of a 32-bit register for many GPGPU applications, we propose a GPU register packing scheme to dynamically exploit narrowwidth operands and pack multiple operands into a single fullwidth register. By using dynamically register packing, more RF space is available which allows GPU to enable more TLP through assigning additional thread blocks on SMs (Streaming Multiprocessors) and thus improve performance. The experimental results show that our GPU register packing scheme can achieve up to 1.96X speedup and 1.18X on average.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"GPU Register Packing: Dynamically Exploiting Narrow-Width Operands to Improve Performance\",\"authors\":\"Xin Wang, Wei Zhang\",\"doi\":\"10.1109/Trustcom/BigDataSE/ICESS.2017.308\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics processing units(GPUs) have been increasingly used to accelerate general purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput as well as memory latency hiding. As a result, a very large register file (RF) is typically required to enable fast and low-cost context switching between tens of thousands of active threads. However, RF resource is still insufficient to enable all thread level parallelism and the lack of RF resources can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity can not fit the requirement of a thread block, GPU needs to fetch some variables from local memory which may lead to long memory access latencies. By observing that a large percentage of computed results actually have fewer significant bits compared to the full width of a 32-bit register for many GPGPU applications, we propose a GPU register packing scheme to dynamically exploit narrowwidth operands and pack multiple operands into a single fullwidth register. By using dynamically register packing, more RF space is available which allows GPU to enable more TLP through assigning additional thread blocks on SMs (Streaming Multiprocessors) and thus improve performance. The experimental results show that our GPU register packing scheme can achieve up to 1.96X speedup and 1.18X on average.\",\"PeriodicalId\":170253,\"journal\":{\"name\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.308\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.308","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

摘要

图形处理单元(gpu)已经越来越多地用于加速通用计算。通过利用大规模线程级并行性(TLP), gpu可以实现高吞吐量和内存延迟隐藏。因此,通常需要一个非常大的寄存器文件(RF)来实现成千上万个活动线程之间快速和低成本的上下文切换。然而,RF资源仍然不足以实现所有线程级别的并行性,并且RF资源的缺乏会通过限制GPU线程的占用而损害性能。此外,如果可用的RF容量不能满足线程块的要求,GPU需要从本地内存中获取一些变量,这可能会导致较长的内存访问延迟。通过观察到对于许多GPGPU应用程序,与32位寄存器的全宽度相比,大部分计算结果实际上具有更少的有效位,我们提出了一种GPU寄存器打包方案,以动态利用窄宽度操作数并将多个操作数打包到单个全宽度寄存器中。通过使用动态注册封装,更多的RF空间可用,这允许GPU通过在SMs(流多处理器)上分配额外的线程块来启用更多的TLP,从而提高性能。实验结果表明,我们的GPU寄存器打包方案可以实现高达1.96倍的加速,平均速度为1.18倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
GPU Register Packing: Dynamically Exploiting Narrow-Width Operands to Improve Performance
Graphics processing units(GPUs) have been increasingly used to accelerate general purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs can achieve high throughput as well as memory latency hiding. As a result, a very large register file (RF) is typically required to enable fast and low-cost context switching between tens of thousands of active threads. However, RF resource is still insufficient to enable all thread level parallelism and the lack of RF resources can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity can not fit the requirement of a thread block, GPU needs to fetch some variables from local memory which may lead to long memory access latencies. By observing that a large percentage of computed results actually have fewer significant bits compared to the full width of a 32-bit register for many GPGPU applications, we propose a GPU register packing scheme to dynamically exploit narrowwidth operands and pack multiple operands into a single fullwidth register. By using dynamically register packing, more RF space is available which allows GPU to enable more TLP through assigning additional thread blocks on SMs (Streaming Multiprocessors) and thus improve performance. The experimental results show that our GPU register packing scheme can achieve up to 1.96X speedup and 1.18X on average.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Insider Threat Detection Through Attributed Graph Clustering SEEAD: A Semantic-Based Approach for Automatic Binary Code De-obfuscation A Public Key Encryption Scheme for String Identification Vehicle Incident Hot Spots Identification: An Approach for Big Data Implementing Chain of Custody Requirements in Database Audit Records for Forensic Purposes
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1