Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI:10.1145/2485922.2485953

Minsoo Rhu, M. Erez

{"title":"Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation","authors":"Minsoo Rhu, M. Erez","doi":"10.1145/2485922.2485953","DOIUrl":null,"url":null,"abstract":"Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential for compaction. We observe that control frequently diverges in a manner that prevents compaction because of the way in which the fixed alignment of threads to lanes is done. This paper presents an in-depth analysis on the causes for ineffective compaction. An important observation is that in many cases, control diverges because of programmatic branches, which do not depend on input data. This behavior, when combined with the default mapping of threads to lanes, severely restricts compaction. We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. SLP seeks to rearrange how threads are mapped to lanes to allow even programmatic branches to be compacted effectively, improving SIMD utilization up to 34% accompanied by a maximum 25% performance boost.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 40th Annual International Symposium on Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2485922.2485953","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 53

Abstract

Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential for compaction. We observe that control frequently diverges in a manner that prevents compaction because of the way in which the fixed alignment of threads to lanes is done. This paper presents an in-depth analysis on the causes for ineffective compaction. An important observation is that in many cases, control diverges because of programmatic branches, which do not depend on input data. This behavior, when combined with the default mapping of threads to lanes, severely restricts compaction. We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. SLP seeks to rearrange how threads are mapped to lanes to allow even programmatic branches to be compacted effectively, improving SIMD utilization up to 34% accompanied by a maximum 25% performance boost.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

最大化SIMD通道排列的gpgpu中的SIMD资源利用率

当前的gpu通过将硬件的SIMD特性抽象为独立的并发控制线程来保持高可编程性，其中硬件负责生成谓词掩码，以利用SIMD硬件进行不同的控制流。当对同一SIMD组中的不同线程的控制出现分歧时，这种动态屏蔽会导致SIMD资源利用率低下。先前的研究表明，SIMD组可以通过将大量线程压缩成组来动态地形成，从而减轻发散的影响。但是，为了保持硬件效率，线程与SIMD通道的对齐是固定的，从而限制了压缩的可能性。我们观察到，由于线程对通道的固定对齐方式，控制经常以一种防止压缩的方式偏离。本文对压实无效的原因进行了深入分析。一个重要的观察是，在许多情况下，由于不依赖于输入数据的编程分支，控制发生了发散。当与线程到通道的默认映射结合使用时，这种行为严重限制了压缩。然后，我们提出SIMD车道排列(SLP)作为一种优化，以扩大在这种车道对齐情况下压缩的适用性。SLP试图重新安排线程映射到通道的方式，以允许有效地压缩编程分支，从而将SIMD利用率提高到34%，同时最大提高25%的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 40th Annual International Symposium on Computer Architecture

自引率

0.00%

发文量