SIMD divergence optimization through intra-warp compaction

Proceedings of the 40th Annual International Symposium on Computer Architecture. Pub Date: 2013-06-23. DOI: 10.1145/2485922.2485954

A. S. Vaidya, A. Shayesteh, Dong Hyuk Woo, Roy Saharoy, M. Azimi


Citations: 31

Abstract

SIMD execution units in GPUs are increasingly used for high-performance and energy-efficient acceleration of general-purpose applications. However, SIMD control flow divergence can reduce execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency therefore has the potential to bring significant performance and energy benefits to a wide range of such data-parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of it. However, these techniques are often quite complex and are therefore unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures that use relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. We also outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better-provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
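The abstract does not spell out how the two compression schemes work, so the following is only a minimal sketch of the general idea under stated assumptions: a 16-lane instruction is issued over 4-lane groups at one group per cycle, BCC skips any group whose lanes are all turned off, and SCC can rearrange lanes so that active lanes share as few groups as possible. The function names, the lane and group widths, and the idealized repacking in `cycles_scc` are illustrative choices, not details taken from the paper.

```python
from math import ceil

def cycles_baseline(mask, width=4):
    """No compression: every group of `width` lanes costs one cycle."""
    return ceil(len(mask) / width)

def cycles_bcc(mask, width=4):
    """Assumed BCC behaviour: skip groups in which every lane is off."""
    groups = [mask[i:i + width] for i in range(0, len(mask), width)]
    return sum(1 for group in groups if any(group))

def cycles_scc(mask, width=4):
    """Assumed SCC behaviour (idealized): active lanes are repacked
    so they occupy the minimum number of groups."""
    return ceil(sum(mask) / width)

# Hypothetical execution mask: 1 = lane active, 0 = lane turned off.
mask = [1, 0, 0, 0,  0, 0, 0, 0,  0, 1, 0, 0,  0, 0, 0, 0]

print(cycles_baseline(mask), cycles_bcc(mask), cycles_scc(mask))
```

With this mask, two of the four groups contain no active lane, so the BCC model issues two cycles instead of four. The two remaining active lanes sit in different groups, which is the case where a swizzle would be needed to merge them into a single cycle.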


