Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs

2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI:10.1109/ICPP.2016.27

Zhibin Yu, L. Eeckhout, Chengzhong Xu

{"title":"Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs","authors":"Zhibin Yu, L. Eeckhout, Chengzhong Xu","doi":"10.1109/ICPP.2016.27","DOIUrl":null,"url":null,"abstract":"Graphics processing units (GPUs) have recently evolved into popular accelerators for general-purpose parallel programs -- so-called GPGPU computing. Although programming models such as CUDA and OpenCL significantly improve GPGPU programmability, optimizing GPGPU programs is still far from trivial. Branch divergence is one of the root causes reducing GPGPU performance. Existing approaches are able to calculate the branch divergence rate but are unable to reveal how the branches diverge in a GPGPU program. In this paper, we propose the Thread Similarity Matrix (TSM) to visualize how branches diverge and in turn help find optimization opportunities. TSM contains an element for each pair of threads, representing the difference in code being executed by the pair of threads. The darker the element, the more similar the threads are, the lighter, the more dissimilar. TSM therefore allows GPGPU programmers to easily understand an application's branch divergence behavior and pinpoint performance anomalies. We present a case study to demonstrate how TSM can help optimize GPGPU programs: we improve the performance of a highly-optimized GPGPU kernel by 35% by reorganizing its thread organization to reduce its branch divergence rate.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 45th International Conference on Parallel Processing (ICPP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2016.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Graphics processing units (GPUs) have recently evolved into popular accelerators for general-purpose parallel programs -- so-called GPGPU computing. Although programming models such as CUDA and OpenCL significantly improve GPGPU programmability, optimizing GPGPU programs is still far from trivial. Branch divergence is one of the root causes reducing GPGPU performance. Existing approaches are able to calculate the branch divergence rate but are unable to reveal how the branches diverge in a GPGPU program. In this paper, we propose the Thread Similarity Matrix (TSM) to visualize how branches diverge and in turn help find optimization opportunities. TSM contains an element for each pair of threads, representing the difference in code being executed by the pair of threads. The darker the element, the more similar the threads are, the lighter, the more dissimilar. TSM therefore allows GPGPU programmers to easily understand an application's branch divergence behavior and pinpoint performance anomalies. We present a case study to demonstrate how TSM can help optimize GPGPU programs: we improve the performance of a highly-optimized GPGPU kernel by 35% by reorganizing its thread organization to reduce its branch divergence rate.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

线程相似矩阵:GPGPU程序中分支发散的可视化

图形处理单元(gpu)最近已经发展成为通用并行程序的流行加速器——即所谓的GPGPU计算。尽管CUDA和OpenCL等编程模型显著提高了GPGPU的可编程性，但优化GPGPU程序仍然远非易事。分支发散是导致GPGPU性能下降的根本原因之一。现有的方法能够计算分支发散率，但无法揭示GPGPU程序中的分支如何发散。在本文中，我们提出了线程相似矩阵(TSM)来可视化分支如何发散，从而帮助找到优化机会。TSM为每对线程包含一个元素，表示这对线程执行的代码的差异。颜色越深，线越相似，颜色越浅，线越不相似。因此，TSM允许GPGPU程序员轻松地理解应用程序的分支偏离行为并查明性能异常。我们提供了一个案例研究来演示TSM如何帮助优化GPGPU程序:我们通过重新组织线程组织以降低分支发散率，将高度优化的GPGPU内核的性能提高了35%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2016 45th International Conference on Parallel Processing (ICPP)

自引率

0.00%

发文量

期刊最新文献

Parallel k-Means++ for Multiple Shared-Memory Architectures RCHC: A Holistic Runtime System for Concurrent Heterogeneous Computing Partial Flattening: A Compilation Technique for Irregular Nested Parallelism on GPGPUs Improving RAID Performance Using an Endurable SSD Cache PARVMEC: An Efficient, Scalable Implementation of the Variational Moments Equilibrium Code