Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs

Zhibin Yu, L. Eeckhout, Chengzhong Xu
Published in: 2016 45th International Conference on Parallel Processing (ICPP), 2016-08-01
DOI: 10.1109/ICPP.2016.27 (https://doi.org/10.1109/ICPP.2016.27)
Citations: 1

Abstract

Graphics processing units (GPUs) have recently evolved into popular accelerators for general-purpose parallel programs, so-called GPGPU computing. Although programming models such as CUDA and OpenCL significantly improve GPGPU programmability, optimizing GPGPU programs is still far from trivial. Branch divergence is one of the root causes of reduced GPGPU performance. Existing approaches can calculate the branch divergence rate but cannot reveal how the branches diverge in a GPGPU program. In this paper, we propose the Thread Similarity Matrix (TSM) to visualize how branches diverge and, in turn, help find optimization opportunities. TSM contains one element for each pair of threads, representing the difference in the code executed by that pair. The darker the element, the more similar the two threads; the lighter the element, the more dissimilar. TSM therefore allows GPGPU programmers to easily understand an application's branch divergence behavior and pinpoint performance anomalies. We present a case study to demonstrate how TSM can help optimize GPGPU programs: we improve the performance of a highly optimized GPGPU kernel by 35% by restructuring its thread organization to reduce its branch divergence rate.
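The abstract describes TSM as a matrix with one element per pair of threads, shaded by how different the executed code is. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each thread is summarized by a trace of basic-block labels and uses a normalized L1 distance between block-execution counts as the pairwise difference; the abstract does not specify the paper's actual metric. The trace format and all function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def dissimilarity(a: Counter, b: Counter) -> float:
    """Normalized L1 distance between two basic-block count vectors.

    Assumed metric (not taken from the paper): 0.0 means the two threads
    executed every block the same number of times, 1.0 means they share
    no executed blocks at all.
    """
    keys = set(a) | set(b)
    diff = sum(abs(a[k] - b[k]) for k in keys)
    total = sum(a.values()) + sum(b.values())
    return diff / total if total else 0.0


def thread_similarity_matrix(traces: List[Sequence[str]]) -> List[List[float]]:
    """Build a symmetric matrix with one element per pair of threads."""
    counts = [Counter(trace) for trace in traces]
    n = len(counts)
    tsm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dissimilarity(counts[i], counts[j])
            tsm[i][j] = tsm[j][i] = d
    return tsm


def to_gray(tsm: List[List[float]]) -> List[List[int]]:
    """Map each element to an 8-bit gray level.

    Darker (closer to 0) means more similar, lighter (closer to 255)
    means more dissimilar, matching the shading rule in the abstract.
    """
    return [[round(255 * d) for d in row] for row in tsm]


# Hypothetical traces: threads 0 and 1 take the same branch (block B),
# threads 2 and 3 take the other branch (block C), thread 3 loops once more.
traces = [
    ["A", "B", "D"],
    ["A", "B", "D"],
    ["A", "C", "D"],
    ["A", "C", "C", "D"],
]
tsm = thread_similarity_matrix(traces)
gray = to_gray(tsm)
```

With these traces, the elements for threads 0 and 1 are fully dark, while elements pairing a B-path thread with a C-path thread are lighter. Rendered as an image, groups of threads that follow the same control-flow path show up as dark blocks, which is the kind of pattern the abstract says helps programmers see how branches diverge.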