A Taxonomy of GPGPU Performance Scaling

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI:10.1109/IISWC.2015.22

Abhinandan Majumdar, Gene Y. Wu, K. Dev, J. Greathouse, Indrani Paul, Wei Huang, Arjun Venugopal, Leonardo Piga, Chip Freitag, Sooraj Puthoor


Citations: 13

Abstract

Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.
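The scaling behaviors the abstract describes (scaling directly with added resources, plateauing, or losing performance as units are added) can be made concrete with a small sketch. This is only an illustration, not the paper's taxonomy or method: the function name `classify_scaling`, the four labels, and the 5% tolerance `tol` are all assumptions introduced here.

```python
def classify_scaling(settings, perf, tol=0.05):
    """Label how performance responds as one hardware knob is increased.

    settings: knob values in ascending order (e.g. compute-unit counts).
    perf:     measured performance at each setting (higher is better).
    tol:      hypothetical threshold; relative changes within +/- tol
              are treated as "no change".
    """
    if len(settings) != len(perf) or len(settings) < 2:
        raise ValueError("need at least two matching (setting, perf) points")
    if list(settings) != sorted(settings):
        raise ValueError("settings must be in ascending order")

    # Relative performance change between each pair of consecutive settings.
    steps = [(perf[i + 1] - perf[i]) / perf[i] for i in range(len(perf) - 1)]

    if any(s < -tol for s in steps):
        return "loses performance"   # more hardware made the kernel slower
    if all(abs(s) <= tol for s in steps):
        return "insensitive"         # the knob never mattered
    if abs(steps[-1]) <= tol:
        return "plateaus"            # gains stop at the high end
    return "scales"                  # still improving at the largest setting


if __name__ == "__main__":
    cus = [4, 8, 16, 32]  # made-up compute-unit counts, not data from the paper
    print(classify_scaling(cus, [1.0, 2.0, 3.9, 7.8]))
    print(classify_scaling(cus, [1.0, 1.9, 2.0, 2.02]))
    print(classify_scaling(cus, [1.0, 1.8, 1.5, 1.2]))
```

Applying a rule like this independently to core frequency, memory bandwidth, and compute-unit count would give each kernel one label per axis; the paper's actual categories and criteria may differ.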

