Fast Computational GPU Design with GT-Pin
Melanie Kambadur, Sunpyo Hong, Juan Cabral, H. Patil, C. Luk, S. Sajid, Martha A. Kim
2015 IEEE International Symposium on Workload Characterization (IISWC), October 4, 2015
DOI: 10.1109/IISWC.2015.14
Citations: 23
Abstract
As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).
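The abstract's third contribution, picking small representative subsets of a program to simulate in place of the full run, is in the same family as SimPoint-style phase selection: summarize each execution interval as a basic-block vector (BBV), cluster the vectors, and simulate only one medoid interval per cluster, weighting its results by the cluster's share of total instructions. The sketch below illustrates that general idea only; it is not GT-Pin's actual selection algorithm, and the function name, interval granularity, and plain k-means clustering are all assumptions for illustration.

```python
import math
import random

def select_representatives(bbvs, weights, k, iters=50, seed=0):
    """Illustrative SimPoint-style selection (NOT the paper's exact method).

    bbvs:    one basic-block vector per execution interval (list of lists).
    weights: instruction count of each interval.
    k:       number of representatives; raising k trades simulation speed
             for accuracy, the knob the abstract describes.
    Returns (representative interval indices, fraction of all instructions
    each representative stands in for).
    """
    rng = random.Random(seed)

    def norm(v):
        # Normalize so clustering compares instruction mix, not interval size.
        m = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / m for x in v]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    xs = [norm(v) for v in bbvs]
    centers = [xs[i][:] for i in rng.sample(range(len(xs)), k)]
    assign = [0] * len(xs)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(x, centers[c])) for x in xs]
        for c in range(k):
            members = [xs[i] for i in range(len(xs)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

    total = sum(weights)
    reps, shares = [], []
    for c in range(k):
        idx = [i for i in range(len(xs)) if assign[i] == c]
        if not idx:
            continue  # empty cluster contributes no representative
        # Medoid: the real interval closest to the cluster centroid.
        reps.append(min(idx, key=lambda i: dist2(xs[i], centers[c])))
        shares.append(sum(weights[i] for i in idx) / total)
    return reps, shares
```

Usage: with two intervals dominated by one basic block and two by another, two representatives are chosen and their weights cover all instructions, so a detailed simulator need only run those two intervals and scale each result by its share.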