Accelerating GPGPU architecture simulation

Measurement and Modeling of Computer Systems. Pub Date: 2013-06-17. DOI: 10.1145/2465529.2465540

Zhibin Yu, L. Eeckhout, Nilanjan Goswami, Tao Li, L. John, Hai Jin, Chengzhong Xu


Citations: 13

Abstract

Recently, graphics processing units (GPUs) have opened up new opportunities for speeding up general-purpose parallel applications due to their massive computational power and up to hundreds of thousands of threads enabled by programming models such as CUDA. However, due to the serial nature of existing micro-architecture simulators, these massively parallel architectures and workloads need to be simulated sequentially. As a result, simulating GPGPU architectures with typical benchmarks and input data sets is extremely time-consuming. This paper addresses the GPGPU architecture simulation challenge by generating miniature, yet representative GPGPU kernels. We first summarize the static characteristics of an existing GPGPU kernel in a profile, and analyze its dynamic behavior using the novel concept of the divergence flow statistics graph (DFSG). We subsequently use a GPGPU kernel synthesizing framework to generate a miniature proxy of the original kernel, which can reduce simulation time significantly. The key idea is to reduce the number of simulated instructions by decreasing per-thread iteration counts of loops. Our experimental results show that our approach can accelerate GPGPU architecture simulation by a factor of 88X on average and up to 589X with an average IPC relative error of 5.6%.
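The key idea stated above, shrinking per-thread loop iteration counts to cut the number of simulated instructions, can be illustrated with a small back-of-the-envelope sketch. This is not the authors' synthesis framework: the `Loop` record, the uniform `factor`, and all numbers below are hypothetical and only show how loop scaling reduces dynamic instruction count and how an IPC relative error would be computed. Note that a reduction in instruction count is only a rough stand-in for the simulation speedup actually measured in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    """Hypothetical summary of one loop in a kernel profile."""
    body_instructions: int  # instructions per iteration, per thread
    iterations: int         # per-thread iteration count


def scale_loops(loops, factor):
    """Divide each per-thread iteration count by `factor` (assumed uniform).

    Every loop keeps at least one iteration so that its body still
    appears in the miniature kernel.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [Loop(l.body_instructions, max(1, l.iterations // factor))
            for l in loops]


def dynamic_instructions(loops, threads):
    """Total loop instructions executed across all threads."""
    return threads * sum(l.body_instructions * l.iterations for l in loops)


def ipc_relative_error(ipc_original, ipc_proxy):
    """Relative IPC error of the proxy with respect to the original kernel."""
    return abs(ipc_proxy - ipc_original) / ipc_original


# Made-up example values, for illustration only.
original = [Loop(body_instructions=12, iterations=1000),
            Loop(body_instructions=30, iterations=64)]
proxy = scale_loops(original, factor=16)
threads = 4096
reduction = (dynamic_instructions(original, threads)
             / dynamic_instructions(proxy, threads))
```

Here `reduction` is the ratio of original to proxy instruction counts. Because integer division rounds each iteration count down, it can differ slightly from the nominal `factor`.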


