Metrics and Design of an Instruction Roofline Model for AMD GPUs

IF 1.2 Q3 COMPUTER SCIENCE, THEORY & METHODS ACM Transactions on Parallel Computing Pub Date : 2021-10-15 DOI:10.1145/3505285

M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran

{"title":"Metrics and Design of an Instruction Roofline Model for AMD GPUs","authors":"M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran","doi":"10.1145/3505285","DOIUrl":null,"url":null,"abstract":"Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 14"},"PeriodicalIF":1.2000,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Parallel Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3505285","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 4

Abstract

Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

AMD GPU指令屋顶线模型的度量与设计

由于最近Frontier超级计算机的发布，许多科学应用程序开发人员正在努力使他们的应用程序与AMD (CPU- gpu)架构兼容，这意味着远离传统的CPU和NVIDIA-GPU系统。由于目前AMD gpu的分析工具的局限性，这种转变在如何测量AMD gpu上的应用程序性能方面留下了空白。在本文中，我们使用AMD的ROCProfiler和基准测试工具BabelStream (HIP实现)为AMD gpu设计了一个指令线模型，作为在新的AMD硬件上测量应用程序在指令和内存事务方面的性能的一种方法。具体来说，我们为一个案例研究科学应用程序PIConGPU创建了指令线模型，PIConGPU是一个开源的粒子单元模拟应用程序，用于NVIDIA V100、AMD Radeon Instinct MI60和AMD Instinct MI100 gpu上的等离子体和激光等离子体物理。当在PIConGPU中查看多个感兴趣的内核的性能时，我们发现尽管AMD MI100 GPU实现了与NVIDIA V100 GPU相似或更好的执行时间，但分析工具的差异使得比较这两种架构的性能变得困难。在查看执行时间、GIPS和指令强度时，AMD MI60在本工作中使用的三种gpu中性能最差。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊