基于GPU性能预测的高级硬件特征提取

Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, S. Gorlatch, Christophe Dubach
{"title":"基于GPU性能预测的高级硬件特征提取","authors":"Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, S. Gorlatch, Christophe Dubach","doi":"10.1145/3366428.3380769","DOIUrl":null,"url":null,"abstract":"High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"149 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"High-level hardware feature extraction for GPU performance prediction of stencils\",\"authors\":\"Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, S. Gorlatch, Christophe Dubach\",\"doi\":\"10.1145/3366428.3380769\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.\",\"PeriodicalId\":266831,\"journal\":{\"name\":\"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit\",\"volume\":\"149 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-02-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3366428.3380769\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366428.3380769","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

高级函数式编程抽象已经开始在高性能计算(HPC)中显示出有希望的结果。Lift、Futhark或Delite等方法已经表明,即使对于像模板这样的高性能计算工作负载,也可以同时拥有高级抽象和性能。此外,这些高级函数抽象也可用于在编译器本身中表示程序及其优化的变体。然而,这种高级方法严重依赖编译器来优化程序,这在针对gpu时是出了名的困难。编译器要么使用手工设计的启发式来指导优化,要么使用迭代编译来搜索优化空间。第一种方法的编译时间很快,但是,它不能跨不同设备进行性能移植,并且需要大量人力来构建启发式方法。迭代编译则具有自动搜索优化空间的能力,能够适应不同的设备。然而,这个过程通常非常耗时,因为必须评估数千个变体。提出了基于统计技术的性能模型,以加快优化空间探索。然而,它们依赖于低级硬件特性,以性能计数器或低级静态代码特性的形式。使用Lift框架,本文演示了如何直接从高级功能表示中提取低级的、特定于gpu的特性。提升IR(中间表示)实际上是一个非常合适的选择,因为所有的优化选择都暴露在IR级别。本文展示了如何提取低级特征,例如每次warp访问的唯一缓存行数量,这对于构建准确的GPU性能模型至关重要。使用这种方法,我们能够在许多模板应用程序中,在AMD GPU上加速2000倍的空间探索,在Nvidia上平均加速450倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
High-level hardware feature extraction for GPU performance prediction of stencils
High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems GPGPU performance estimation for frequency scaling using cross-benchmarking Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL Automatic generation of specialized direct convolutions for mobile GPUs High-level hardware feature extraction for GPU performance prediction of stencils
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1