GPU 上 Affine 程序的能量感知瓦片尺寸选择

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) Pub Date : 2024-03-02 DOI:10.1109/CGO57630.2024.10444795

Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli

{"title":"GPU 上 Affine 程序的能量感知瓦片尺寸选择","authors":"Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli","doi":"10.1109/CGO57630.2024.10444795","DOIUrl":null,"url":null,"abstract":"Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5 × and 1.2 × improvement and obtain up to 6.3 × improvement on non-Polybench high-dimensional affine kernels.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"62 11","pages":"13-27"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Energy-Aware Tile Size Selection for Affine Programs on GPUs\",\"authors\":\"Malith Jayaweera, Martin Kong, Yanzhi Wang, D. Kaeli\",\"doi\":\"10.1109/CGO57630.2024.10444795\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5 × and 1.2 × improvement and obtain up to 6.3 × improvement on non-Polybench high-dimensional affine kernels.\",\"PeriodicalId\":517814,\"journal\":{\"name\":\"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)\",\"volume\":\"62 11\",\"pages\":\"13-27\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CGO57630.2024.10444795\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CGO57630.2024.10444795","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

循环平铺是一种高阶变换，用于提高数据局部性和性能。虽然之前的工作已经考虑了它在多个领域和架构中的应用，但它对能效的潜在影响在很大程度上被忽视了。在这项工作中，我们针对 GPU 的仿射程序提出了一种节能瓦片大小选择方案（EATSS）。我们自动推导仿射程序的非线性整数公式，并使用 Z3 求解器找到有效的磁贴大小，以满足架构资源限制，同时最大限度地提高性能和降低能耗。我们的方法基于这样一种见解，即减少缓存内数据的延迟，同时利用自动功率缩放，可以大幅提高性能和能效。我们在英伟达 Xavier 和 GA100 GPU 上对 EATSS 进行了评估，并报告了在几个仿射内核上相对于 PPCG 的每瓦性能改进中值。在 Polybench 内核上，我们实现了 1.5 倍和 1.2 倍的改进，在非 Polybench 高维仿射内核上，我们实现了高达 6.3 倍的改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Energy-Aware Tile Size Selection for Affine Programs on GPUs

Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On Polybench kernels, we achieve 1.5 × and 1.2 × improvement and obtain up to 6.3 × improvement on non-Polybench high-dimensional affine kernels.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

自引率

0.00%

发文量

期刊最新文献

PresCount: Effective Register Allocation for Bank Conflict Reduction High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM CGO 2024 Organization SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems Representing Data Collections in an SSA Form