具有内存级和线程级并行意识的GPU架构的分析模型

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-20 DOI:10.1145/1555754.1555775

Sunpyo Hong, Hyesoon Kim

{"title":"具有内存级和线程级并行意识的GPU架构的分析模型","authors":"Sunpyo Hong, Hyesoon Kim","doi":"10.1145/1555754.1555775","DOIUrl":null,"url":null,"abstract":"GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications.\n To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"28 1","pages":"152-163"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"699","resultStr":"{\"title\":\"An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness\",\"authors\":\"Sunpyo Hong, Hyesoon Kim\",\"doi\":\"10.1145/1555754.1555775\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications.\\n To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.\",\"PeriodicalId\":91388,\"journal\":{\"name\":\"Proceedings. International Symposium on Computer Architecture\",\"volume\":\"28 1\",\"pages\":\"152-163\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"699\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. International Symposium on Computer Architecture\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1555754.1555775\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. International Symposium on Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1555754.1555775","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 699

摘要

GPU架构由于其大量并行处理器而在多核时代变得越来越重要。对软件工程师来说，编程数千个大规模并行线程是一个巨大的挑战，但了解GPU架构上这些并行程序的性能瓶颈以提高应用程序的性能则更加困难。当前的方法依赖于程序员在不完全了解应用程序的性能特征的情况下，通过充分利用设计空间来调优应用程序。为了深入了解GPU架构上并行应用程序的性能瓶颈，我们提出了一个简单的分析模型，用于估计大规模并行程序的执行时间。我们模型的关键组件是通过考虑运行线程的数量和内存带宽来估计并行内存请求的数量(我们称之为内存翘曲并行性)。基于内存翘曲并行度，该模型估计内存请求的成本，从而估计程序的总体执行时间。将模型结果与几种GPU的实际执行时间进行比较，结果表明，我们的模型在微基准测试中的绝对误差几何平均值为5.4%，在GPU计算应用中的绝对误差几何平均值为13.3%。所有应用程序都是用CUDA编程语言编写的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings. International Symposium on Computer Architecture

自引率

0.00%

发文量

期刊最新文献

ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022 Special-purpose and future architectures Computer memory systems Basics of the central processing unit FRONT MATTER