A Survey of GPGPU Parallel Processing Architecture Performance Optimization

Shiwei Jia, Z. Tian, Yueyuan Ma, Chenglu Sun, Yimen Zhang, Yuming Zhang
{"title":"A Survey of GPGPU Parallel Processing Architecture Performance Optimization","authors":"Shiwei Jia, Z. Tian, Yueyuan Ma, Chenglu Sun, Yimen Zhang, Yuming Zhang","doi":"10.1109/icisfall51598.2021.9627400","DOIUrl":null,"url":null,"abstract":"General purpose graphic processor unit (GPGPU) supports various applications' execution in different fields with high-performance computing capability due to its powerful parallel processing architecture. However, GPGPU parallel processing architecture also has the “memory wall” issue. When memory access in application is intensive or irregular, memory resource competition occurs and then degrade the performance of memory system. In addition, with multithreads' requirement for different on-chip resources such as register and warp slot being inconsistant, as well as the branch divergence irregular computing applications, the development of thread level parallelism (TLP) is severely restrited. Due to the restrictions of memory access and TLP, the acceleration capability of GPGPU large-scale parallel processing architecture has not been developed effectively. Alleviating memory resource contention and improving TLP is the performance optimization hotspot for current GPGPU architecture. In this paper we research how memory access optimization and TLP improvement could contribute to the optimization of parallel processing architecture performance. First we find that memory access optimization could be accomplished by three ways: reducing the number of global memory access, improving memory access latency hiding capability and optimizing cache subsystem performance. Then in order to improve TLP, optimizing thread allocation scheme, developing data approximation and redundancy, as well as compacting branch divergence, researches of these three aspects are surveyed. We also analyze the working mechanism, advantages and challenges of each research. At the end, we suggest the direction of future GPGPU parallel processing architecture optimization.","PeriodicalId":240142,"journal":{"name":"2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icisfall51598.2021.9627400","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The general-purpose graphics processing unit (GPGPU) supports the execution of a wide range of applications across different fields with high-performance computing capability, thanks to its powerful parallel processing architecture. However, the GPGPU parallel processing architecture also suffers from the "memory wall" problem. When an application's memory accesses are intensive or irregular, contention for memory resources occurs and degrades the performance of the memory system. In addition, because multiple threads place inconsistent demands on different on-chip resources such as registers and warp slots, and because irregular computing applications exhibit branch divergence, the exploitation of thread-level parallelism (TLP) is severely restricted. Due to these restrictions on memory access and TLP, the acceleration capability of the GPGPU's large-scale parallel processing architecture has not been exploited effectively. Alleviating memory resource contention and improving TLP are therefore the performance optimization hotspots for current GPGPU architectures. In this paper, we study how memory access optimization and TLP improvement can contribute to optimizing parallel processing architecture performance. First, we find that memory access optimization can be accomplished in three ways: reducing the number of global memory accesses, improving the ability to hide memory access latency, and optimizing cache subsystem performance. Then, to improve TLP, we survey research along three lines: optimizing thread allocation schemes, exploiting data approximation and redundancy, and compacting branch divergence. We also analyze the working mechanism, advantages, and challenges of each line of research. Finally, we suggest directions for future GPGPU parallel processing architecture optimization.
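As a concrete illustration of the first category above, reducing the number of global memory accesses, the sketch below shows a common shared-memory tiling pattern for matrix multiplication on a GPGPU. It is a minimal example of the general technique, not code from the surveyed paper; the kernel name, tile width, and the assumption that the matrix dimension is a multiple of the tile width are illustrative choices.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width, matching a 16x16 thread block

// In a naive kernel, each thread reads 2*N values from global memory.
// Here, threads in a block cooperatively stage TILE x TILE tiles of A and B
// in shared memory, so each global element is loaded once per tile and then
// reused TILE times, cutting global memory traffic by roughly a factor of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Assumes N is a multiple of TILE, to keep the sketch short.
    for (int t = 0; t < N / TILE; ++t) {
        // One coalesced global load per thread per tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the tile, served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

A matching launch configuration would be, for example, `dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE); matmul_tiled<<<grid, block>>>(dA, dB, dC, N);`, where `dA`, `dB`, and `dC` are device allocations of size N*N floats. The same reuse idea underlies many of the global-memory-reduction schemes discussed in the survey.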