{"title":"A Survey of GPGPU Parallel Processing Architecture Performance Optimization","authors":"Shiwei Jia, Z. Tian, Yueyuan Ma, Chenglu Sun, Yimen Zhang, Yuming Zhang","doi":"10.1109/icisfall51598.2021.9627400","DOIUrl":null,"url":null,"abstract":"General purpose graphic processor unit (GPGPU) supports various applications' execution in different fields with high-performance computing capability due to its powerful parallel processing architecture. However, GPGPU parallel processing architecture also has the “memory wall” issue. When memory access in application is intensive or irregular, memory resource competition occurs and then degrade the performance of memory system. In addition, with multithreads' requirement for different on-chip resources such as register and warp slot being inconsistant, as well as the branch divergence irregular computing applications, the development of thread level parallelism (TLP) is severely restrited. Due to the restrictions of memory access and TLP, the acceleration capability of GPGPU large-scale parallel processing architecture has not been developed effectively. Alleviating memory resource contention and improving TLP is the performance optimization hotspot for current GPGPU architecture. In this paper we research how memory access optimization and TLP improvement could contribute to the optimization of parallel processing architecture performance. First we find that memory access optimization could be accomplished by three ways: reducing the number of global memory access, improving memory access latency hiding capability and optimizing cache subsystem performance. Then in order to improve TLP, optimizing thread allocation scheme, developing data approximation and redundancy, as well as compacting branch divergence, researches of these three aspects are surveyed. We also analyze the working mechanism, advantages and challenges of each research. At the end, we suggest the direction of future GPGPU parallel processing architecture optimization.","PeriodicalId":240142,"journal":{"name":"2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icisfall51598.2021.9627400","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
The general-purpose graphics processing unit (GPGPU) supports the execution of applications across many fields with high-performance computing capability, thanks to its powerful parallel processing architecture. However, the GPGPU parallel processing architecture also faces the “memory wall” problem. When an application's memory accesses are intensive or irregular, contention for memory resources arises and degrades the performance of the memory system. In addition, because threads place uneven demands on different on-chip resources such as registers and warp slots, and because irregular computing applications suffer from branch divergence, the development of thread-level parallelism (TLP) is severely restricted. Owing to these memory access and TLP restrictions, the acceleration capability of the GPGPU's large-scale parallel processing architecture has not been exploited effectively. Alleviating memory resource contention and improving TLP are therefore the performance optimization hotspots for current GPGPU architectures. In this paper we study how memory access optimization and TLP improvement contribute to optimizing parallel processing architecture performance. First, we find that memory access optimization can be accomplished in three ways: reducing the number of global memory accesses, improving the capability to hide memory access latency, and optimizing cache subsystem performance. Then, to improve TLP, we survey research on three aspects: optimizing the thread allocation scheme, exploiting data approximation and redundancy, and compacting branch divergence. We also analyze the working mechanism, advantages, and challenges of each line of research. Finally, we suggest directions for future optimization of GPGPU parallel processing architectures.
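As a concrete illustration of the first memory optimization direction named in the abstract, the sketch below (a minimal CUDA example of our own, not code from the paper) uses shared-memory tiling in a matrix multiply so that each value fetched from global memory is reused TILE times on chip, directly reducing the number of global memory accesses. It assumes square matrices with n divisible by TILE and a launch with (TILE, TILE) thread blocks, one per output tile.

```cuda
// Illustrative sketch, assuming n % TILE == 0 and a grid of
// (n/TILE, n/TILE) blocks of (TILE, TILE) threads. Each A/B element is
// loaded from global memory once per tile and reused TILE times from
// shared memory, cutting global memory traffic by roughly a factor of TILE.
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // on-chip staging for a tile of A
    __shared__ float Bs[TILE][TILE];   // on-chip staging for a tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // One coalesced global load per thread per tile...
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...then TILE reuses of each staged value from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```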
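On the TLP side, the branch divergence compaction work the abstract refers to is largely implemented in hardware (e.g., regrouping threads into non-divergent warps). The hedged CUDA sketch below, again our own illustration rather than the paper's method, shows the software-visible problem and a common software analogue: reorganizing the thread-to-data mapping so that each warp takes a single path. Kernel names are hypothetical, and the second kernel assumes the data has been reordered so that all elements handled by one warp require the same operation.

```cuda
// Illustrative sketch (our example, not from the paper). In the first
// kernel, lanes of the same warp take both branches, so the hardware
// serializes the two paths and half the lanes idle in each.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)                  // even/odd lanes disagree within a warp
        x[i] *= 2.0f;                // half the warp idles here...
    else
        x[i] += 1.0f;                // ...and the other half idles here
}

// Assumes the input was reorganized so items needing the same operation
// are contiguous per warp; the branch condition is then warp-uniform.
__global__ void warp_uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)     // same value for all lanes of a warp
        x[i] *= 2.0f;                // whole warp takes one path, no divergence
    else
        x[i] += 1.0f;
}
```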