Collaborative Coalescing of Redundant Memory Access for GPU System
Fan Jiang, Chengeng Li, Wei Zhang, Jiang Xu
2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 195-200
Published: 2024-01-22
DOI: 10.1109/ASP-DAC58780.2024.10473837 (https://doi.org/10.1109/ASP-DAC58780.2024.10473837)
Citation count: 0
Abstract
GPU-based computing serves as the primary solution driving the performance of HPC systems. However, modern GPU systems encounter performance bottlenecks resulting from heavy memory access traffic and insufficient network-on-chip (NoC) bandwidth. In this work, we propose a collaborative coalescing mechanism that eliminates redundant memory accesses and boosts GPU system performance. To achieve this, we design a coalescing unit for each memory partition that merges requests from streaming multiprocessors (SMs) both across clusters and within a cluster. Additionally, we introduce a hierarchical multicast module that replicates the coalesced reply messages and distributes them to multiple destination SMs. Experimental results show that our method achieves a 20.6% performance improvement and a 27.1% reduction in NoC traffic over the baseline.
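The abstract gives no implementation details, so the following is only a toy sketch of the general idea, not the authors' design. It models one coalescing unit per memory partition that merges reads to the same cache line and then groups the waiting SMs by cluster for a two-level reply. The class name `CoalescingUnit`, its methods, the 128-byte line size, and the per-cluster grouping are all assumptions made for illustration.

```python
from collections import defaultdict

LINE_SIZE = 128  # bytes per cache line (assumed value, not from the paper)


class CoalescingUnit:
    """Toy model of a per-partition coalescing unit (hypothetical)."""

    def __init__(self):
        # line index -> {cluster_id: set of sm_ids} waiting for that line
        self.pending = {}
        self.memory_reads = 0  # accesses actually issued to memory

    def request(self, cluster_id, sm_id, addr):
        """Register a read. Return True if it triggers a new memory access."""
        line = addr // LINE_SIZE
        if line in self.pending:
            # Redundant request: merge it into the outstanding entry.
            self.pending[line][cluster_id].add(sm_id)
            return False
        waiters = defaultdict(set)
        waiters[cluster_id].add(sm_id)
        self.pending[line] = waiters
        self.memory_reads += 1
        return True

    def reply(self, line):
        """Two-level multicast plan: one entry per cluster, then its SMs."""
        waiters = self.pending.pop(line)
        return {cluster: sorted(sms) for cluster, sms in waiters.items()}


unit = CoalescingUnit()
unit.request(0, 0, 0x1000)  # first access to the line, goes to memory
unit.request(0, 1, 0x1010)  # same line, same cluster: merged
unit.request(1, 4, 0x1040)  # same line, other cluster: merged
unit.request(1, 5, 0x2000)  # different line, goes to memory
plan = unit.reply(0x1000 // LINE_SIZE)
```

In this sketch, four requests produce two memory accesses, and the reply for the shared line is sent once per cluster before being replicated to the SMs in that cluster. A real design would also need to bound the pending table and handle writes and timing, which this model ignores.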