A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
Zheming Jin, J. Vetter
Workshop Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29
DOI: 10.1145/3547276.3548627 (https://doi.org/10.1145/3547276.3548627)
Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. To better understand these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and the time of a reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit in an AMD MI100 GPU. We then describe implementations of the reduction kernels that use vectorized memory accesses, parameterized workload sizes, and the vendor's library APIs. Our experimental results show that 1) increasing the size of a work-group trades the cost of barrier synchronization against the amount of parallelism from atomic operations over shared local memory; 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs; 3) the compiler needs to assist the hardware processor with data dependency resolution at the level of the instruction set architecture; and 4) the power consumption of the kernel execution on the GPU fluctuates between 277 Watts and 301 Watts, and the dynamic power of other GPU activities is at most 31 Watts.
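The abstract does not include the kernels themselves, so the following is only a host-side C++ sketch of the general pattern it describes: each work-item issues one atomic addition into a per-group accumulator, a synchronization point follows, and the group result is then added atomically to a global total. It is a CPU analogue, not the authors' HIP code. The function name `two_level_atomic_sum`, the use of `std::thread` as a stand-in for a work-item, and the strided loop are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the paper): CPU analogue of a two-level
// atomic sum reduction. One std::thread plays the role of one work-item.
long long two_level_atomic_sum(const std::vector<int>& data,
                               std::size_t num_groups,
                               std::size_t group_size) {
    assert(num_groups > 0 && group_size > 0);

    // Stands in for the result location in global memory.
    std::atomic<long long> global_sum{0};
    const std::size_t total_items = num_groups * group_size;

    // Groups run one after another here; on a GPU they would run concurrently.
    for (std::size_t g = 0; g < num_groups; ++g) {
        // Stands in for the accumulator in shared local memory.
        std::atomic<long long> local_sum{0};

        std::vector<std::thread> items;
        items.reserve(group_size);
        for (std::size_t t = 0; t < group_size; ++t) {
            items.emplace_back([&, g, t] {
                // Each work-item first sums its strided slice privately.
                long long partial = 0;
                for (std::size_t i = g * group_size + t; i < data.size();
                     i += total_items) {
                    partial += data[i];
                }
                // One atomic addition per work-item into the group accumulator.
                local_sum.fetch_add(partial, std::memory_order_relaxed);
            });
        }

        // Joining all threads plays the role of the work-group barrier.
        for (auto& item : items) {
            item.join();
        }

        // One atomic addition per work-group into the global total.
        global_sum.fetch_add(local_sum.load(), std::memory_order_relaxed);
    }
    return global_sum.load();
}
```

Calling `two_level_atomic_sum(values, 4, 8)` sums `values` with 4 emulated work-groups of 8 work-items each. Even in this analogue, a larger `group_size` means more atomic additions contend on one accumulator and more threads must be waited on before the group result can be published, which loosely mirrors the tradeoff the abstract reports for larger work-groups.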