通过可选的线程块调度提高GPGPU资源利用率

2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2014-06-19 DOI:10.1109/HPCA.2014.6835937

Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu

{"title":"通过可选的线程块调度提高GPGPU资源利用率","authors":"Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu","doi":"10.1109/HPCA.2014.6835937","DOIUrl":null,"url":null,"abstract":"High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. The thousands of threads are assigned to each core in units of CTA (Cooperative Thread Arrays) or thread blocks - with each thread block consisting of multiple warps or wavefronts. The scheduling of the threads can have significant impact on overall performance. In this work, explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling - (1) LCS (lazy CTA scheduling) which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling) where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by only measuring the number of instructions issued while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. With LCS and the observation that maximum number of CTAs does not necessary maximize performance, we also propose mixed concurrent kernel execution that enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"155","resultStr":"{\"title\":\"Improving GPGPU resource utilization through alternative thread block scheduling\",\"authors\":\"Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu\",\"doi\":\"10.1109/HPCA.2014.6835937\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. The thousands of threads are assigned to each core in units of CTA (Cooperative Thread Arrays) or thread blocks - with each thread block consisting of multiple warps or wavefronts. The scheduling of the threads can have significant impact on overall performance. In this work, explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling - (1) LCS (lazy CTA scheduling) which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling) where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by only measuring the number of instructions issued while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. With LCS and the observation that maximum number of CTAs does not necessary maximize performance, we also propose mixed concurrent kernel execution that enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.\",\"PeriodicalId\":164587,\"journal\":{\"name\":\"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"55 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"155\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2014.6835937\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2014.6835937","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 155

摘要

通过最大化并行性和充分利用可用资源来获得GPGPU工作负载下的高性能。数千个线程以CTA(合作线程阵列)或线程块的单位分配给每个内核，每个线程块由多个经线或波阵组成。线程的调度会对整体性能产生重大影响。在这项工作中，探索替代线程块或CTA调度;特别是，我们利用线程块调度器和warp调度器之间的交互来提高性能。我们探讨了线程块调度的两个方面:(1)LCS(惰性CTA调度)，它限制了分配给每个核心的线程块的最大数量，以及(2)BCS(块CTA调度)，其中连续的线程块被分配给同一个核心。对于LCS，我们利用贪婪的warp调度器来帮助通过仅测量发出的指令数量来确定线程块的最佳数量，而对于BCS，我们提出了一个替代的warp调度器，它知道分配给核心的cta的“块”。有了LCS，并且观察到cta的最大数量并不一定会使性能最大化，我们还提出了混合并发内核执行，它允许将多个内核分配到同一个内核，以最大限度地提高资源利用率并提高整体性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Improving GPGPU resource utilization through alternative thread block scheduling

High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. The thousands of threads are assigned to each core in units of CTA (Cooperative Thread Arrays) or thread blocks - with each thread block consisting of multiple warps or wavefronts. The scheduling of the threads can have significant impact on overall performance. In this work, explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling - (1) LCS (lazy CTA scheduling) which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling) where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by only measuring the number of instructions issued while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. With LCS and the observation that maximum number of CTAs does not necessary maximize performance, we also propose mixed concurrent kernel execution that enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)

自引率

0.00%

发文量