{"title":"A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs","authors":"David Troendle, T. Ta, B. Jang","doi":"10.1145/3337821.3337837","DOIUrl":null,"url":null,"abstract":"The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphic Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue queue operations never suffer from retry overhead because the atomic operation does not fail and the queue empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only ones based on traditional queues but also the state-of-the-art BFS implementations found in the literature by a minimum of 1.26× and maximum of 36.23×. We also observed the scalability of our proposed queue is within 10% of the ideal linear speedup for up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337837","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
The persistent-thread model offers a viable approach to accelerating data-irregular workloads on Graphics Processing Units (GPUs). However, as the number of active threads grows, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent-thread task scheduler. The proposed concurrent queue has two novel properties: 1) the enqueue and dequeue operations never incur retry overhead, because the underlying atomic operation cannot fail and the queue-empty exception has been refactored away; and 2) the queue operates on an arbitrary number of entries at the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of the group's threads. Together, these two properties substantially reduce the thread contention induced by the GPU's lock-step Single Instruction, Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth-First Search (BFS) based on the persistent-thread model using 1) the proposed concurrent queue and 2) two traditional concurrent queues, and analyzed its performance and scalability under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only those based on traditional queues but also state-of-the-art BFS implementations reported in the literature, by a minimum of 1.26× and a maximum of 36.23×. We also observed that the scalability of the proposed queue stays within 10% of ideal linear speedup up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiments).
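The abstract's two queue properties lend themselves to a short illustration. The CUDA sketch below is our own reconstruction under assumed names (TaskQueue, warpEnqueue, the filled flags), not the authors' implementation: a single proxy lane per warp issues one fetch-and-add to reserve slots for every lane that has work, so the reservation cannot fail (there is no compare-and-swap retry loop) and an n-entry enqueue costs the same single atomic as a one-entry enqueue.

```cuda
// Hypothetical sketch of the ideas described in the abstract;
// all names are our own, not the authors' code.
struct TaskQueue {
    int*          slots;    // ring buffer of task IDs
    unsigned int* filled;   // per-slot "ready" flags (0 = empty, 1 = full)
    unsigned int* head;     // dequeue counter
    unsigned int* tail;     // enqueue counter
    unsigned int  capacity; // power of two, so (i & (capacity - 1)) wraps
};

// Warp-aggregated enqueue. Must be called convergently by all 32 lanes
// of the warp; lanes with nothing to push pass active = false. One proxy
// lane issues a single fetch-and-add that reserves slots for every active
// lane: fetch-and-add always succeeds, so there is no retry path, and a
// multi-entry enqueue costs the same one atomic as a single-entry one.
__device__ void warpEnqueue(TaskQueue q, int item, bool active)
{
    unsigned mask  = __ballot_sync(0xFFFFFFFFu, active); // who has an item?
    int      count = __popc(mask);
    if (count == 0) return;

    int lane   = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;                      // proxy lane
    unsigned int base = 0;
    if (lane == leader)
        base = atomicAdd(q.tail, (unsigned)count);     // one FAA for the warp
    base = __shfl_sync(0xFFFFFFFFu, base, leader);     // broadcast the range

    if (active) {
        // Each active lane writes its private slot in the reserved range.
        int rank = __popc(mask & ((1u << lane) - 1));
        unsigned int i = (base + rank) & (q.capacity - 1);
        q.slots[i] = item;
        __threadfence();              // make the payload visible first,
        atomicExch(&q.filled[i], 1u); // then publish the slot
    }
}
```

The per-slot filled flag plays the role of the handshake that lets the counters advance unconditionally: a consumer that has claimed a ticket simply waits for its slot to be published rather than retrying a failed atomic.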
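To show how such a queue could drive a persistent-thread BFS, here is a schematic consumer loop, again hypothetical and simplified rather than the paper's implementation. It assumes level[] is initialized to -1 except for the source vertex (level 0), the source is pre-enqueued with *workLeft = 1, the queue capacity is at least the vertex count (the CAS on level[] enqueues each vertex at most once, so slots are never reused), and a GPU with independent thread scheduling (Volta or newer) so the spin loops are safe. For readability every thread issues its own atomics here; in the paper's design the warp's proxy thread would aggregate them.

```cuda
// Schematic persistent-thread BFS consumer (our illustration). The kernel
// is launched once with enough blocks to fill the GPU; each thread loops,
// claiming tickets with fetch-and-add until all frontiers are drained.
__global__ void persistentBFS(TaskQueue q, const int* rowPtr,
                              const int* colIdx, int* level, int* workLeft)
{
    for (;;) {
        if (atomicAdd(workLeft, 0) == 0) return;    // all work drained
        unsigned int h = atomicAdd(q.head, 1u);     // claim a ticket: no retry
        unsigned int i = h & (q.capacity - 1);
        while (atomicAdd(&q.filled[i], 0u) == 0)    // wait for slot h to fill
            if (atomicAdd(workLeft, 0) == 0) return;

        int v = q.slots[i];
        int d = level[v] + 1;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (atomicCAS(&level[u], -1, d) == -1) { // first visit wins
                atomicAdd(workLeft, 1);              // child now outstanding
                unsigned int t = atomicAdd(q.tail, 1u);
                unsigned int j = t & (q.capacity - 1);
                q.slots[j] = u;
                __threadfence();
                atomicExch(&q.filled[j], 1u);
            }
        }
        atomicSub(workLeft, 1);                      // v finished
    }
}
```

The workLeft counter stands in for the refactored queue-empty exception the abstract mentions: a thread that claims a ticket no task will ever fill simply observes workLeft reach zero and exits, instead of treating the empty queue as a failed dequeue that must be retried.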