A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs

David Troendle, T. Ta, B. Jang
{"title":"A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs","authors":"David Troendle, T. Ta, B. Jang","doi":"10.1145/3337821.3337837","DOIUrl":null,"url":null,"abstract":"The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphic Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue queue operations never suffer from retry overhead because the atomic operation does not fail and the queue empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only ones based on traditional queues but also the state-of-the-art BFS implementations found in the literature by a minimum of 1.26× and maximum of 36.23×. We also observed the scalability of our proposed queue is within 10% of the ideal linear speedup for up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337837","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphics Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent-thread task scheduler. The proposed concurrent queue has two novel properties: 1) the enqueue and dequeue operations never suffer retry overhead, because the underlying atomic operation cannot fail and the queue-empty exception has been refactored away; and 2) the queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce the thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth-First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue and 2) two traditional concurrent queues, and analyzed their performance and scalability under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only those based on traditional queues but also state-of-the-art BFS implementations found in the literature, by a minimum of 1.26× and a maximum of 36.23×. We also observed that the scalability of our proposed queue is within 10% of ideal linear speedup, up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiments).
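The two properties highlighted in the abstract can be illustrated with a minimal CUDA sketch. The code below is not the authors' implementation: the structure names (`Frontier`, `frontier_append`, `frontier_grab`) and the level-synchronous, BFS-style frontier buffers are assumptions made purely for illustration. It shows the core idea of a warp-level "proxy thread": a single lane per warp issues one atomic fetch-and-add that reserves slots for every active lane, so individual threads never contend on the counter and never retry, and running out of work is handled as an ordinary return value rather than a failed atomic.

```cuda
// Illustrative sketch only (assumed names and data layout, not the paper's queue).
// One proxy lane per warp performs the single atomicAdd on behalf of the warp.
#include <cstdint>
#include <cuda_runtime.h>

struct Frontier {
    int      *items;   // vertex IDs queued for the next BFS level
    uint32_t *count;   // number of items appended so far
};

// Warp-cooperative enqueue: must be called by all 32 lanes of the warp
// (lanes with nothing to add pass active = false).
__device__ void frontier_append(Frontier out, int item, bool active)
{
    unsigned mask   = __ballot_sync(0xFFFFFFFFu, active);
    int      lane   = threadIdx.x & 31;
    int      rank   = __popc(mask & ((1u << lane) - 1)); // my position among active lanes
    int      leader = __ffs(mask) - 1;                   // proxy lane for this warp

    uint32_t base = 0;
    if (mask != 0) {
        if (lane == leader)
            base = atomicAdd(out.count, __popc(mask));    // one atomic per warp; cannot fail, never retried
        base = __shfl_sync(0xFFFFFFFFu, base, leader);    // broadcast reserved base index
    }
    if (active)
        out.items[base + rank] = item;
}

// Warp-cooperative dequeue from the current level's frontier, whose size is
// fixed while the level is processed. The proxy lane claims a warp-sized chunk
// with one atomicAdd; lanes past the end simply report "no work" -- the empty
// case is a normal return value, not a failed or retried operation.
__device__ bool frontier_grab(const int *in, uint32_t in_size,
                              uint32_t *cursor, int *item)
{
    int      lane = threadIdx.x & 31;
    uint32_t base = 0;

    if (lane == 0)
        base = atomicAdd(cursor, 32u);                    // claim 32 slots for the whole warp
    base = __shfl_sync(0xFFFFFFFFu, base, 0);

    uint32_t idx = base + (uint32_t)lane;
    if (idx >= in_size)                                   // overshooting the fixed size is harmless
        return false;
    *item = in[idx];
    return true;
}
```

In a persistent-thread kernel, each warp would loop on `frontier_grab`, expand the neighbors of the vertex it received, and push newly discovered vertices with `frontier_append`. Because every warp performs one atomic per operation instead of up to 32, contention on the shared counters drops by roughly the warp width, which is the kind of reduction the abstract attributes to the proxy-thread design.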