The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU

Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205291

B. Kerbl, Michael Kenzel, J. H. Mueller, D. Schmalstieg, M. Steinberger


Citations: 15

Abstract

Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they either (1) do not work well in a massively parallel environment, or (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than nonblocking queues and can even outperform significantly simpler techniques that lack desired properties for fine-granular work distribution.
