{"title":"A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs","authors":"David Troendle, T. Ta, B. Jang","doi":"10.1145/3337821.3337837","DOIUrl":null,"url":null,"abstract":"The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphic Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue queue operations never suffer from retry overhead because the atomic operation does not fail and the queue empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only ones based on traditional queues but also the state-of-the-art BFS implementations found in the literature by a minimum of 1.26× and maximum of 36.23×. We also observed the scalability of our proposed queue is within 10% of the ideal linear speedup for up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337837","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
The persistent-thread model offers a viable approach to accelerating data-irregular workloads on Graphics Processing Units (GPUs). However, as the number of active threads grows, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent-thread task scheduler. The proposed concurrent queue has two novel properties: 1) the enqueue and dequeue operations never incur retry overhead, because the underlying atomic operation cannot fail and the queue-empty exception has been refactored away; and 2) the queue operates on an arbitrary number of entries at the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of the group's threads. Together, these two properties substantially reduce the thread contention induced by the GPU's lock-step Single Instruction, Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth-First Search (BFS) based on the persistent-thread model using 1) the proposed concurrent queue and 2) two traditional concurrent queues, and analyzed its performance and scalability under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only those based on traditional queues but also state-of-the-art BFS implementations reported in the literature, by a minimum of 1.26× and a maximum of 36.23×. We also observed that the scalability of the proposed queue stays within 10% of ideal linear speedup up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiments).
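The abstract's two queue properties lend themselves to a short illustration. The CUDA sketch below is our own reconstruction under assumed names (TaskQueue, warpEnqueue, the filled flags), not the authors' implementation: a single proxy lane per warp issues one fetch-and-add to reserve slots for every lane that has work, so the reservation cannot fail (there is no compare-and-swap retry loop) and an n-entry enqueue costs the same single atomic as a one-entry enqueue.

```cuda
// Hypothetical sketch of the ideas described in the abstract;
// all names are our own, not the authors' code.
struct TaskQueue {
    int*          slots;    // ring buffer of task IDs
    unsigned int* filled;   // per-slot "ready" flags (0 = empty, 1 = full)
    unsigned int* head;     // dequeue counter
    unsigned int* tail;     // enqueue counter
    unsigned int  capacity; // power of two, so (i & (capacity - 1)) wraps
};

// Warp-aggregated enqueue. Must be called convergently by all 32 lanes
// of the warp; lanes with nothing to push pass active = false. One proxy
// lane issues a single fetch-and-add that reserves slots for every active
// lane: fetch-and-add always succeeds, so there is no retry path, and a
// multi-entry enqueue costs the same one atomic as a single-entry one.
__device__ void warpEnqueue(TaskQueue q, int item, bool active)
{
    unsigned mask  = __ballot_sync(0xFFFFFFFFu, active); // who has an item?
    int      count = __popc(mask);
    if (count == 0) return;

    int lane   = threadIdx.x & 31;
    int leader = __ffs(mask) - 1;                      // proxy lane
    unsigned int base = 0;
    if (lane == leader)
        base = atomicAdd(q.tail, (unsigned)count);     // one FAA for the warp
    base = __shfl_sync(0xFFFFFFFFu, base, leader);     // broadcast the range

    if (active) {
        // Each active lane writes its private slot in the reserved range.
        int rank = __popc(mask & ((1u << lane) - 1));
        unsigned int i = (base + rank) & (q.capacity - 1);
        q.slots[i] = item;
        __threadfence();              // make the payload visible first,
        atomicExch(&q.filled[i], 1u); // then publish the slot
    }
}
```

The per-slot filled flag plays the role of the handshake that lets the counters advance unconditionally: a consumer that has claimed a ticket simply waits for its slot to be published rather than retrying a failed atomic.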
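To show how such a queue could drive a persistent-thread BFS, here is a schematic consumer loop, again hypothetical and simplified rather than the paper's implementation. It assumes level[] is initialized to -1 except for the source vertex (level 0), the source is pre-enqueued with *workLeft = 1, the queue capacity is at least the vertex count (the CAS on level[] enqueues each vertex at most once, so slots are never reused), and a GPU with independent thread scheduling (Volta or newer) so the spin loops are safe. For readability every thread issues its own atomics here; in the paper's design the warp's proxy thread would aggregate them.

```cuda
// Schematic persistent-thread BFS consumer (our illustration). The kernel
// is launched once with enough blocks to fill the GPU; each thread loops,
// claiming tickets with fetch-and-add until all frontiers are drained.
__global__ void persistentBFS(TaskQueue q, const int* rowPtr,
                              const int* colIdx, int* level, int* workLeft)
{
    for (;;) {
        if (atomicAdd(workLeft, 0) == 0) return;    // all work drained
        unsigned int h = atomicAdd(q.head, 1u);     // claim a ticket: no retry
        unsigned int i = h & (q.capacity - 1);
        while (atomicAdd(&q.filled[i], 0u) == 0)    // wait for slot h to fill
            if (atomicAdd(workLeft, 0) == 0) return;

        int v = q.slots[i];
        int d = level[v] + 1;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (atomicCAS(&level[u], -1, d) == -1) { // first visit wins
                atomicAdd(workLeft, 1);              // child now outstanding
                unsigned int t = atomicAdd(q.tail, 1u);
                unsigned int j = t & (q.capacity - 1);
                q.slots[j] = u;
                __threadfence();
                atomicExch(&q.filled[j], 1u);
            }
        }
        atomicSub(workLeft, 1);                      // v finished
    }
}
```

The workLeft counter stands in for the refactored queue-empty exception the abstract mentions: a thread that claims a ticket no task will ever fill simply observes workLeft reach zero and exits, instead of treating the empty queue as a failed dequeue that must be retried.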