Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 75-87. Pub Date: 2014-12-13. DOI: 10.1109/MICRO.2014.24

J. Kim, C. Batten

Citations: 34

Abstract

Although GPGPUs are traditionally used to accelerate workloads with regular control and memory-access structure, recent work has shown that GPGPUs can also achieve significant speedups on more irregular algorithms. Data-driven implementations of irregular algorithms are algorithmically more efficient than topology-driven implementations, but memory contention and memory-access irregularity can make the former perform worse in certain cases. In this paper, we propose a novel fine-grain hardware worklist for GPGPUs that addresses the weaknesses of data-driven implementations. We detail multiple work redistribution schemes of varying complexity that can be employed to improve load balancing. Furthermore, a virtualization mechanism supports seamless spilling of work to memory. A convenient shared worklist software API is provided to simplify the use of our proposed mechanisms when implementing irregular algorithms. We evaluate challenging irregular algorithms from the Lonestar GPU benchmark suite on a cycle-level simulator. Our findings show that data-driven implementations running on a GPGPU with the hardware worklist outperform highly optimized software-based implementations of these benchmarks running on a baseline GPGPU, with speedups ranging from 1.2× to 2.4× and marginal area overhead.
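The distinction the abstract draws between topology-driven and data-driven implementations can be illustrated with a small sequential sketch. This is not the paper's hardware worklist or its software API; it is a plain Python breadth-first search written in both styles, with hypothetical function names (`bfs_topology_driven`, `bfs_data_driven`) and a `visits` counter added only to show how much work each style performs.

```python
from collections import deque

INF = float("inf")


def bfs_topology_driven(adj, source):
    """Sweep over every node each round; only frontier nodes do useful work."""
    dist = [INF] * len(adj)
    dist[source] = 0
    visits = 0  # number of node inspections, useful or not
    level = 0
    changed = True
    while changed:
        changed = False
        for u in range(len(adj)):
            visits += 1
            if dist[u] != level:
                continue  # node is not on the current frontier
            for v in adj[u]:
                if dist[v] == INF:
                    dist[v] = level + 1
                    changed = True
        level += 1
    return dist, visits


def bfs_data_driven(adj, source):
    """Process only active nodes, which are pulled from a worklist."""
    dist = [INF] * len(adj)
    dist[source] = 0
    visits = 0
    worklist = deque([source])
    while worklist:
        u = worklist.popleft()
        visits += 1
        for v in adj[u]:
            if dist[v] == INF:
                dist[v] = dist[u] + 1
                worklist.append(v)  # newly discovered work
    return dist, visits
```

On a six-node chain (`adj = [[1], [2], [3], [4], [5], []]`, source 0), both functions compute the same distances, but the topology-driven version inspects 36 nodes while the data-driven version inspects 6. On a GPGPU the shared worklist becomes a point of contention among many threads, which is the weakness of data-driven implementations that the abstract says the proposed hardware worklist targets.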

