PufferFish: NUMA-Aware Work-stealing Library using Elastic Tasks

Vivek Kumar
DOI: 10.1109/HiPC50609.2020.00039
Published in: 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2020
Citations: 4

Abstract

Due to the challenges in providing adequate memory access to many cores on a single processor, multi-die and multi-socket multicore systems are becoming mainstream. These systems offer cache-coherent Non-Uniform Memory Access (NUMA) across several memory banks and cache hierarchies to increase memory capacity and bandwidth. Random work-stealing is a widely used technique for dynamic load balancing of tasks on multicore processors. However, it scales poorly on such NUMA systems for memory-bound applications due to cache misses and remote memory access latency. The Hierarchical Place Tree (HPT) [1] is a popular approach for improving the locality of a task-based parallel programming model, although it requires the programmer to map the dynamically unfolding tasks evenly over a NUMA system. Specifying data-affinity hints provides a more natural way to map tasks than the HPT. Still, a scalable work-stealing implementation for such hints remains largely unexplored on modern NUMA systems. This paper presents PufferFish, a new async-finish parallel programming model and work-stealing runtime for NUMA systems that tightly couples the data-affinity hints provided for an asynchronous task with the HPTs in the Habanero C/C++ library (HClib). PufferFish introduces Hierarchical Elastic Tasks (HETs), which improve locality by shrinking to run on a single worker inside a place or puffing up across multiple workers, depending on the work imbalance at a particular place in the HPT. We use a set of widely used memory-bound benchmarks exhibiting regular and irregular execution graphs to evaluate PufferFish. On these benchmarks, we show that PufferFish achieves geometric mean speedups of 1.5× and 1.9× over the HPT implementation in HClib and random work-stealing in CilkPlus, respectively, on a 32-core NUMA AMD EPYC processor.