PufferFish: NUMA-Aware Work-stealing Library using Elastic Tasks

Vivek Kumar
DOI: 10.1109/HiPC50609.2020.00039
Published in: 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2020
Citations: 4

Abstract

Due to the challenges in providing adequate memory access to many cores on a single processor, multi-die and multi-socket multicore systems are becoming mainstream. These systems offer cache-coherent Non-Uniform Memory Access (NUMA) across several memory banks and cache hierarchies to increase memory capacity and bandwidth. Random work-stealing is a widely used technique for dynamic load balancing of tasks on multicore processors. However, it scales poorly on such NUMA systems for memory-bound applications due to cache misses and remote memory access latency. The Hierarchical Place Tree (HPT) [1] is a popular approach for improving the locality of a task-based parallel programming model, although it requires the programmer to map the dynamically unfolding tasks evenly over a NUMA system. Specifying data-affinity hints provides a more natural way to map tasks than the HPT. Still, a scalable work-stealing implementation for such hints remains largely unexplored on modern NUMA systems. This paper presents PufferFish, a new async-finish parallel programming model and work-stealing runtime for NUMA systems that tightly couples the data-affinity hints provided for an asynchronous task with the HPTs in the Habanero C/C++ library (HClib). PufferFish introduces Hierarchical Elastic Tasks (HETs), which improve locality by shrinking to run on a single worker inside a place or puffing up across multiple workers, depending on the work imbalance at a particular place in the HPT. We use a set of widely used memory-bound benchmarks exhibiting regular and irregular execution graphs to evaluate PufferFish. On these benchmarks, we show that PufferFish achieves geometric mean speedups of 1.5× and 1.9× over the HPT implementation in HClib and random work-stealing in CilkPlus, respectively, on a 32-core NUMA AMD EPYC processor.