Dynamic Fine-Grain Scheduling of Pipeline Parallelism

Daniel Sánchez, David Lo, Richard M. Yoo, J. Sugerman, C. Kozyrakis
2011 International Conference on Parallel Architectures and Compilation Techniques (PACT 2011)
DOI: 10.1109/PACT.2011.9 · Published: 2011-10-10 · Citations: 61

Abstract

Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data-parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per-stage queues and queuing policies, coupled with a backpressure mechanism, allows us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that the GRAMPS scheduler outperforms several other commonly used scheduling approaches. Specifically, while a typical task-stealing scheduler performs on par with GRAMPS on simple graphs, it does significantly worse on complex ones; a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints; and a typical static streaming scheduler achieves somewhat better locality, but suffers significant load imbalance on a general-purpose multi-core due to fine-grain architectural variability (e.g., cache misses and SMT).
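The core structure the abstract describes is a graph of stages connected by bounded queues, where a full queue exerts backpressure on its producer so the in-flight memory footprint stays within a fixed bound. GRAMPS itself is a native runtime with per-stage task-stealing and packet-stealing buffer management; the following is only a minimal, hypothetical Python sketch of the bounded-queue/backpressure idea for a linear pipeline (the names `stage`, `QUEUE_CAPACITY`, and the `DONE` sentinel are illustrative, not part of GRAMPS):

```python
import threading
import queue

QUEUE_CAPACITY = 4  # per-edge footprint bound (illustrative value)
DONE = object()     # sentinel marking end of stream

def stage(fn, inq, outq):
    """Apply fn to each item from inq, pushing results to outq.
    outq.put() blocks when outq is full: that blocking is the
    backpressure that bounds the pipeline's total footprint."""
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)
            return
        outq.put(fn(item))

# Three-queue, two-stage linear pipeline: square -> add one.
q0 = queue.Queue(maxsize=QUEUE_CAPACITY)
q1 = queue.Queue(maxsize=QUEUE_CAPACITY)
q2 = queue.Queue(maxsize=QUEUE_CAPACITY)

workers = [
    threading.Thread(target=stage, args=(lambda x: x * x, q0, q1)),
    threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)),
]
for w in workers:
    w.start()

# Producer: blocks on put() whenever q0 is full (backpressure).
for i in range(10):
    q0.put(i)
q0.put(DONE)

# Consumer: drain the final queue in order.
results = []
while True:
    item = q2.get()
    if item is DONE:
        break
    results.append(item)
for w in workers:
    w.join()
print(results)  # [1, 2, 5, 10, 17, 26, 37, 50, 65, 82]
```

At no point can more than `3 * QUEUE_CAPACITY` items be buffered, regardless of how fast the producer runs; this is the footprint-bounding property that plain task-stealing schedulers, which have no such per-edge limits, cannot guarantee.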