Dynamic Fine-Grain Scheduling of Pipeline Parallelism

Daniel Sánchez, David Lo, Richard M. Yoo, J. Sugerman, C. Kozyrakis
2011 International Conference on Parallel Architectures and Compilation Techniques (PACT 2011)
DOI: 10.1109/PACT.2011.9 · Published: 2011-10-10 · Citations: 61

Abstract

Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data-parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per-stage queues and queuing policies, coupled with a backpressure mechanism, allows us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that the GRAMPS scheduler outperforms several other commonly used scheduling approaches. Specifically, while a typical task-stealing scheduler performs on par with GRAMPS on simple graphs, it does significantly worse on complex ones; a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints; and a typical static streaming scheduler achieves somewhat better locality, but suffers significant load imbalance on a general-purpose multi-core due to fine-grain architectural variability (e.g., cache misses and SMT).
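The core structure the abstract describes is a graph of stages connected by bounded queues, where a full queue exerts backpressure on its producer so the in-flight memory footprint stays within a fixed bound. GRAMPS itself is a native runtime with per-stage task-stealing and packet-stealing buffer management; the following is only a minimal, hypothetical Python sketch of the bounded-queue/backpressure idea for a linear pipeline (the names `stage`, `QUEUE_CAPACITY`, and the `DONE` sentinel are illustrative, not part of GRAMPS):

```python
import threading
import queue

QUEUE_CAPACITY = 4  # per-edge footprint bound (illustrative value)
DONE = object()     # sentinel marking end of stream

def stage(fn, inq, outq):
    """Apply fn to each item from inq, pushing results to outq.
    outq.put() blocks when outq is full: that blocking is the
    backpressure that bounds the pipeline's total footprint."""
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)
            return
        outq.put(fn(item))

# Three-queue, two-stage linear pipeline: square -> add one.
q0 = queue.Queue(maxsize=QUEUE_CAPACITY)
q1 = queue.Queue(maxsize=QUEUE_CAPACITY)
q2 = queue.Queue(maxsize=QUEUE_CAPACITY)

workers = [
    threading.Thread(target=stage, args=(lambda x: x * x, q0, q1)),
    threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)),
]
for w in workers:
    w.start()

# Producer: blocks on put() whenever q0 is full (backpressure).
for i in range(10):
    q0.put(i)
q0.put(DONE)

# Consumer: drain the final queue in order.
results = []
while True:
    item = q2.get()
    if item is DONE:
        break
    results.append(item)
for w in workers:
    w.join()
print(results)  # [1, 2, 5, 10, 17, 26, 37, 50, 65, 82]
```

At no point can more than `3 * QUEUE_CAPACITY` items be buffered, regardless of how fast the producer runs; this is the footprint-bounding property that plain task-stealing schedulers, which have no such per-edge limits, cannot guarantee.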