Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory
Roshan Dathathri, Ravi Teja Mullapudi, Uday Bondhugula
DOI: 10.1145/2948975 (https://doi.org/10.1145/2948975)
Published: 2016-08-08 (Journal Article)
Citations: 9
Abstract
Current de facto parallel programming models like OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task-parallel approaches that combine point-to-point synchronization between dependent tasks with dynamically scheduling dataflow runtimes are thus becoming attractive. Although these approaches can deliver good performance on both shared and distributed memory, there is little compiler support for them.
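To make the contrast concrete, here is a minimal sketch (not code from the paper; the chunk count, the stage functions, and the s1_done handle array are illustrative) of the same two-phase computation written bulk-synchronously with OpenMP parallel loops, and again as dependent OpenMP tasks whose depend clauses provide point-to-point synchronization between producer and consumer chunks.

```c
/* Sketch only: contrasts bulk-synchronous and task-dataflow styles.
 * Compile with: cc -fopenmp sketch.c */
#include <stdio.h>

#define NCHUNKS 16
#define CHUNK   4096

static double data[NCHUNKS][CHUNK];
static char s1_done[NCHUNKS];            /* dependence handles only */

static void stage1(int k) {              /* produce chunk k */
    for (int i = 0; i < CHUNK; i++) data[k][i] = k + i * 1e-6;
}
static void stage2(int k) {              /* consume chunk k */
    for (int i = 1; i < CHUNK; i++) data[k][i] += data[k][i - 1];
}

/* Bulk-synchronous: an implicit barrier separates the two phases, so no
 * chunk can enter stage2 until every chunk has finished stage1. */
static void run_bulk_synchronous(void) {
    #pragma omp parallel for
    for (int k = 0; k < NCHUNKS; k++) stage1(k);
    #pragma omp parallel for
    for (int k = 0; k < NCHUNKS; k++) stage2(k);
}

/* Task-level dataflow: the depend clauses give point-to-point
 * synchronization, so stage2(k) can start as soon as stage1(k) is done,
 * independently of the other chunks. */
static void run_dataflow(void) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NCHUNKS; k++) {
        #pragma omp task depend(out: s1_done[k])
        stage1(k);
        #pragma omp task depend(in: s1_done[k])
        stage2(k);
    }
}

int main(void) {
    run_bulk_synchronous();
    run_dataflow();
    printf("last element: %f\n", data[NCHUNKS - 1][CHUNK - 1]);
    return 0;
}
```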
In this article, we describe the design of compiler-runtime interaction to automatically extract coarse-grained dataflow parallelism in affine loop nests for both shared and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and generate components of the runtime that are used to dynamically schedule the generated tasks. The runtime includes a distributed decentralized scheduler that dynamically schedules tasks on a node. The schedulers on different nodes cooperate with each other through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, while running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6× (70.3× to 474.7×) over the sequential version and a geometric mean speedup of 1.64× (1.04× to 2.42×) over the state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and an emerging runtime (Intel Concurrent Collections) that demands higher programmer input and effort in parallelizing. To the best of our knowledge, ours is also the first automatic scheme that allows for dynamic scheduling of affine loop nests on a cluster of multicores.
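The following is a deliberately simplified, shared-memory, single-threaded sketch of the kind of dependence-counter scheduling the abstract describes; it is not the authors' generated code, and the task graph, structure fields, and function names are all illustrative assumptions. In the distributed setting, the per-node scheduler would additionally exchange data for cross-node dependences through asynchronous point-to-point messages (e.g., MPI_Isend/MPI_Irecv), which is omitted here for brevity.

```c
/* Sketch only: dependence-counter task scheduling over a tiny task graph. */
#include <stdio.h>

#define NTASKS 64

typedef struct {
    int id;
    int n_deps;                 /* unsatisfied incoming dependences */
    int succs[4];               /* successor task ids (-1 = none)   */
} task_t;

static task_t tasks[NTASKS];

/* Stand-in for compiler-generated dependence information: here, each odd
 * task waits on the even task just before it. */
static void build_task_graph(void) {
    for (int i = 0; i < NTASKS; i++) {
        tasks[i].id = i;
        tasks[i].n_deps = (i % 2 == 1) ? 1 : 0;
        for (int s = 0; s < 4; s++) tasks[i].succs[s] = -1;
        if (i % 2 == 0 && i + 1 < NTASKS) tasks[i].succs[0] = i + 1;
    }
}

static void execute_tile(int id) {        /* stand-in for a tile's loop body */
    printf("executing task %d\n", id);
}

/* Dynamic scheduler: keep a work list of ready tasks; finishing a task
 * decrements its successors' counters and releases those that reach zero. */
static void schedule(void) {
    int ready[NTASKS], n_ready = 0;
    for (int i = 0; i < NTASKS; i++)
        if (tasks[i].n_deps == 0) ready[n_ready++] = i;

    while (n_ready > 0) {
        int id = ready[--n_ready];        /* pick any ready task */
        execute_tile(id);
        for (int s = 0; s < 4; s++) {
            int succ = tasks[id].succs[s];
            if (succ >= 0 && --tasks[succ].n_deps == 0)
                ready[n_ready++] = succ;
        }
    }
}

int main(void) {
    build_task_graph();
    schedule();
    return 0;
}
```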