Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology

L. Shih
{"title":"Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology","authors":"L. Shih","doi":"10.1145/2484762.2484787","DOIUrl":null,"url":null,"abstract":"Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with 2-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network latency adaptive parallel mapping framework, from static task DFG, to parallel processor network topology graph, is aimed at automatically optimizing workflow task scheduling among computation cluster nodes or subnets, including CPU, multicore, VLIW and co-processor accelerators such as GPUs, DSPs, FPGA fabric blocks, etc. The latency-adaptive parallel mapper starts scheduling by assigning the highest priority task a centrally located, capable processor in the network topology, and then conservatively assigns additional nearby, capable network processor cores only as needed to improve computation efficiency with fewest, yet sufficient processors scheduled. For slower communication with high inter/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules just a single sequential processor, over parallel processing. The examples tested on a simulated adaptive mapper, demonstrate that the latency-adaptive parallel resource mapping successfully achieves better cost-efficiency in comparison to fixed task-to-processor mapping, in nearly optimal speedup, using only fewer nearby processors, resulting in only 1 or no processor/switch hop in around 90% of the data transfers. Inversely for faster networks, more processors are scheduled automatically due to lower inter-processor latency. In extreme cases, where offloading next task to another processor may be faster than waiting for a processor to finish the current task (i.e., when inter/intra-processor latency ratio < 1), the latency adaptive mapper seems to extrapolate well on how pipeline processing can outperform parallel processing, offering a surprising bonus in this parallel resource mapping study.","PeriodicalId":426819,"journal":{"name":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484762.2484787","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with two-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network-latency-adaptive parallel mapping framework, from static task DFG to parallel processor network topology graph, aims to automatically optimize workflow task scheduling among computation cluster nodes or subnets, including CPUs, multicore processors, VLIW processors, and co-processor accelerators such as GPUs, DSPs, and FPGA fabric blocks. The latency-adaptive parallel mapper starts scheduling by assigning the highest-priority task to a centrally located, capable processor in the network topology, and then conservatively recruits additional nearby, capable processor cores only as needed, improving computation efficiency with the fewest processors that still suffice. For slower communication with high inter-/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules a single sequential processor instead of parallel processing. Examples tested on a simulated adaptive mapper demonstrate that latency-adaptive parallel resource mapping achieves better cost-efficiency than fixed task-to-processor mapping: nearly optimal speedup using fewer, nearby processors, with at most one processor/switch hop in around 90% of the data transfers. Conversely, on faster networks, more processors are scheduled automatically because inter-processor latency is lower. In the extreme case where offloading the next task to another processor is faster than waiting for the current processor to finish (i.e., when the inter-/intra-processor latency ratio is below 1), the latency-adaptive mapper extrapolates well to pipeline processing outperforming parallel processing, a surprising bonus of this parallel resource mapping study.
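
To make the scheduling policy concrete, here is a minimal Python sketch of the latency-adaptive idea: start at a central processor, recruit at most one nearby core per step, and place each task wherever it finishes earliest under the latency model. The greedy earliest-finish rule stands in for the paper's A* search; every name, the homogeneous compute-cost model, and the single `intra_lat` knob are illustrative assumptions, not the paper's actual formulation.

```python
# Toy model of the latency-adaptive mapping heuristic described in the
# abstract. Greedy earliest-finish placement is used here in place of the
# paper's A* search; all identifiers and the cost model are assumptions.

def latency_adaptive_map(tasks, preds, cost, procs, hops, hop_lat, intra_lat):
    """Map a task DFG onto a processor topology, recruiting cores lazily.

    tasks     : task ids in priority (topological) order, highest first
    preds     : {task: [predecessor task ids]}
    cost      : {task: compute time} (homogeneous processors for simplicity)
    procs     : processor ids; procs[0] is taken as the central processor
    hops      : {(p, q): hop count} for every processor pair, 0 on diagonal
    hop_lat   : inter-processor latency per processor/switch hop
    intra_lat : latency of handing data to the next task on the same core;
                the inter/intra ratio drives the mapper's behavior
    """
    placed, finish = {}, {}                # task -> processor / finish time
    free_at = {p: 0.0 for p in procs}      # when each processor goes idle
    used = []                              # processors recruited so far

    for t in tasks:
        # Candidates: every recruited core, plus at most ONE new core, the
        # nearest unused one -- this is the "conservative" growth step.
        candidates = used[:] if used else [procs[0]]
        unused = [p for p in procs if p not in used]
        if used and unused:
            candidates.append(min(
                unused, key=lambda p: min(hops[(q, p)] for q in used)))

        best = None
        for p in candidates:
            ready = 0.0
            for u in preds[t]:             # wait for predecessor data
                lat = (intra_lat if placed[u] == p
                       else hop_lat * hops[(placed[u], p)])
                ready = max(ready, finish[u] + lat)
            end = max(ready, free_at[p]) + cost[t]
            if best is None or end < best[0]:   # strict '<': reuse on ties
                best = (end, p)

        end, p = best
        placed[t], finish[t], free_at[p] = p, end, end
        if p not in used:
            used.append(p)
    return placed, finish


if __name__ == "__main__":
    # Diamond DFG a -> {b, c} -> d on two processors one hop apart.
    preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    cost = {t: 10.0 for t in preds}
    procs = ["p0", "p1"]
    hops = {("p0", "p0"): 0, ("p1", "p1"): 0,
            ("p0", "p1"): 1, ("p1", "p0"): 1}

    # Slow network (high inter/intra ratio): everything stays on p0.
    print(latency_adaptive_map(list("abcd"), preds, cost, procs, hops,
                               hop_lat=50.0, intra_lat=0.0)[0])
    # Fast network: the nearby core p1 is recruited so b and c overlap.
    print(latency_adaptive_map(list("abcd"), preds, cost, procs, hops,
                               hop_lat=1.0, intra_lat=0.0)[0])
```

On the diamond example, the slow-network run keeps all four tasks on p0, while the fast-network run places c and d on p1 so that b and c execute concurrently. With `intra_lat` larger than the per-hop cost (inter/intra ratio below 1), the same earliest-finish rule hands each successive task of a linear chain to a neighboring core, reproducing the pipeline-over-parallel effect the abstract describes.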