Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology

L. Shih
{"title":"Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology","authors":"L. Shih","doi":"10.1145/2484762.2484787","DOIUrl":null,"url":null,"abstract":"Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with 2-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network latency adaptive parallel mapping framework, from static task DFG, to parallel processor network topology graph, is aimed at automatically optimizing workflow task scheduling among computation cluster nodes or subnets, including CPU, multicore, VLIW and co-processor accelerators such as GPUs, DSPs, FPGA fabric blocks, etc. The latency-adaptive parallel mapper starts scheduling by assigning the highest priority task a centrally located, capable processor in the network topology, and then conservatively assigns additional nearby, capable network processor cores only as needed to improve computation efficiency with fewest, yet sufficient processors scheduled. For slower communication with high inter/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules just a single sequential processor, over parallel processing. The examples tested on a simulated adaptive mapper, demonstrate that the latency-adaptive parallel resource mapping successfully achieves better cost-efficiency in comparison to fixed task-to-processor mapping, in nearly optimal speedup, using only fewer nearby processors, resulting in only 1 or no processor/switch hop in around 90% of the data transfers. Inversely for faster networks, more processors are scheduled automatically due to lower inter-processor latency. In extreme cases, where offloading next task to another processor may be faster than waiting for a processor to finish the current task (i.e., when inter/intra-processor latency ratio < 1), the latency adaptive mapper seems to extrapolate well on how pipeline processing can outperform parallel processing, offering a surprising bonus in this parallel resource mapping study.","PeriodicalId":426819,"journal":{"name":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484762.2484787","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with two-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network-latency-adaptive parallel mapping framework, from static task DFG to parallel processor network topology graph, aims to automatically optimize workflow task scheduling among computation cluster nodes or subnets, including CPUs, multicore processors, VLIW processors, and co-processor accelerators such as GPUs, DSPs, and FPGA fabric blocks. The latency-adaptive parallel mapper starts scheduling by assigning the highest-priority task to a centrally located, capable processor in the network topology, and then conservatively recruits additional nearby, capable processor cores only as needed, improving computation efficiency with the fewest processors that still suffice. For slower communication with high inter-/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules a single sequential processor instead of parallel processing. Examples tested on a simulated adaptive mapper demonstrate that latency-adaptive parallel resource mapping achieves better cost-efficiency than fixed task-to-processor mapping: nearly optimal speedup using fewer, nearby processors, with at most one processor/switch hop in around 90% of the data transfers. Conversely, on faster networks, more processors are scheduled automatically because inter-processor latency is lower. In the extreme case where offloading the next task to another processor is faster than waiting for the current processor to finish (i.e., when the inter-/intra-processor latency ratio is below 1), the latency-adaptive mapper extrapolates well to pipeline processing outperforming parallel processing, a surprising bonus of this parallel resource mapping study.
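
To make the scheduling policy concrete, here is a minimal Python sketch of the latency-adaptive idea: start at a central processor, recruit at most one nearby core per step, and place each task wherever it finishes earliest under the latency model. The greedy earliest-finish rule stands in for the paper's A* search; every name, the homogeneous compute-cost model, and the single `intra_lat` knob are illustrative assumptions, not the paper's actual formulation.

```python
# Toy model of the latency-adaptive mapping heuristic described in the
# abstract. Greedy earliest-finish placement is used here in place of the
# paper's A* search; all identifiers and the cost model are assumptions.

def latency_adaptive_map(tasks, preds, cost, procs, hops, hop_lat, intra_lat):
    """Map a task DFG onto a processor topology, recruiting cores lazily.

    tasks     : task ids in priority (topological) order, highest first
    preds     : {task: [predecessor task ids]}
    cost      : {task: compute time} (homogeneous processors for simplicity)
    procs     : processor ids; procs[0] is taken as the central processor
    hops      : {(p, q): hop count} for every processor pair, 0 on diagonal
    hop_lat   : inter-processor latency per processor/switch hop
    intra_lat : latency of handing data to the next task on the same core;
                the inter/intra ratio drives the mapper's behavior
    """
    placed, finish = {}, {}                # task -> processor / finish time
    free_at = {p: 0.0 for p in procs}      # when each processor goes idle
    used = []                              # processors recruited so far

    for t in tasks:
        # Candidates: every recruited core, plus at most ONE new core, the
        # nearest unused one -- this is the "conservative" growth step.
        candidates = used[:] if used else [procs[0]]
        unused = [p for p in procs if p not in used]
        if used and unused:
            candidates.append(min(
                unused, key=lambda p: min(hops[(q, p)] for q in used)))

        best = None
        for p in candidates:
            ready = 0.0
            for u in preds[t]:             # wait for predecessor data
                lat = (intra_lat if placed[u] == p
                       else hop_lat * hops[(placed[u], p)])
                ready = max(ready, finish[u] + lat)
            end = max(ready, free_at[p]) + cost[t]
            if best is None or end < best[0]:   # strict '<': reuse on ties
                best = (end, p)

        end, p = best
        placed[t], finish[t], free_at[p] = p, end, end
        if p not in used:
            used.append(p)
    return placed, finish


if __name__ == "__main__":
    # Diamond DFG a -> {b, c} -> d on two processors one hop apart.
    preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    cost = {t: 10.0 for t in preds}
    procs = ["p0", "p1"]
    hops = {("p0", "p0"): 0, ("p1", "p1"): 0,
            ("p0", "p1"): 1, ("p1", "p0"): 1}

    # Slow network (high inter/intra ratio): everything stays on p0.
    print(latency_adaptive_map(list("abcd"), preds, cost, procs, hops,
                               hop_lat=50.0, intra_lat=0.0)[0])
    # Fast network: the nearby core p1 is recruited so b and c overlap.
    print(latency_adaptive_map(list("abcd"), preds, cost, procs, hops,
                               hop_lat=1.0, intra_lat=0.0)[0])
```

On the diamond example, the slow-network run keeps all four tasks on p0, while the fast-network run places c and d on p1 so that b and c execute concurrently. With `intra_lat` larger than the per-hop cost (inter/intra ratio below 1), the same earliest-finish rule hands each successive task of a linear chain to a neighboring core, reproducing the pipeline-over-parallel effect the abstract describes.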