Exploiting partial replication in unbalanced parallel loop scheduling on multicomputer

Salvatore Orlando, Raffaele Perego
Microprocessing and Microprogramming, Volume 41, Issue 8, Pages 645-658, April 1996. DOI: 10.1016/0165-6074(96)00002-6
https://www.sciencedirect.com/science/article/pii/0165607496000026

Abstract

We consider the problem of scheduling parallel loops whose iterations operate on large array data structures and have highly varying execution times (unbalanced or non-uniform parallel loops). We present a general parallel loop implementation template for message-passing distributed-memory multiprocessors (multicomputers). Assuming that the distribution of the computational load over the accessed data cannot be determined statically, the template uses a hybrid scheduling strategy. The data are partially replicated in the processors' local memories, and iterations are statically scheduled until the first load imbalances are detected. At that point, a dynamic scheduling technique moves iterations among the nodes that hold the same data. Most of the communication needed for dynamic load balancing is overlapped with computation by means of a prefetching policy. The template scales well, because knowing where data are replicated makes it possible to balance the load without introducing high overheads.
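The hybrid strategy described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the template from the paper: the function name `simulate`, the `group_size` parameter, the block distribution, and the rule "an idle processor takes one iteration from the most loaded peer in its replication group" are all hypothetical choices made for illustration. Communication costs and the prefetching policy are not modelled.

```python
import heapq

def simulate(costs, num_procs, group_size, dynamic=True):
    """Return the makespan of a block-scheduled parallel loop.

    costs      -- execution time of each iteration (unknown to the scheduler)
    num_procs  -- number of processors
    group_size -- processors per replication group (hypothetical parameter);
                  all members of a group are assumed to hold the same data
    dynamic    -- if True, idle processors take iterations from group peers
    """
    n = len(costs)
    block = -(-n // num_procs)  # ceiling division
    # Static phase: processor p initially owns one contiguous block.
    queues = [list(range(p * block, min((p + 1) * block, n)))
              for p in range(num_procs)]
    # Event queue of (time at which the processor becomes free, processor id).
    ready = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(ready)
    makespan = 0.0
    while ready:
        now, p = heapq.heappop(ready)
        if not queues[p] and dynamic:
            # Imbalance detected: p is idle. Only peers in the same
            # replication group hold data that p already has locally.
            first = (p // group_size) * group_size
            peers = range(first, min(first + group_size, num_procs))
            donor = max(peers, key=lambda q: len(queues[q]))
            if queues[donor]:
                queues[p].append(queues[donor].pop())  # take from the tail
        if queues[p]:
            i = queues[p].pop(0)
            heapq.heappush(ready, (now + costs[i], p))
        else:
            # No local work and nothing to take: p is finished.
            makespan = max(makespan, now)
    return makespan


costs = [8, 8, 8, 8, 1, 1, 1, 1]  # first block is much heavier
static_time = simulate(costs, num_procs=2, group_size=2, dynamic=False)
hybrid_time = simulate(costs, num_procs=2, group_size=2)
```

In this toy instance the purely static schedule is bounded by the heavy first block, while the hybrid schedule lets the lightly loaded processor take over iterations whose data it already holds. With `group_size=1` no data are replicated, so no iterations can move and the result matches the static schedule.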

The paper also proposes a formal characterization of the load imbalance of a generic problem instance. This characterization is used to derive an analytical cost model for the template and, in particular, to tune the template parameters that depend on the costs specific to the target machine and to the problem at hand.

The template and the related cost model are validated by experiments conducted on a 128-node nCUBE 2, whose results are reported and discussed.
