Exploiting partial replication in unbalanced parallel loop scheduling on multicomputer

Microprocessing and Microprogramming Pub Date : 1996-04-01 Epub Date: 2003-10-20 DOI:10.1016/0165-6074(96)00002-6

Salvatore Orlando , Raffaele Perego

Volume 41, Issue 8, Pages 645-658

Citations: 5

Abstract

We consider the problem of scheduling parallel loops whose iterations operate on large array data structures and are characterized by highly varying execution times (unbalanced or non-uniform parallel loops). A general parallel loop implementation template for message-passing distributed-memory multiprocessors (multicomputers) is presented. Assuming that it is impossible to statically determine the distribution of the computational load on the data accessed, the template exploits a hybrid scheduling strategy. The data are partially replicated in the processors' local memories, and iterations are statically scheduled until the first load imbalances are detected. At this point an effective dynamic scheduling technique is adopted to move iterations among nodes holding the same data. Most of the communications needed to implement dynamic load balancing are overlapped with computations, as a very effective prefetching policy is adopted. The template scales very well, since knowing where data are replicated makes it possible to balance the load without introducing high overheads.

The paper also proposes a formal characterization of the load imbalance of a generic problem instance. This characterization is used to derive an analytical cost model for the template and, in particular, to tune those template parameters that depend on the costs associated with the specific features of the target machine and the specific problem.

The template and the related cost model are validated by experiments conducted on a 128-node nCUBE 2, whose results are reported and discussed.
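
The hybrid strategy described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' template. It assumes that nodes are split into consecutive groups, that every node in a group holds a replica of the data for all iterations assigned to that group, that an imbalance is detected when a node runs out of its own iterations, and that migration is free (standing in for communication that is fully overlapped with computation). The function name `simulate`, the `group_size` parameter, and the policy of taking work from the longest queue in the group are all hypothetical choices made for illustration.

```python
import heapq
from collections import deque


def simulate(costs_per_node, group_size):
    """Return the makespan of a hybrid static/dynamic schedule.

    costs_per_node[p] lists the execution times of the iterations that are
    statically assigned to node p. Nodes are split into consecutive groups
    of `group_size`; each node is assumed to hold a replica of the data
    needed by every iteration assigned to its group (an assumed layout).
    """
    n = len(costs_per_node)
    queues = [deque(c) for c in costs_per_node]
    # Event heap of (time at which the node becomes idle, node id).
    events = [(0.0, p) for p in range(n)]
    heapq.heapify(events)
    makespan = 0.0
    while events:
        now, p = heapq.heappop(events)
        makespan = max(makespan, now)
        if queues[p]:
            # Static phase: run the next locally assigned iteration.
            cost = queues[p].popleft()
        else:
            # Dynamic phase: take an iteration from the node in the same
            # replication group that has the most iterations left.
            start = (p // group_size) * group_size
            group = range(start, min(start + group_size, n))
            victim = max(group, key=lambda q: len(queues[q]))
            if not queues[victim]:
                continue  # no work left in the group, so node p retires
            cost = queues[victim].pop()
        heapq.heappush(events, (now + cost, p))
    return makespan


if __name__ == "__main__":
    costs = [[4, 4, 4, 4], [1, 1, 1, 1]]
    print(simulate(costs, group_size=1))  # static only: 16.0
    print(simulate(costs, group_size=2))  # data replicated on both nodes: 12.0
```

With `group_size=1` no data are replicated, so the schedule is purely static and the slowest node determines the makespan. Larger groups let idle nodes absorb work from overloaded ones, but only within the set of nodes that already hold the data, which mirrors the abstract's point that knowing where data are replicated keeps balancing overhead low.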
