{"title":"Load Sharing In Hypercube Multicomputers In The Presence Of Node Failures","authors":"Yi-Chieh Chang, K. Shin","doi":"10.1109/DMCC.1990.556410","DOIUrl":null,"url":null,"abstract":"This paper discusses and analyzes two load sharing (LS) issues: adjusting preferred lists and implementing a fault-tolerant mechanism in the presence of node failures. In an early paper, we have proposed to order the nodes in each node's proximity into a preferred list for the purpose of load sharing in distributed real-time systems. The preferred list of each node is constructed in such a way that each node will be selected as the kth preferred node by one and only one other node. Such lists are proven to allow the tasks to be evenly distributed in a system. However, the presence of faulty nodes will destroy the original structure of a preferred list if the faulty nodes are simply skipped in the preferred list. An algorithm is therefore proposed to modify each preferred list to retain its original features regardless of the number of faulty nodes in the system. The communication overhead introduced by this algorithm is shown to be minimal. Based on the modified preferred lists, a simple fault-tolerant mechanism is implemented. Each node is equipped with a backup queue which 'The work reported in this paper was supported in part by the Office of Naval Research under contract N0001485-K-0122, and the NSF under grant DMC-8721492. Any opinions, findings, and recommendations expressed in this publication are those of the authors and do not necessarily reflect the view of the funding agencies. stores and updates the arriving/completing tasks at its most preferrecd node. Whenever a node becomes faulty, its m.ost preferred node will treat the tasks in the baxkup queue as externally axriving tasks. Our simulation results show that this approach, despite of the simplicity, can reduce the number of task losses dramatically, as compared to the approaches without any faulttolerant mechanism.","PeriodicalId":204431,"journal":{"name":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","volume":"337 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1990-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fifth Distributed Memory Computing Conference, 1990.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DMCC.1990.556410","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
This paper discusses and analyzes two load sharing (LS) issues: adjusting preferred lists and implementing a fault-tolerant mechanism in the presence of node failures. In an earlier paper, we proposed ordering the nodes in each node's proximity into a preferred list for the purpose of load sharing in distributed real-time systems. The preferred list of each node is constructed in such a way that each node is selected as the kth preferred node by one and only one other node. Such lists are proven to allow tasks to be evenly distributed in the system. However, the presence of faulty nodes will destroy the original structure of a preferred list if the faulty nodes are simply skipped. An algorithm is therefore proposed to modify each preferred list so that it retains its original features regardless of the number of faulty nodes in the system. The communication overhead introduced by this algorithm is shown to be minimal. Based on the modified preferred lists, a simple fault-tolerant mechanism is implemented. Each node is equipped with a backup queue which stores and updates the arriving/completing tasks at its most preferred node. Whenever a node becomes faulty, its most preferred node treats the tasks in the backup queue as externally arriving tasks. Our simulation results show that this approach, despite its simplicity, can dramatically reduce the number of task losses compared to approaches without any fault-tolerant mechanism.

Footnote: The work reported in this paper was supported in part by the Office of Naval Research under contract N0001485-K-0122, and the NSF under grant DMC-8721492. Any opinions, findings, and recommendations expressed in this publication are those of the authors and do not necessarily reflect the view of the funding agencies.
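The abstract does not give the authors' exact construction, but the stated property (every node is the kth preferred node of one and only one other node) can be illustrated with a simple XOR-based ordering in a hypercube: because XOR with a fixed nonzero offset is a bijection on node addresses, each rank k defines a permutation of the nodes. The sketch below is an illustrative assumption, not the paper's algorithm; the offset ordering by Hamming distance and the function names (preferred_list, skip_faulty) are hypothetical.

```python
# Minimal sketch of a preferred-list construction with the "one and only one
# kth preferred node" property, assuming an XOR-with-offset ordering in a
# d-dimensional hypercube. This is NOT the authors' published construction.

def preferred_list(node: int, dim: int) -> list[int]:
    """Return a preferred list for `node` in a `dim`-dimensional hypercube.

    Offsets are ordered by Hamming weight (hypercube distance), so nearer
    nodes come first. For each rank k, node -> node ^ offsets[k] is a
    bijection, hence every node is the kth preferred node of exactly one node.
    """
    offsets = sorted(range(1, 2 ** dim), key=lambda s: (bin(s).count("1"), s))
    return [node ^ s for s in offsets]


def skip_faulty(pref: list[int], faulty: set[int]) -> list[int]:
    """Naively drop faulty nodes from a preferred list.

    This is the behaviour the paper argues against: skipping faulty nodes
    breaks the one-to-one structure of the lists, which is why a dedicated
    list-modification algorithm is proposed instead.
    """
    return [n for n in pref if n not in faulty]


if __name__ == "__main__":
    DIM = 3  # 8-node hypercube
    lists = {n: preferred_list(n, DIM) for n in range(2 ** DIM)}

    # Verify the stated property: for every rank k, each node appears as the
    # kth preferred node of exactly one other node.
    for k in range(2 ** DIM - 1):
        assert sorted(lst[k] for lst in lists.values()) == list(range(2 ** DIM))

    print(lists[0])  # node 0's list, e.g. [1, 2, 4, 3, 5, 6, 7]
```

Under this assumed construction, the assertion above passes for any dimension, which mirrors the paper's claim that the lists distribute tasks evenly; the fault-tolerant backup-queue mechanism would then sit on top of such lists, with each node's most preferred node replaying the backed-up tasks when the node fails.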