Fault-tolerant message switching based on wormhole switching and backtracking

Manabu Sueishi, M. Kitakami, Hideo Ito
{"title":"Fault-tolerant message switching based on wormhole switching and backtracking","authors":"Manabu Sueishi, M. Kitakami, Hideo Ito","doi":"10.1109/PRDC.2004.1276569","DOIUrl":null,"url":null,"abstract":"Parallel computers are now popularly applied to applications where many calculations are required. In a NO Remote memory Access model (NORA) parallel computer, many processors are connected by communication links and calculation results are obtained by communications among processors. The message switching method, which controls message transmission in the parallel computer, is one of the most important parameters to improve the performance of the parallel computer. Since parallel computers include many processors, its failure rate is very high and many fault-tolerant switching methods have been proposed. The existing methods have problems, however, such as low communication throughput, low fault-tolerant capability, and large hardware overhead. We propose fault-tolerant switching by improving wormhole switching. The proposed method inserts dummy flits, having no information, after the header flit, the first flit of the packet. By overwriting the header flit to the dummy flit, backtracking is implemented without hardware overhead. Computer simulation says that in a 16 by 16 2D torus, for example, the throughput of the proposed method is almost equal to that of existing methods which require large hardware overhead if the number of the faulty nodes is less then 40.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PRDC.2004.1276569","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Parallel computers are now popularly applied to applications where many calculations are required. In a NO Remote memory Access model (NORA) parallel computer, many processors are connected by communication links and calculation results are obtained by communications among processors. The message switching method, which controls message transmission in the parallel computer, is one of the most important parameters to improve the performance of the parallel computer. Since parallel computers include many processors, its failure rate is very high and many fault-tolerant switching methods have been proposed. The existing methods have problems, however, such as low communication throughput, low fault-tolerant capability, and large hardware overhead. We propose fault-tolerant switching by improving wormhole switching. The proposed method inserts dummy flits, having no information, after the header flit, the first flit of the packet. By overwriting the header flit to the dummy flit, backtracking is implemented without hardware overhead. Computer simulation says that in a 16 by 16 2D torus, for example, the throughput of the proposed method is almost equal to that of existing methods which require large hardware overhead if the number of the faulty nodes is less then 40.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于虫洞交换和回溯的容错消息交换
并行计算机现在广泛应用于需要大量计算的应用中。在无远程存储器访问模型(NORA)并行计算机中,多个处理器之间通过通信链路连接,通过处理器之间的通信获得计算结果。消息交换方法控制并行计算机中的消息传输,是提高并行计算机性能的重要参数之一。由于并行计算机包含许多处理器,故障率很高,因此提出了许多容错切换方法。但是,现有的方法存在通信吞吐量低、容错能力差、硬件开销大等问题。我们通过改进虫洞交换提出了容错交换。所提出的方法在数据包的头字节(第一个字节)之后插入不包含任何信息的虚拟字节。通过将报头flit覆盖到虚拟flit,可以实现回溯,而无需硬件开销。计算机仿真表明,以一个16 × 16的二维环面为例,当故障节点数小于40时,所提出方法的吞吐量与现有需要大量硬件开销的方法的吞吐量几乎相等。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
The system recovery benchmark Availabilities and costs of reliable Fat-Btrees Application-level fault tolerance in the orbital thermal imaging spectrometer Expected-reliability analysis for wireless CORBA with imperfect components Failure handling in a reliable multicast protocol for improving buffer utilization and accommodating heterogeneous receivers
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1