RAS Modeling of an HPC Switch System

D. Tang, William Bryson, Richard Elling
Published in: 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing
Publication date: 2008-12-15
DOI: 10.1109/PRDC.2008.19 (https://doi.org/10.1109/PRDC.2008.19)

Abstract

The high end of high-performance computing (HPC) systems is now moving toward petascale deployments, delivering petaflops of computational capacity and petabytes of storage capacity. Interconnecting the large number of server nodes in an HPC system plays a vital role in these developments. InfiniBand has emerged as a compelling interconnect technology, providing more scalability and significantly better cost-performance than any other known protocol. This paper presents reliability, availability, and serviceability (RAS) modeling and analysis, against hardware faults, of the Sun Datacenter Switch 3456 system, the world's largest standards-based InfiniBand switch, with direct capacity to host up to 3,456 server nodes. The results show that system reliability, measured as connectivity between the server nodes physically connected to the switch, is high for configurations with redundant ports. The study also shows that deferred repair strategies can significantly reduce unscheduled service events and system downtime. Further, the study identifies optimal service strategies through a tradeoff analysis of reliability and availability.
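The abstract's claim that redundant ports raise connectivity reliability can be illustrated with a minimal sketch. It assumes independent ports with a constant (exponential) failure rate and no repair, which is a textbook simplification and not the model used in the paper. The function names, the failure rate, and the mission time are all hypothetical values chosen for illustration.

```python
import math


def port_reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability that one port survives `hours`, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def redundant_reliability(single: float, ports: int) -> float:
    """Probability that at least one of `ports` independent ports survives."""
    return 1.0 - (1.0 - single) ** ports


# Hypothetical inputs, not taken from the paper.
FAILURE_RATE = 1e-5    # failures per hour for one port path
MISSION_HOURS = 8760   # one year of operation

single = port_reliability(FAILURE_RATE, MISSION_HOURS)
dual = redundant_reliability(single, ports=2)
print(f"single port: {single:.4f}, dual redundant ports: {dual:.4f}")
```

Under these assumptions a node loses connectivity only if both of its ports fail, so the dual-port figure is always at least as high as the single-port one. A model with repair, such as the deferred repair strategies the abstract mentions, would need a Markov or simulation approach that this sketch does not attempt.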