Failure analysis and modeling of a VAXcluster system

D. Tang, R. Iyer, Sujatha S. Subramani
Published in: [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium
Publication date: 1990-06-26
DOI: 10.1109/FTCS.1990.89372 (https://doi.org/10.1109/FTCS.1990.89372)
Citations: 63

Abstract

The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources. This is despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
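To make two of the quantities in the abstract concrete, the following is a minimal sketch, not the authors' reward model. The first function only does the arithmetic implied by the reported availability of 0.993 over 250 days. The second computes a textbook k-out-of-n "up" probability under the assumption of independent, identical machines; that assumption is ours for illustration, and the abstract itself reports that correlated failures are significant, so real figures would differ. The function names and the per-machine value `p_up = 0.9` are hypothetical.

```python
from math import comb


def expected_downtime_days(availability: float, period_days: float) -> float:
    """Cumulative downtime implied by an availability figure over a period."""
    return (1.0 - availability) * period_days


def k_out_of_n_up_probability(k: int, n: int, p_up: float) -> float:
    """Probability that at least k of n machines are up.

    ASSUMPTION: machines are independent and identical. This is a textbook
    simplification and not the measurement-based model from the paper.
    """
    return sum(
        comb(n, i) * p_up**i * (1.0 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Availability 0.993 over 250 days, as reported in the abstract.
    downtime = expected_downtime_days(0.993, 250)
    print(f"Implied downtime: {downtime:.2f} days ({downtime * 24:.0f} hours)")

    # Hypothetical per-machine probability of being up (not from the paper).
    p_up = 0.9
    print("7-out-of-7:", k_out_of_n_up_probability(7, 7, p_up))
    print("3-out-of-7:", k_out_of_n_up_probability(3, 7, p_up))
```

The availability figure corresponds to roughly 1.75 days of accumulated downtime over the 250-day period. The k-out-of-n comparison shows, in the simplest setting, why the 3-out-of-7 model tolerates far more machine failures than the 7-out-of-7 model.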