[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium: Latest Articles


Failure analysis and modeling of a VAXcluster system

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89372

D. Tang, R. Iyer, Sujatha S. Subramani

The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources, even though these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines, which indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
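The gap between the 7-out-of-7 and 3-out-of-7 figures can be illustrated with a textbook k-out-of-n reliability calculation. The following is a minimal sketch only, not the authors' reward model: it assumes independent machines with identical exponential lifetimes and no repair (an assumption the abstract's finding of correlated failures argues against), and `RATE_PER_HOUR` is a hypothetical value, not a number from the paper.

```python
from math import comb, exp, log


def k_out_of_n_reliability(k, n, t, rate):
    """P(at least k of n machines are still up at time t).

    Assumes independent, identical exponential lifetimes and no repair.
    """
    p = exp(-rate * t)  # survival probability of one machine at time t
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def time_to_threshold(k, n, rate, threshold=0.5, t_max=1e6):
    """Bisect for the time at which the k-out-of-n reliability falls to threshold."""
    lo, hi = 0.0, t_max
    for _ in range(200):
        mid = (lo + hi) / 2
        if k_out_of_n_reliability(k, n, mid, rate) > threshold:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


RATE_PER_HOUR = 0.001  # hypothetical per-machine failure rate, for illustration only

t_all = time_to_threshold(7, 7, RATE_PER_HOUR)    # every machine must be up
t_three = time_to_threshold(3, 7, RATE_PER_HOUR)  # any three machines suffice
print(f"7-out-of-7 reaches 0.5 after {t_all:.1f} h")
print(f"3-out-of-7 reaches 0.5 after {t_three:.1f} h")
```

For the 7-out-of-7 case the result can be checked against the closed form ln 2 / (7 × rate), since all seven machines must survive. The toy model will not reproduce the measured 18-hour and 80-day values, which come from real error data and reward rates.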


Citations: 63

An analysis of a reconfigurable binary tree architecture based on multiple-level redundancy

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89366

Yung-Yuan Chen, S. Upadhyaya

The analysis of a multiple-level redundant tree (MLRT) structure is presented for the design of a reconfigurable tree architecture. The MLRT scheme tolerates the catastrophic failure of several locally redundant modules in the corresponding locally redundant modular tree (LRMT) structure. This analysis and experimental study establishes the advantages of the MLRT structure over the LRMT structure. The switch failures are taken into account for an accurate analysis of the reliability. A new measure, called the marginal-switch-to-processing-element-area ratio (MSR), is introduced to characterize the effect of switch complexity on the reliability of the redundant system. It can be used as an evaluation criterion in the design of practical fault-tolerant multiprocessor architectures. A technique for obtaining the best spare distribution in the MLRT structure is presented.


Citations: 12

Concurrent error detection and correction in real-time systolic sorting arrays

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89398

Sheng-Chiech Liang, S. Kuo

An approach to online error detection and correction for high-throughput VLSI sorting arrays is presented. The error model is defined at the sorting element level, and both functional errors and data errors generated by a faulty element are considered. The functional errors are detected and corrected by exploiting inherent properties of the sorting array, as well as special properties discovered by the authors. Coding techniques and an online fault diagnosis procedure are developed to locate data errors. All the checkers are designed to be totally self-checking, and hence the sorting array is highly reliable. Two-level pipelining is employed in this design, making it very efficient and suitable for real-time applications. The hardware overhead is not significant for typical array sizes, and the time penalty is only three clock cycles. The structure is very regular and therefore very attractive for VLSI or WSI implementation.


Citations: 15

Reliable diagnosis and repair in constant-degree multiprocessor systems

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89378

D. Blough, A. Pelc

The problem of diagnosis and repair has been studied under a probabilistic fault model that allows permanent or intermittent faults and perfect or imperfect spares. For all of these fault scenarios, it has been shown that correct diagnosis and repair can be achieved with high probability in a large class of constant-degree systems, including rings, grids, meshes, and tori. The total number of tests that must be conducted in the worst case in order to accomplish this diagnosis was shown to increase from O(n) in the case in which faults are permanent and spares are perfect to O(n log² n) when faults are intermittent and spares are imperfect.


Citations: 10

Optimal multiple syndrome probabilistic diagnosis

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89379

Sunggu Lee, K. Shin

The authors discuss the distributed self-diagnosis of a multiprocessor/multicomputer system based on interprocessor tests with imperfect fault coverage (thus also permitting intermittently faulty processors). It is shown that by using multiple fault syndromes, it is possible to achieve significantly better diagnosis than by using a single fault syndrome, even when the amount of time devoted to testing is the same. The authors derive a multiple syndrome diagnosis algorithm that is optimal in the level of diagnostic accuracy achieved (among diagnosis algorithms of a certain type to be defined) and produces good results even with sparse interconnection networks and interprocessor tests with low fault coverage. Furthermore, they prove upper and lower bounds on the number of fault syndromes required to produce asymptotically 100% correct diagnosis as N → ∞. Their solution and another multiple syndrome diagnosis solution by D. Fussell and S. Rangarajan are evaluated both analytically and with simulations.
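One simple intuition for why repeated syndromes help under imperfect coverage: if a single test catches a faulty processor with probability c, and outcomes in different syndromes are independent, then at least one of k syndromes catches it with probability 1 − (1 − c)^k. The sketch below computes only that quantity. It is not the authors' optimal diagnosis algorithm, the independence of test outcomes is an assumption, and both function names are hypothetical.

```python
def detection_probability(coverage, syndromes):
    """P(a faulty unit fails at least one of `syndromes` independent tests)."""
    return 1.0 - (1.0 - coverage) ** syndromes


def syndromes_needed(coverage, target):
    """Smallest number of syndromes whose detection probability reaches `target`."""
    if not 0.0 < coverage <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < coverage <= 1 and 0 <= target < 1")
    k = 1
    while detection_probability(coverage, k) < target:
        k += 1
    return k


# With 50% coverage per test, three syndromes already catch 87.5% of faults.
print(detection_probability(0.5, 3))
print(syndromes_needed(0.5, 0.99))
```

The calculation ignores how each syndrome is decoded and how much testing time each one consumes, both of which the paper treats explicitly.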


Citations: 15

Estimates of MTTF and optimal number of spares of fault-tolerant processor arrays

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89354

Y. Wang, J. Fortes

Reliability and mean-time-to-failure (MTTF) models of different fault-tolerant processor arrays (FTPAs) are introduced. On the basis of these models, approaches which allow for the analytical estimate of the necessary number of spares (NNS) and the optimal number of spares (ONS) are proposed. Knowledge of the NNS is suited to FTPAs where nonredundant hardware (hardware for which no redundancy is provided) is considered nearly fault free. Knowledge of the ONS is useful when faults can affect the nonredundant hardware, because in this case overall array reliability may actually decrease when the number of spares increases beyond some value. The quick estimates provided here can be used to help designers in the early design phases of an FTPA.
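The observation that reliability can fall once spares exceed some value can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the FTPA models of the paper: processing elements fail independently, reconfiguration is idealized (the array works whenever at least `n` of the `n + spares` elements work), and every element, spare or not, adds a slice of nonredundant support hardware that must also work. All parameter values and function names are hypothetical.

```python
from math import comb


def array_reliability(n, spares, pe_rel, support_rel):
    """Reliability of an n-element array with `spares` spare elements.

    pe_rel      -- reliability of one processing element (assumed independent)
    support_rel -- reliability of the nonredundant hardware added per element
    """
    total = n + spares
    # Redundant part: at least n of the n + spares elements are fault free.
    redundant = sum(
        comb(total, k) * pe_rel**k * (1 - pe_rel) ** (total - k)
        for k in range(n, total + 1)
    )
    # Nonredundant part: all support hardware must work, and it grows with spares.
    nonredundant = support_rel**total
    return redundant * nonredundant


def optimal_spares(n, pe_rel, support_rel, max_spares):
    """Number of spares in [0, max_spares] that maximizes array reliability."""
    return max(
        range(max_spares + 1),
        key=lambda s: array_reliability(n, s, pe_rel, support_rel),
    )


best = optimal_spares(n=16, pe_rel=0.9, support_rel=0.99, max_spares=20)
print(best, array_reliability(16, best, 0.9, 0.99))
```

With these hypothetical numbers the optimum lies strictly between zero and the maximum: the first few spares raise reliability sharply, after which the growing nonredundant hardware dominates. If `support_rel` is set to 1.0 (fault-free nonredundant hardware), adding spares never hurts in this model, which corresponds to the setting where the NNS is the relevant quantity.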


Citations: 2

Checkpointing and rollback-recovery in distributed object based systems

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89340

Luke Lin, M. Ahamad

Checkpointing and rollback-recovery algorithms in distributed object-based systems are presented. By utilizing the structure of objects and operation invocations, the authors have derived efficient algorithms that involve fewer participants than when invocations are treated as messages and existing algorithms for message-based systems are used. It is planned to implement these algorithms and evaluate their performance in the context of the Clouds project at Georgia Tech.


Citations: 38

On the performance of software testing using multiple versions

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89395

S. Brilliant, J. Knight, P. Ammann

The authors present analytic models of the performance of comparison checking (also called back-to-back testing and automatic testing), and they use these models to investigate its effectiveness. A Markov model is used to analyze the observation time required for a test system to uncover a fault using comparison checking. A basis for evaluation is provided by developing a similar Markov model for the analysis of ideal checking, i.e., using a perfect (though unrealizable) oracle. Also presented is a model of the effect of comparison checking on a version's failure probability as testing proceeds. Again, comparison checking is evaluated against ideal checking. The analyses show that comparison checking is a powerful and effective technique.
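Comparison checking itself is easy to sketch: run independently written versions on the same inputs and flag every input on which their outputs disagree. The example below is a minimal illustration, not the authors' Markov model; the two leap-year functions and the seeded fault in `version_b` are hypothetical.

```python
def version_a(year):
    """Reference-style leap-year rule (Gregorian calendar)."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def version_b(year):
    """Second version with a seeded fault: the 400-year rule is missing."""
    return year % 4 == 0 and year % 100 != 0


def comparison_check(versions, inputs):
    """Return the inputs on which the versions do not all agree."""
    discrepancies = []
    for x in inputs:
        outputs = [version(x) for version in versions]
        if any(out != outputs[0] for out in outputs[1:]):
            discrepancies.append(x)
    return discrepancies


print(comparison_check([version_a, version_b], range(1896, 2005)))
```

A discrepancy shows only that at least one version is wrong, not which one. Inputs on which all versions fail identically go unnoticed, which is the gap between comparison checking and the ideal oracle that the abstract uses as its baseline.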


Citations: 17

Signature analysers based on additive cellular automata

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89374

A. Das, D. Saha, A. R. Chowdhury, S. Misra, P. P. Chaudhuri

A novel scheme for signature analysis based on cellular automata (CA) is proposed. The state transition behavior of such signature analyzers has been modeled by a Markov chain. It has been shown that a special class of such CAs achieves a steady-state aliasing probability lower than 1/2^n (for an n-cell CA) for specific ranges of input probabilities of the incoming error pattern. The dynamic behavior of linear feedback shift registers (LFSRs) has also been compared with CAs with the same characteristic polynomials. This work establishes the fact that CA-based signature analyzers outperform those based on LFSRs as regards both steady-state and dynamic behavior.
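To make the notion of aliasing concrete, here is a minimal sketch of a multiple-input signature register built from an additive CA. It assumes the common construction with rule 90 and rule 150 cells and null (zero) boundaries, with each input word XORed into the state after every CA step. The abstract does not specify the authors' exact CA class, so the rule vector and all names below are hypothetical.

```python
def ca_step(state, rules):
    """One step of a null-boundary additive CA.

    state -- list of bits; rules -- list of 90 or 150, one per cell.
    Rule 90:  next = left XOR right
    Rule 150: next = left XOR self XOR right
    """
    n = len(state)
    nxt = []
    for i, rule in enumerate(rules):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        bit = left ^ right
        if rule == 150:
            bit ^= state[i]
        nxt.append(bit)
    return nxt


def ca_signature(words, rules):
    """Compact a stream of n-bit words into an n-bit signature, starting from zero."""
    state = [0] * len(rules)
    for word in words:
        state = [s ^ w for s, w in zip(ca_step(state, rules), word)]
    return state


def xor_words(a, b):
    """Bitwise XOR of two equally long word streams."""
    return [[x ^ y for x, y in zip(wa, wb)] for wa, wb in zip(a, b)]


rules = [90, 150, 90, 150]  # hypothetical rule vector
good = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
error = [[0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
faulty = xor_words(good, error)

# A fault is missed (aliased) exactly when the error stream's own signature is zero.
print(ca_signature(good, rules), ca_signature(faulty, rules), ca_signature(error, rules))
```

Because every operation is an XOR, the register is linear: the signature of the faulty stream equals the signature of the good stream XORed with the signature of the error stream. Aliasing therefore depends only on the error pattern, which is why the aliasing probability can be analyzed in terms of the input probabilities of that pattern.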


Citations: 24

Impact of reconfiguration logic on the optimization of defect-tolerant integrated circuits

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium

Pub Date : 1990-06-26 DOI: 10.1109/FTCS.1990.89351

C. Thibeault, J. Houle

Two aspects of the impact of reconfiguration logic on the optimization of defect-tolerant integrated circuits (ICs) are analyzed. An important consequence of neglecting reconfiguration logic for design decisions is presented. Expressions are developed to predict the number of transistors necessary to implement the reconfiguration logic of a simple defect-tolerance strategy using CMOS technology. The results show that neglecting this reconfiguration logic can lead to inappropriate design decisions. An example of a fine-grain logic array is presented to demonstrate the latter conclusion.


Citations: 2
