The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources, even though these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines, which indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
{"title":"Failure analysis and modeling of a VAXcluster system","authors":"D. Tang, R. Iyer, Sujatha S. Subramani","doi":"10.1109/FTCS.1990.89372","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89372","url":null,"abstract":"The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources. This is despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 250 days of operation.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"697 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133167034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
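The 7-out-of-7 and 3-out-of-7 figures in the VAXcluster abstract can be given some intuition with a textbook k-out-of-n calculation. The sketch below is not the authors' measurement-based reward model: it assumes identical machines that fail independently with an exponential lifetime, and the failure rate is a hypothetical value chosen only for illustration. Since the abstract reports that correlated failures are significant, an independence-based estimate like this one should be read as optimistic.

```python
from math import comb, exp


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n machines are up.

    Assumes each machine is up independently with probability p
    (an assumption of this sketch, not a finding of the paper).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def machine_survival(rate_per_hour, hours):
    """Survival probability of one machine under an exponential lifetime."""
    return exp(-rate_per_hour * hours)


if __name__ == "__main__":
    rate = 1.0 / 200.0  # hypothetical per-machine failure rate (per hour)
    for hours in (18, 24 * 80):
        p = machine_survival(rate, hours)
        print(
            hours,
            k_out_of_n_reliability(7, 7, p),  # all seven machines required
            k_out_of_n_reliability(3, 7, p),  # any three machines suffice
        )
```

The 7-out-of-7 case reduces to `p**7`, which is why it decays so much faster than the 3-out-of-7 case.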
The analysis of a multiple-level redundant tree (MLRT) structure is presented for the design of a reconfigurable tree architecture. The MLRT scheme tolerates the catastrophic failure of several locally redundant modules in the corresponding locally redundant modular tree (LRMT) structure. This analysis and experimental study establishes the advantages of the MLRT structure over the LRMT structure. The switch failures are taken into account for an accurate analysis of the reliability. A new measure, called the marginal-switch-to-processing-element-area ratio (MSR), is introduced to characterize the effect of switch complexity on the reliability of the redundant system. It can be used as an evaluation criterion in the design of practical fault-tolerant multiprocessor architectures. A technique for obtaining the best spare distribution in the MLRT structure is presented.
{"title":"An analysis of a reconfigurable binary tree architecture based on multiple-level redundancy","authors":"Yung-Yuan Chen, S. Upadhyaya","doi":"10.1109/FTCS.1990.89366","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89366","url":null,"abstract":"The analysis of a multiple-level redundant tree (MLRT) structure is presented for the design of a reconfigurable tree architecture. The MLRT scheme tolerates the catastrophic failure of several locally redundant modules in the corresponding locally redundant modular tree (LRMT) structure. This analysis and experimental study establishes the advantages of the MLRT structure over the LRMT structure. The switch failures are taken into account for an accurate analysis of the reliability. A new measure, called the marginal-switch-to-processing-element-area ratio (MSR), is introduced to characterize the effect of switch complexity on the reliability of the redundant system. It can be used as an evaluation criterion in the design of practical fault-tolerant multiprocessor architectures. A technique for obtaining the best spare distribution in the MLRT structure is presented.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"599 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116279041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An approach to online error detection and correction for high-throughput VLSI sorting arrays is presented. The error model is defined at the sorting element level, and both functional errors and data errors generated by a faulty element are considered. The functional errors are detected and corrected by exploiting inherent properties of the sorting array, as well as special properties discovered by the authors. Coding techniques and an online fault diagnosis procedure are developed to locate data errors. All the checkers are designed to be totally self-checking, and hence the sorting array is highly reliable. Two-level pipelining is employed in this design, making it very efficient and suitable for real-time applications. The hardware overhead is not significant for typical array sizes, and the time penalty is only three clock cycles. The structure is very regular and therefore very attractive for VLSI or WSI implementation.
{"title":"Concurrent error detection and correction in real-time systolic sorting arrays","authors":"Sheng-Chiech Liang, S. Kuo","doi":"10.1109/FTCS.1990.89398","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89398","url":null,"abstract":"An approach to online error detection and correction for high-throughput VLSI sorting arrays is presented. The error model is defined at the sorting element level, and both functional errors and data errors generated by a faulty element are considered. The functional errors are detected and corrected by exploiting inherent properties of the sorting array, as well as special properties discovered by the authors. Coding techniques and an online fault diagnosis procedure are developed to locate data errors. All the checkers are designed to be totally self-checking, and hence the sorting array is highly reliable. Two-level pipelining is employed in this design, making it very efficient and suitable for real-time application. The hardware overhead is not significant for typical array sizes, and the time penalty is only three clock cycles. The structure is very regular and therefore very attractive for VLSI or WSI implementation.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117309208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of diagnosis and repair has been studied under a probabilistic fault model that allows permanent or intermittent faults and perfect or imperfect spares. For all of these fault scenarios, it has been shown that correct diagnosis and repair can be achieved with high probability in a large class of constant-degree systems, including rings, grids, meshes, and tori. The total number of tests that must be conducted in the worst case in order to accomplish this diagnosis was shown to increase from O(n) in the case in which faults are permanent and spares are perfect to O(n log^2 n) when faults are intermittent and spares are imperfect.
{"title":"Reliable diagnosis and repair in constant-degree multiprocessor systems","authors":"D. Blough, A. Pelc","doi":"10.1109/FTCS.1990.89378","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89378","url":null,"abstract":"The problem of diagnosis and repair has been studied under a probabilistic fault model that allows permanent or intermittent faults and perfect or imperfect spares. For all of these fault scenarios, it has been shown that correct diagnosis and repair can be achieved with high probability in a large class of constant-degree systems, including rings, grids, meshes, and tori. The total number of tests that must be conducted in the worst case in order to accomplish this diagnosis was shown to increase from O(n) in the case in which faults are permanent and spares are perfect to O(n log/sup 2/n) when faults are intermittent and spares are imperfect.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127812093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors discuss the distributed self-diagnosis of a multiprocessor/multicomputer system based on interprocessor tests with imperfect fault coverage (thus also permitting intermittently faulty processors). It is shown that by using multiple fault syndromes, it is possible to achieve significantly better diagnosis than by using a single fault syndrome, even when the amount of time devoted to testing is the same. The authors derive a multiple syndrome diagnosis algorithm that is optimal in the level of diagnostic accuracy achieved (among diagnosis algorithms of a certain type to be defined) and produces good results even with sparse interconnection networks and interprocessor tests with low fault coverage. Furthermore, they prove upper and lower bounds on the number of fault syndromes required to produce an asymptotically 100% correct diagnosis as N tends to infinity. Their solution and another multiple syndrome diagnosis solution by D. Fussell and S. Rangarajan are evaluated both analytically and with simulations.
{"title":"Optimal multiple syndrome probabilistic diagnosis","authors":"Sunggu Lee, K. Shin","doi":"10.1109/FTCS.1990.89379","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89379","url":null,"abstract":"The authors discuss the distributed self-diagnosis of a multiprocessor/multicomputer system based on interprocessor tests with imperfect fault coverage (thus also permitting intermittently fault processors). It is shown that by using multiple fault syndromes, it is possible to achieve significantly better diagnosis than by using a single fault syndrome, even when the amount of time devoted to testing is the same. The authors derive a multiple syndrome diagnosis algorithm that is optimal in the level of diagnostic accuracy achieved (among diagnosis algorithms of a certain type to be defined) and produces good results even with sparse interconnection networks and interprocessor test with low fault coverage. Furthermore, they prove upper and lower bounds are proved on the number of fault syndromes required to produce asymptotically a 100% correct diagnostic as N to infinity . Their solution and another multiple syndrome diagnosis solution by D. Fussell and S. Rangarajan are evaluated both analytically and with simulations.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128077298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
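One way to see why several syndromes can help when test coverage is imperfect is a simple repeated-trial calculation. The sketch below is not the optimal diagnosis algorithm from the abstract above. It assumes a fault-free tester, a faulty testee that is caught by each test with probability `coverage`, and independent outcomes across syndromes. The function names are hypothetical.

```python
def detection_probability(coverage, num_syndromes):
    """Probability that a faulty processor fails at least one of its tests.

    Assumes a fault-free tester and independent test outcomes, each
    catching the fault with probability `coverage` (sketch assumptions).
    """
    return 1.0 - (1.0 - coverage) ** num_syndromes


def syndromes_needed(coverage, target):
    """Smallest number of syndromes whose detection probability reaches target."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    m = 1
    while detection_probability(coverage, m) < target:
        m += 1
    return m


if __name__ == "__main__":
    for c in (0.3, 0.5, 0.9):
        print(c, syndromes_needed(c, 0.99))
```

A real diagnosis procedure must also cope with faulty testers producing unreliable results, which this calculation ignores.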
Reliability and mean-time-to-failure (MTTF) models of different fault-tolerant processor arrays (FTPAs) are introduced. On the basis of these models, approaches which allow for the analytical estimate of the necessary number of spares (NNS) and the optimal number of spares (ONS) are proposed. Knowledge of the NNS is suited to FTPAs where nonredundant hardware (hardware for which no redundancy is provided) is considered nearly fault free. Knowledge of the ONS is useful when faults can affect the nonredundant hardware, because in this case overall array reliability may actually decrease when the number of spares increases beyond some value. The quick estimates provided here can be used to help designers in the early design phases of an FTPA.
{"title":"Estimates of MTTF and optimal number of spares of fault-tolerant processor arrays","authors":"Y. Wang, J. Fortes","doi":"10.1109/FTCS.1990.89354","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89354","url":null,"abstract":"Reliability and mean-time-to-failure (MTTF) models of different fault-tolerant processor arrays (FTPAs) are introduced. On the basis of these models, approaches which allow for the analytical estimate of the necessary number of spares (NNS) and the optimal number of spares (ONS) are proposed. Knowledge of the NNS is suited to FTPAs where nonredundant hardware (hardware for which no redundancy is provided) is considered nearly fault free. Knowledge of the ONS is useful when faults can affect the nonredundant hardware, because in this case overall array reliability may actually decrease when the number of spares increases beyond some value. The quick estimates provided here can be used to help designers in the early design phases of an FTPA.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"214 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125689793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
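The remark in the FTPA abstract that reliability can decrease once the number of spares grows past some value can be illustrated with a toy model. This is not one of the authors' models. It assumes `n` processing elements must work out of `n + spares`, each working independently with probability `p`, and that every element, spare or not, brings along nonredundant support hardware that works with probability `q`. All parameter names and values are hypothetical.

```python
from math import comb


def array_reliability(n, spares, p, q):
    """Toy reliability of an array that needs n working elements.

    p: probability that one processing element works (assumed independent).
    q: probability that the nonredundant hardware attached to one element
       works; all of it must work, so it contributes q ** (n + spares).
    """
    total = n + spares
    at_least_n_work = sum(
        comb(total, i) * p**i * (1 - p) ** (total - i) for i in range(n, total + 1)
    )
    return (q**total) * at_least_n_work


def optimal_spares(n, p, q, max_spares):
    """Number of spares in 0..max_spares that maximizes the toy reliability."""
    return max(range(max_spares + 1), key=lambda s: array_reliability(n, s, p, q))


if __name__ == "__main__":
    best = optimal_spares(16, 0.9, 0.99, 30)
    print(best, array_reliability(16, best, 0.9, 0.99))
```

In this toy setting the first few spares raise reliability sharply, while each further spare mostly adds nonredundant hardware that can itself fail, so the maximum falls at an intermediate spare count.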
Checkpointing and rollback-recovery algorithms in distributed object-based systems are presented. By utilizing the structure of objects and operation invocations, the authors have derived efficient algorithms that involve fewer participants than when invocations are treated as messages and existing algorithms for message-based systems are used. It is planned to implement these algorithms and evaluate their performance in the context of the Clouds project at Georgia Tech.
{"title":"Checkpointing and rollback-recovery in distributed object based systems","authors":"Luke Lin, M. Ahamad","doi":"10.1109/FTCS.1990.89340","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89340","url":null,"abstract":"Checkpointing and rollback-recovery algorithms in distributed object-based systems are presented. By utilizing the structure of objects and operation invocations, the authors have derived efficient algorithms that involve fewer participants than when invocations are treated as messages and existing algorithms for message-based systems are used. It is planned to implement these algorithms and evaluate their performance in the context of the Clouds project at Georgia Tech.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133353134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors present analytic models of the performance of comparison checking (also called back-to-back testing and automatic testing), and they use these models to investigate its effectiveness. A Markov model is used to analyze the observation time required for a test system to uncover a fault using comparison checking. A basis for evaluation is provided by developing a similar Markov model for the analysis of ideal checking, i.e., using a perfect (though unrealizable) oracle. Also presented is a model of the effect of comparison checking on a version's failure probability as testing proceeds. Again, comparison checking is evaluated against ideal checking. The analyses show that comparison checking is a powerful and effective technique.
{"title":"On the performance of software testing using multiple versions","authors":"S. Brilliant, J. Knight, P. Ammann","doi":"10.1109/FTCS.1990.89395","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89395","url":null,"abstract":"The authors present analytic models of the performance of comparison checking (also called back-to-back testing and automatic testing), and they use these models to investigate its effectiveness. A Markov model is used to analyze the observation time required for a test system to uncover a fault using comparison checking. A basis for evaluation is provided by developing a similar Markov model for the analysis of ideal checking, i.e. using a perfect (through unrealizable) oracle. Also presented is a model of the effect of comparison checking on a version's failure probability as testing proceeds. Again, comparison checking is evaluated against ideal checking. The analyses show that comparison checking is a powerful and effective technique.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134439266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
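The comparison between comparison checking and ideal checking in the abstract above can be approximated, very roughly, by a geometric waiting-time argument. This is a simplification and not the authors' Markov model. It assumes independent test cases, a fixed per-test failure probability `p_fail` for the version under test, and a hypothetical probability `p_masked` that a failure goes unnoticed because the other versions produce the same wrong output. Ideal checking corresponds to `p_masked = 0`.

```python
def expected_tests_to_detect(p_fail, p_masked=0.0):
    """Expected number of test cases until a failure is first observed.

    p_fail:   per-test probability that the version under test fails.
    p_masked: probability that a failure is hidden by identical wrong
              outputs from the other versions (0 for a perfect oracle).
    Both parameters are assumptions of this sketch.
    """
    p_detect = p_fail * (1.0 - p_masked)
    if p_detect <= 0.0:
        raise ValueError("detection probability must be positive")
    return 1.0 / p_detect  # mean of a geometric distribution


if __name__ == "__main__":
    ideal = expected_tests_to_detect(0.01)
    comparison = expected_tests_to_detect(0.01, p_masked=0.2)
    print(ideal, comparison, comparison / ideal)
```

Under these assumptions the penalty for lacking an oracle is the factor `1 / (1 - p_masked)`, which stays small as long as coincident identical failures are rare.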
A. Das, D. Saha, A. R. Chowdhury, S. Misra, P. P. Chaudhuri
A novel scheme for signature analysis based on cellular automata (CA) is proposed. The state transition behavior of such signature analyzers has been modeled by a Markov chain. It has been shown that a special class of such CAs achieves a steady-state aliasing probability lower than 1/2^n (for an n-cell CA) for specific ranges of input probabilities of the incoming error pattern. The dynamic behavior of linear feedback shift registers (LFSRs) has also been compared with that of CAs with the same characteristic polynomials. This work establishes that CA-based signature analyzers outperform those based on LFSRs with regard to both steady-state and dynamic behavior.
{"title":"Signature analysers based on additive cellular automata","authors":"A. Das, D. Saha, A. R. Chowdhury, S. Misra, P. P. Chaudhuri","doi":"10.1109/FTCS.1990.89374","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89374","url":null,"abstract":"A novel scheme for signature analysis based on cellular automata (CA) is proposed. The state transition behavior of such signature analyzers has been modeled by Markov chain. It has been shown that a special class of such CAs achieves a steady-state aliasing probability lower than 1/2/sup n/ (for an n-cell CA) for specific ranges of input probabilities of the incoming error pattern. The dynamic behavior of linear feedback shift registers (LFSRs) has also been compared with CAs with the same characteristic polynomials. This work establishes the fact that CA-based signature analyzers outperform those based on LFSRs as regards both steady-state and dynamic behavior.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134501351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
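To make the two kinds of signature analyzer in the abstract above concrete, here is a minimal sketch of a serial-input LFSR signature register and a serial-input additive CA built from rule 90 and rule 150 cells with null boundaries. The feedback polynomial, the rule vector, and the choice to inject the input at cell 0 are illustrative assumptions and are not taken from the paper. The sketch says nothing about aliasing probabilities.

```python
def lfsr_signature(bits, n, taps):
    """Signature of a bit stream in an n-bit LFSR signature register.

    taps holds the low-order coefficients of the feedback polynomial,
    e.g. 0b0011 for x^4 + x + 1 when n == 4 (an illustrative choice).
    """
    mask = (1 << n) - 1
    state = 0
    for b in bits:
        feedback = (state >> (n - 1)) & 1  # bit about to be shifted out
        state = ((state << 1) & mask) ^ (b & 1)
        if feedback:
            state ^= taps  # reduce modulo the feedback polynomial
    return state


def ca_signature(bits, rules):
    """Signature of a bit stream in an additive CA with null boundaries.

    rules is a list of 90 or 150, one entry per cell:
      rule 90:  next = left XOR right
      rule 150: next = left XOR self XOR right
    The input bit is XORed into cell 0 (an assumption of this sketch).
    """
    n = len(rules)
    state = [0] * n
    for b in bits:
        nxt = []
        for i, rule in enumerate(rules):
            left = state[i - 1] if i > 0 else 0
            right = state[i + 1] if i < n - 1 else 0
            cell = left ^ right
            if rule == 150:
                cell ^= state[i]
            nxt.append(cell)
        nxt[0] ^= b & 1
        state = nxt
    return state


if __name__ == "__main__":
    stream = [1, 0, 1, 1, 0, 0, 1, 0]
    print(lfsr_signature(stream, 4, 0b0011))
    print(ca_signature(stream, [90, 150, 90, 150]))
```

Both registers are linear over GF(2), so the signature of the XOR of two streams equals the XOR of their signatures. Aliasing occurs exactly when a nonzero error stream produces the all-zero signature.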
Two aspects of the impact of reconfiguration logic on the optimization of defect-tolerant integrated circuits (ICs) are analyzed. An important consequence for design decisions of neglecting reconfiguration logic is presented. Expressions are developed to predict the number of transistors necessary to implement the reconfiguration logic of a simple defect-tolerance strategy using CMOS technology. The results show that neglecting this reconfiguration logic can lead to inappropriate design decisions. An example of a fine-grain logic array is presented to demonstrate this conclusion.
{"title":"Impact of reconfiguration logic on the optimization of defect-tolerant integrated circuits","authors":"C. Thibeault, J. Houle","doi":"10.1109/FTCS.1990.89351","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89351","url":null,"abstract":"Two aspects of the impact of reconfiguration logic on the optimization of defect-tolerant integrated circuits (ICs) are analyzed. An important consequence to design decisions of neglecting reconfiguration logic is presented. Expressions are developed to predict the number of transistors necessary to implement the reconfiguration logic of a simple defect-tolerance strategy using CMOS technology. The results show that neglecting this reconfiguration logic can lead to inappropriate design decisions. An example of a fine-grain logic array is presented to demonstrate the latter conclusion.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"47 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122433342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}