H. Kopetz, H. Kantz, G. Grünsteidl, P. Puschner, J. Reisinger
The concepts of transient fault handling in the MARS architecture are discussed. After an overview of the MARS architecture, the mechanisms for detecting transient faults are described in detail. In addition to extensive checks in the hardware and in the operating system, time-redundant execution of application tasks is proposed for the detection of transient faults. The time difference between the effective and the maximum execution time of an application task is used for this purpose. Whenever a transient fault has been detected, the affected component is turned off and reintegrated immediately by retrieving the uncorrupted state of the actively redundant partner component. To reduce the probability of spare exhaustion (in the case of permanent faults), 'shadow components' are introduced. The reliability improvement that can be realized by these techniques is calculated by detailed reliability models of the architecture, whose parameters are based on experimental results measured on the present MARS prototype implementation.
{"title":"Tolerating transient faults in MARS","authors":"H. Kopetz, H. Kantz, G. Grünsteidl, P. Puschner, J. Reisinger","doi":"10.1109/FTCS.1990.89384","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89384","url":null,"abstract":"The concepts of transient fault handling in the MARS architecture are discussed. After an overview of the MARS architecture, the mechanisms for the detection of transient faults are discussed in detail. In addition to extensive checks in the hardware and in the operating system, time-redundant execution of application tasks is proposed for the detection of transient faults. The time difference between the effective and the maximum execution time of an application task is used for this purpose. Whenever a transient fault has been detected, the affected component is turned off and reintegrated immediately by retrieving the uncorrupted state of the actively redundant partner component. In order to reduce the probability of spare exhaustion (in the case of permanent faults) 'shadow components' are introduced. The reliability improvement, which can be realized by these techniques, is calculated by detailed reliability models of the architecture, where the parameters are based on experimental results measured on the present MARS prototype implementation.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122774014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
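As a rough illustration of the time-redundant execution mentioned in the abstract above, the following Python sketch runs a task twice and treats differing results as a suspected transient fault. It is a minimal sketch under stated assumptions, not the MARS mechanism: the names `run_time_redundant`, `effective_time`, and `wcet_budget` are hypothetical, and the rule that a second run is attempted only when two runs fit within the maximum execution time is an assumed reading of how the slack is used.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_time_redundant(
    task: Callable[[], T],
    effective_time: float,
    wcet_budget: float,
) -> Tuple[T, Optional[bool]]:
    """Run `task` and, if the slack allows, run it again and compare.

    Returns (result, fault_suspected). fault_suspected is None when there
    was not enough slack for a second run, so no check was possible.
    All names and the slack rule are illustrative assumptions.
    """
    first = task()
    # Assumed rule: a second execution only fits if two runs stay
    # within the maximum execution time reserved for the task.
    if 2 * effective_time > wcet_budget:
        return first, None
    second = task()
    # Differing results from identical inputs suggest a transient fault.
    return first, first != second
```

A caller would pass a side-effect-free task, so that repeating it is safe, and would trigger recovery of the component whenever the second element of the result is `True`.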
The authors describe a practical method for realizing fault-tolerant global control of resources in distributed computing systems. The method is particularly suitable for systems that are based on a centralized arbiter for making control decisions. Many applications in LAN-based computing, online transactions, and telecommunication systems fall into this category. The method exploits the inherent physical separation of distributed computing systems to achieve high reliability in the face of decentralized arbiter failures. A significant feature of the method is that the fault-tolerance mechanisms are embedded in the normal control signal flow, so that the overhead is practically negligible in the absence of faults. The principles behind the method, its internal structure, and its operations are explained. The experience gained through its application is also discussed.
{"title":"A fault-tolerant strategy for hierarchical control in distributed computing systems","authors":"P. Goyer, Parham Momtahan, B. Selić","doi":"10.1109/FTCS.1990.89343","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89343","url":null,"abstract":"The authors describe a practical method for realizing fault-tolerant global control of resources in distributed computing systems. The method is particularly suitable for systems that are based on a centralized arbiter for making control decisions. Many applications in LAN-based computing, online transactions, and telecommunication systems fall into this category. The method exploits the inherent physical separation of distributed computing systems to achieve high reliability in the face of decentralized arbiter failures. A significant feature of the method is that the fault-tolerance mechanisms are imbedded in the normal control signal flow so that the overhead is practically negligible in the absence of faults. The principles behind the method, its internal structure, and its operations are explained. Also, the experience gained through its application is discussed.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132858672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Codes capable of correcting burst asymmetric and unidirectional errors are described. The proposed codes need approximately b + log₂ k check bits to correct a burst of b asymmetric/unidirectional errors, where k is the number of information bits. In most cases, the proposed codes require fewer check bits than the equivalent burst symmetric error-correcting codes. The optimality of the codes is also considered. In addition, efficient codes capable of detecting double burst unidirectional errors are given.
{"title":"Burst asymmetric/unidirectional error correcting/detecting codes","authors":"Seungjin Park, B. Bose","doi":"10.1109/FTCS.1990.89375","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89375","url":null,"abstract":"Codes capable of correcting burst asymmetric and unidirectional errors are described. The proposed codes need approximately b+log/sub 2/k check bits to correct a burst of b asymmetric/unidirectional errors, where k is the number of information bits. In most cases, the proposed codes require fewer check bits than the equivalent burst symmetric error-correcting codes. The optimality of the codes is also considered. In addition, efficient codes capable of detecting double burst unidirectional errors are given.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121526637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
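To make the check-bit estimate in the abstract above concrete, the following Python sketch evaluates b + log₂ k for a few code sizes. The function name `approx_check_bits` is hypothetical, and rounding log₂ k up to a whole bit is an assumption of this sketch; the abstract only states the count as approximate.

```python
import math


def approx_check_bits(b: int, k: int) -> int:
    """Approximate check-bit count b + log2(k) from the abstract.

    Rounding log2(k) up to a whole bit is an assumption of this sketch.
    """
    if b < 1 or k < 1:
        raise ValueError("b and k must be positive")
    return b + math.ceil(math.log2(k))


if __name__ == "__main__":
    # Burst length fixed at 4, information length varied.
    for k in (16, 64, 1024):
        print(f"k={k:5d}, b=4 -> about {approx_check_bits(4, k)} check bits")
```

Under this estimate, the check-bit count grows only logarithmically with the number of information bits for a fixed burst length.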
The author presents a method for detecting anomalous events in communication networks and other similarly characterized environments in which performance anomalies are indicative of failure. The methodology, based on automatically learning the difference between normal and abnormal behavior, has been implemented as part of an automated diagnosis system from which performance results are drawn and presented. The dynamic nature of the model enables a diagnostic system to deal with continuously changing environments without explicit control, reacting to the way the world is now, as opposed to the way the world was planned to be. Results of successful deployment in a noisy, real-time monitoring environment are shown.
{"title":"Anomaly detection for diagnosis","authors":"R. Maxion","doi":"10.1109/FTCS.1990.89362","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89362","url":null,"abstract":"The author presents a method for detecting anomalous events in communication networks and other similarly characterized environments in which performance anomalies are indicative of failure. The methodology, based on automatically learning the difference between normal and abnormal behavior, has been implemented as part of an automated diagnosis system from which performance results are drawn and presented. The dynamic nature of the model enables a diagnostic system to deal with continuously changing environments without explicit control, reaching to the way the world is now, as opposed to the way the world was planned to be. Results of successful deployment in a noisy, real-time monitoring environment are shown.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"517 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115348785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Control flow checking techniques are discussed. Invariant properties of the control flow can be checked at two different levels: verification of the sequencing in the controller of the microprocessor, or verification of the control flow in the application program. Control flow checking has been implemented, at both levels, in different versions of a 32-bit microprocessor designed in a 1.5-μm CMOS technology. Integration of the monitors on silicon is detailed. The silicon overhead due to the different online test devices is discussed in detail. Different versions of this microprocessor have been designed and implemented in order to make real cost comparisons on components with identical functionality but different integrated monitors. Only the hardware cost of concurrent checking is considered here.
{"title":"Design of microprocessors with built-in on-line test","authors":"R. Leveugle, T. Michel, G. Saucier","doi":"10.1109/FTCS.1990.89381","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89381","url":null,"abstract":"Control flow checking techniques are discussed. Invariant properties of the control flow can be checked at two different levels: verification of the sequencing in the controller of the microprocessor or verification of the control flow in the application program. Control flow checking has been implemented, at the two levels, in different versions of a 32-b microprocessor designed in a CMOS 1.5- mu technology. Integration of the monitors on silicon is detailed. The silicon overhead due to the different online test devices is precisely discussed. Different versions of this microprocessor have been designed and implemented in order to make real cost comparisons on components with identical functionality but different integrated monitors. Here only the hardware cost of concurrent checking is considered.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115338688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A synthesis procedure for self-testable finite state machines is presented. Testability comes under consideration when the behavioral description of the circuit is being transformed into a structural description. To this end, a novel state encoding algorithm, as well as a modified self-test architecture, is developed. Experimental results show that this approach leads to a significant reduction of hardware overhead. Self-testing circuits generally employ linear feedback shift registers for pattern generation. The impact of choosing a particular feedback polynomial on the state encoding is discussed.
{"title":"Optimized synthesis of self-testable finite state machines","authors":"B. Eschermann, H. Wunderlich","doi":"10.1109/FTCS.1990.89393","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89393","url":null,"abstract":"A synthesis procedure for self-testable finite state machines is presented. Testability comes under consideration when the behavioral description of the circuit is being transformed into a structural description. To this end, a novel state encoding algorithm, as well as a modified self-test architecture, is developed. Experimental results show that this approach leads to a significant reduction of hardware overhead. Self-testing circuits generally employ linear feedback shift registers for pattern generation. The impact of choosing a particular feedback polynomial on the state encoding is discussed.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114862303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
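The abstract above notes that self-testing circuits generally use linear feedback shift registers (LFSRs) for pattern generation and that the choice of feedback matters. The following Python sketch is not the paper's state encoding algorithm; it only illustrates, for a hypothetical 4-bit Fibonacci LFSR, how two different feedback tap sets produce pattern cycles of different lengths. The function name `lfsr_period` and the tap convention (bit positions XORed into the bit shifted in at the top) are assumptions of this sketch.

```python
from typing import Tuple


def lfsr_period(taps: Tuple[int, ...], width: int = 4, seed: int = 1) -> int:
    """Return the number of steps until a Fibonacci LFSR revisits `seed`.

    `taps` are the bit positions XORed together to form the feedback bit,
    which is shifted in at the most significant end of the register.
    """
    state = seed
    # A register of `width` bits has at most 2**width distinct states.
    for steps in range(1, 2 ** width + 1):
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (width - 1))
        if state == seed:
            return steps
    raise ValueError("seed state is never revisited with these taps")


if __name__ == "__main__":
    print(lfsr_period((0, 1)))  # cycles through all 15 nonzero states
    print(lfsr_period((0, 2)))  # a much shorter cycle of 6 states
```

With taps `(0, 1)` the register visits every nonzero 4-bit pattern before repeating, while taps `(0, 2)` reach only 6 patterns from the same seed, which is one way to see why the feedback choice affects what a self-test structure can generate.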
The Advanced Automation System (AAS), a distributed real-time system intended to replace the present en-route and terminal approach US air traffic control computer systems over the next decade, is discussed. High availability of air traffic control services is an essential requirement of the system. The authors discuss the general approach to fault tolerance adopted in the AAS by reviewing some of the questions asked during the system design, various alternative solutions considered, and the reasons for the design choices made.
{"title":"Fault-tolerance in the Advanced Automation System","authors":"F. Cristian, Bob Dancey, Jonathan Dehn","doi":"10.1145/504136.504156","DOIUrl":"https://doi.org/10.1145/504136.504156","url":null,"abstract":"The Advanced Automation System (AAS), a distributed real-time system intended to replace the present en-route and terminal approach US air traffic control computer systems over the next decade, is discussed. High availability of air traffic control services is an essential requirement of the system. The authors discuss the general approach to fault tolerance adopted in the AAS by reviewing some of the questions asked during the system design, various alternative solutions considered, and the reasons for the design choices made.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115118751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Nijhuis, B. Höfflinger, A. V. Schaik, L. Spaanenburg
Input data and hardware fault tolerance of neural networks are discussed. It is shown that fault-tolerant behavior is not self-evident but must be activated by an appropriate learning scheme. Practical limitations are demonstrated by an example of neural character recognition. The results show that the effects of learning and synapse weight decay on fault tolerance largely influence the practicality of large-scale silicon implementations. It is anticipated that, owing to implementation issues, such as the use of volatile memories, some neural VLSI architectures will not be sufficiently fault tolerant.
{"title":"Limits to the fault-tolerance of a feedforward neural network with learning","authors":"J. Nijhuis, B. Höfflinger, A. V. Schaik, L. Spaanenburg","doi":"10.1109/FTCS.1990.89370","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89370","url":null,"abstract":"Input data and hardware fault tolerance of neural networks are discussed. It is shown that fault-tolerant behavior is not self-evident but must be activated by an appropriate learning scheme. Practical limitations are demonstrated by an example of neural character recognition. The results show that the effects of learning and synapse weight decay on fault tolerance largely influence the practicality of large-scale silicon implementations. It is anticipated that, owing to implementation issues, such as the use of volatile memories, some neural VLSI architectures will not be sufficiently fault tolerant.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115523442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A method is presented for detecting stuck-open faults, as well as stuck-at faults, in CMOS combinational circuits by short test sequences of fixed length. The discussion is based on the assumption that the outputs of all the gates in a circuit are observable. This assumption becomes reasonable when a new testability solution called CrossCheck, or a new kind of test equipment called an electron-beam tester, is used. The concept of k-UCP (uniform, having a (k+1)-color solution and compatible polarity) circuits is introduced, and it is shown that 2(k+1) kinds of test sequences of length k(k+1)+1 are sufficient to detect stuck-open faults, as well as stuck-at faults, in a k-UCP circuit. Furthermore, it is shown that single stuck-open faults can be located by using a fault diagnosis table. A method that speeds up the generation of a fault diagnosis table is also proposed.
{"title":"Fault detection and diagnosis of k-UCP circuits under totally observable condition","authors":"X. Wen, K. Kinoshita","doi":"10.1109/FTCS.1990.89392","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89392","url":null,"abstract":"A method is presented for detecting stuck-open faults, as well as stuck-at faults, in CMOS combinational circuits by short test sequences of fixed length. The discussion is based on the assumption that outputs of all the gates in a circuit are observable. This assumption will become reasonable when a new testability solution called CrossCheck, or a new test equipment, called on electron-beam tester, is used. The concept of k-UCP (uniform, having a (k+1)-Color solution and compatible polarity) circuits is introduced, and it is shown that 2(k+1) kinds of test sequences of length k(k+1)+1 are sufficient to detect stuck-open faults, as well as stuck-at faults in a k-UCP circuit. Furthermore, it is shown that single stuck-open faults can be located by using a fault diagnosis table. A method which can speed up the generation of a fault diagnosis table is also proposed.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126730924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
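The test-sequence bound stated in the abstract above is simple arithmetic, and the following Python sketch evaluates it for small k. The function name `k_ucp_test_budget` is hypothetical, and restricting k to positive integers is an assumption of this sketch; the formulas 2(k+1) and k(k+1)+1 are taken directly from the abstract.

```python
from typing import Tuple


def k_ucp_test_budget(k: int) -> Tuple[int, int]:
    """Return (kinds_of_sequences, sequence_length) for a k-UCP circuit.

    Uses the counts quoted in the abstract: 2(k+1) kinds of test
    sequences, each of length k(k+1)+1.
    """
    if k < 1:
        raise ValueError("k is assumed to be a positive integer")
    return 2 * (k + 1), k * (k + 1) + 1


if __name__ == "__main__":
    for k in (1, 2, 3):
        kinds, length = k_ucp_test_budget(k)
        print(f"k={k}: {kinds} kinds of sequences, each of length {length}")
```

For k = 2, for example, this gives 6 kinds of sequences of length 7, independent of the number of gates in the circuit.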
The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources, even though these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines, which indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
{"title":"Failure analysis and modeling of a VAXcluster system","authors":"D. Tang, R. Iyer, Sujatha S. Subramani","doi":"10.1109/FTCS.1990.89372","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89372","url":null,"abstract":"The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources. This is despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 250 days of operation.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"697 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133167034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
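To put the availability figure from the VAXcluster abstract in perspective, the following Python sketch converts it into expected downtime. It assumes that availability is read as the fraction of the observation period during which the system was up; the function name `expected_downtime_days` is hypothetical.

```python
def expected_downtime_days(availability: float, period_days: float) -> float:
    """Downtime implied by an availability over a given period.

    Assumes availability is the fraction of the period the system was up.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * period_days


if __name__ == "__main__":
    days = expected_downtime_days(0.993, 250)
    print(f"about {days:.2f} days, or {days * 24:.0f} hours, of downtime")
```

Under that reading, an availability of 0.993 over 250 days corresponds to roughly 1.75 days, or about 42 hours, of unavailability.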