Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466968
E. Fujiwara, M. Kitakami
Error control codes are now being successfully applied to computer systems, especially to memory systems. This paper proposes a new class of error control codes to protect the fixed-byte in computer words from errors. The fixed-byte stores valuable and important information such as control and address information in communication messages or pointer information in database words. A 'fixed-byte' is a cluster of information digits in the word whose position is determined in advance. As a simple class of these unequal error protection codes, this paper proposes two types of optimal fixed-byte error protection codes: single-bit error correction and fixed b-bit byte error correction (SEC-FbEC) codes, and single-bit error correction, double-bit error detection, and fixed b-bit byte error detection (SEC-DED-FbED) codes. For example, the optimal SEC-FbEC codes obtained for byte length b=7 bits and information length k=64 bits require only 8 check bits, the same as conventional SEC-DED codes with k=64 bits.
Title: A class of optimal fixed-byte error protection codes for computer systems
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
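The check-bit figures quoted in this abstract can be sanity-checked with a simple syndrome-counting argument. The sketch below is not the authors' code construction. It only assumes that every correctable error pattern needs its own nonzero syndrome: any nonzero error confined to the fixed b-bit byte (2^b - 1 patterns) plus any single-bit error outside that byte (n - b patterns, with n = k + r). That gives the necessary condition 2^r - 1 >= (2^b - 1) + (n - b). The function names are hypothetical.

```python
def min_check_bits_sec(k: int) -> int:
    """Smallest r with 2**r >= k + r + 1 (Hamming condition for single-bit correction)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


def min_check_bits_sec_ded(k: int) -> int:
    """SEC-DED needs one check bit more than SEC (extended Hamming / Hsiao style)."""
    return min_check_bits_sec(k) + 1


def min_check_bits_sec_fbec_bound(k: int, b: int) -> int:
    """Counting lower bound on check bits for SEC plus fixed b-bit byte correction.

    Assumption: the fixed byte lies inside the k information bits, and each
    correctable pattern maps to a distinct nonzero syndrome.
    """
    r = 1
    while 2 ** r - 1 < (2 ** b - 1) + (k + r - b):
        r += 1
    return r


if __name__ == "__main__":
    print(min_check_bits_sec_ded(64))            # conventional SEC-DED, k = 64
    print(min_check_bits_sec_fbec_bound(64, 7))  # counting bound for b = 7, k = 64
```

For k=64 and b=7 both functions return 8, which is consistent with the abstract's claim. The bound is only a necessary condition; whether a code meeting it exists is what the paper's construction establishes.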
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466950
M. Peercy, P. Banerjee
Ideally, a multicomputer system should cope with a processor failure by reconstructing itself, and the application running on it, in order to maintain the available computational power of the remaining processors. We discuss the continuance of running applications through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/s hypercube. After discussing the theory and implementation, we give measurements of the overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
Title: Software schemes of reconfiguration and recovery in distributed memory multicomputers using the actor model
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466987
A. Olson, K. Shin, B. Jambor
We present a probabilistic synchronization algorithm which sends periodic synchronization messages, instead of periodic bursts of synchronization messages as other algorithms do. Our "continuous" approach therefore avoids the burst network loads of other algorithms. Nodes always have current estimates of other nodes' clocks, allowing them to monitor the state of system synchronization, and adjust their clocks as needed. The algorithm is fault-tolerant, and may be easily adapted to a wide variety of systems and networks. We analyze and simulate the algorithm's performance on a 64-node hypercube, and show that the algorithm provides tight synchronization while imposing only a light load on the network.
Title: Fault-tolerant clock synchronization for distributed systems using continuous synchronization messages
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
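The idea of keeping a running estimate of every peer's clock from periodic timestamped messages can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the class and method names are hypothetical, message delay is taken as a fixed `assumed_delay`, clocks are assumed to run at equal rates between messages, and a trimmed average (a standard fault-tolerant averaging technique) is swapped in for the paper's probabilistic analysis.

```python
from dataclasses import dataclass


@dataclass
class PeerSample:
    remote_timestamp: float    # clock value the peer stamped into its message
    local_receive_time: float  # our own clock when that message arrived


class ClockEstimator:
    def __init__(self, assumed_delay: float = 0.0):
        self.assumed_delay = assumed_delay  # hypothetical fixed network delay
        self.peers: dict[str, PeerSample] = {}

    def on_sync_message(self, peer: str, remote_timestamp: float, local_now: float) -> None:
        """Record the latest periodic synchronization message from a peer."""
        self.peers[peer] = PeerSample(remote_timestamp, local_now)

    def estimated_offset(self, peer: str) -> float:
        """Estimated (peer clock - local clock), assuming equal clock rates."""
        s = self.peers[peer]
        return s.remote_timestamp + self.assumed_delay - s.local_receive_time

    def correction(self, max_faulty: int) -> float:
        """Trimmed mean of offsets (own offset 0 included), dropping the
        max_faulty smallest and largest values to mask faulty clocks."""
        offsets = sorted([0.0] + [self.estimated_offset(p) for p in self.peers])
        if len(offsets) <= 2 * max_faulty:
            raise ValueError("not enough clock estimates to tolerate that many faults")
        kept = offsets[max_faulty:len(offsets) - max_faulty]
        return sum(kept) / len(kept)
```

A node would call `on_sync_message` whenever a message arrives and apply `correction` to its own clock as needed; because the estimates are always current, no burst of messages is required before adjusting.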
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466988
R. Buskens, R. Bianchini
The paper presents RatchetFT, a distributed fault-tolerant mutual exclusion algorithm for processor rings. RatchetFT is self-stabilizing, in that if mutual exclusion is lost due to any sequence of online failures and repairs of processors, mutual exclusion will eventually be regained. This research demonstrates that self-stabilization can be achieved in the presence of faulty processors, provided that these faulty processors always appear to behave incorrectly. Self-stabilization is achievable even if faulty processor behavior is not restricted to transient failures or other simple failure models. The key results of the paper include the specification of RatchetFT and a detailed sketch of its correctness proof.
Title: Self-stabilizing mutual exclusion in the presence of faulty nodes
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
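To make the notion of self-stabilizing mutual exclusion on a ring concrete, the sketch below implements Dijkstra's classic K-state token ring. It is not RatchetFT and it does not handle the permanently faulty processors this paper addresses; it only shows how a ring started in an arbitrary state converges to exactly one privilege (token). The scheduler that always moves the lowest-indexed privileged machine is an assumption made for the demonstration.

```python
def privileged(states, i):
    """True if machine i holds a privilege in Dijkstra's K-state ring."""
    if i == 0:
        return states[0] == states[-1]   # machine 0 compares with the last machine
    return states[i] != states[i - 1]    # others compare with their left neighbour


def step(states, i, k):
    """Let privileged machine i make its move (mutates states in place)."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, max_steps=10_000):
    """Move one privileged machine at a time until exactly one privilege remains.

    Assumes every state is in range(k) and k >= len(states).
    Returns the number of moves made.
    """
    for moves in range(max_steps):
        holders = [i for i in range(len(states)) if privileged(states, i)]
        if len(holders) == 1:
            return moves
        step(states, holders[0], k)
    raise RuntimeError("did not stabilize within max_steps")


if __name__ == "__main__":
    ring = [3, 1, 4, 1]      # arbitrary, possibly corrupted, starting state
    stabilize(ring, k=5)
    print(ring)              # a state with exactly one privileged machine
```

Once a single privilege remains, each further move passes it to the next machine around the ring, so mutual exclusion is preserved from then on.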
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466946
Ronald Riter
The paper discusses modeling and fault insertion testing of the Boeing 777 "fly-by-wire" Primary Flight Computer (PFC) system. The 777 PFC was modeled to perform a behavior analysis. The simulation model includes all systems communicating with the PFCs. The simulation environment allows errors to be injected into the communication portion of the model and into selected PFC internal variables. The model is used to test the system response to errors in the PFC input data and to PFC internal errors. The behavior analysis tests have been chosen to stress the fault-tolerant design and to investigate PFC anomalies encountered during either laboratory tests or flight tests. The effects of both input and PFC internal errors were studied, and the effects of asynchronous communication were examined. The paper is organized as follows: (1) an introduction, which briefly describes both the airplane "fly-by-wire" features and the simulation; (2) a PFC description, which gives more details about the PFC; (3) the failure model; (4) a simulation description, which describes the simulation environment and facilities; (5) fault-tolerant testing, which gives some examples; and (6) a summary.
Title: Modeling and testing a critical fault-tolerant multi-process system
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466992
I. Pomeranz, S. Reddy
We present a method to generate test sequences that detect large numbers of faults (close to or higher than the number of faults that can be detected by deterministic methods) at a cost that is significantly lower than that of any existing test generation procedure. The generated sequences can be used alone or as prefixes to deterministic test sequences. To generate the sequences, we study the test sequences generated by several deterministic test generation procedures. We show that when deterministic test sequences are applied, the fault-free circuits go through sequences of state transitions that have distinct characteristics which are independent of the specific circuit considered. Test sequences with the same characteristics are generated by using logic simulation only on the fault-free circuit and considering several random patterns as candidates for inclusion in the test sequence at every time unit. By fault simulating these sequences, we find that the fault coverage achieved is very close to the fault coverage achieved by deterministic sequences and sometimes even higher. In most cases the fault coverage is higher than the fault coverage achieved by nondeterministic procedures based on genetic optimization.
Title: LOCSTEP: a logic simulation based test generation procedure
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.467000
H. Vin, P. Shenoy, Sriram Rao
In this paper, we present a novel disk failure recovery method that utilizes the inherent redundancy in video streams (rather than error-correcting codes) to ensure that the user-invoked on-the-fly failure recovery process does not impose any additional load on the disk array. We also present a disk array architecture that enhances the scalability of multimedia servers by: (1) integrating the recovery process with the decompression of video streams, thereby distributing the reconstruction process across the clients; and (2) supporting graceful degradation in the quality of recovered images as the number of disk failures increases.
Title: Efficient failure recovery in multi-disk multimedia servers
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466972
M. Balakrishnan, Kishor S. Trivedi
Fault trees and Markov chains are commonly used for dependability modeling. Markov chains are powerful in that they can easily model various kinds of dependencies that fault tree models have difficulty capturing, but their state space grows exponentially in the number of components. Fault tree models are adequate for computing the reliability of nonrepairable systems, but a state space description becomes necessary for repairable systems due to induced dependencies (even when all failure and repair processes are otherwise independent). We demonstrate that a decomposition approach can be used to avoid a full-system Markov reliability model for repairable systems with independent failure and repair processes. For an n-component system, n 3-state sub-models can replace a full-system monolithic model. This is an approximation because the parameters used in the sub-model are approximately derived from the monolithic model.
Title: Componentwise decomposition for an efficient reliability computation of systems with repairable components
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
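The saving promised by the decomposition can be seen from a simple state count. The sketch below assumes, for illustration only, that each component in the monolithic Markov model is either up or down, giving 2^n system states, whereas the decomposition uses n sub-models of 3 states each. The function names are hypothetical and the abstract does not spell out the sub-model structure.

```python
def monolithic_states(n: int, states_per_component: int = 2) -> int:
    """States in a full-system Markov model (assumption: up/down per component)."""
    return states_per_component ** n


def decomposed_states(n: int, states_per_submodel: int = 3) -> int:
    """Total states across n independent sub-models of fixed size."""
    return n * states_per_submodel


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, monolithic_states(n), decomposed_states(n))
```

At n=20 the monolithic model already has 1,048,576 states against 60 for the decomposed models, which is why the approximation is attractive despite the derived parameters being inexact.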
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466965
P. J. Thadikaran, S. Chakravarty, J. Patel
The notion of indistinguishable pairs is introduced. Two methods to compute such pairs, an explicit scheme and an implicit scheme, are presented. The resulting fault simulation algorithms, the list-based scheme and the tree-based scheme, are compared using a variety of fault lists and test sets. The performance of the tree-based scheme is found to be superior to that of the list-based scheme. Applications where the list-based scheme performs better are discussed.
Title: Fault simulation of I_DDQ tests for bridging faults in sequential circuits
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
Pub Date: 1995-06-27; DOI: 10.1109/FTCS.1995.466959
J. Bright, G. Sullivan, G. Masson
We describe a general approach for checking the integrity of data structures corrupted by memory faults. Our approach is based on a recursive checksum technique. Basic methods of using checksums have previously been seen to be useful for detecting faults at the bit or word level; among our results is their extension to the node level. The major contributions of our paper are threefold. First, we show how the recursive checksum procedure can be applied to tree data structures that are dynamically changing, whereas previous work concentrated on trees that were static in their structure. This results in an asymptotic improvement in running time for applications where it is natural to model the underlying data as a tree. Second, we present a C++ implementation of this scheme. Significantly, our software can be used with existing applications that manipulate trees with only minor modification of the application programs. Finally, we have performed fault injection experiments which confirm the fault detection capability of our integrity checking approach.
Title: Checking the integrity of trees
Venue: Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
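A recursive checksum over a dynamically changing tree might look roughly like the sketch below. It is an illustration under assumptions, not the authors' C++ implementation: the `Node` class and `verify` function are hypothetical, and SHA-256 stands in for whatever checksum function the paper uses. Each node's checksum covers its own value and its children's checksums, so a legitimate update only needs to recompute the nodes on the path to the root.

```python
import hashlib


class Node:
    def __init__(self, value: bytes, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self.checksum = b""
        if parent is not None:
            parent.children.append(self)
        self._refresh_upward()

    def _compute(self) -> bytes:
        """Checksum of this node's value combined with its children's checksums."""
        h = hashlib.sha256(self.value)
        for child in self.children:
            h.update(child.checksum)
        return h.digest()

    def _refresh_upward(self) -> None:
        """Recompute checksums only along the path from this node to the root."""
        node = self
        while node is not None:
            node.checksum = node._compute()
            node = node.parent

    def set_value(self, value: bytes) -> None:
        """Legitimate update: change the value, then repair the checksums."""
        self.value = value
        self._refresh_upward()


def verify(node: Node) -> bool:
    """Check every node's stored checksum against a fresh recomputation."""
    return node.checksum == node._compute() and all(verify(c) for c in node.children)
```

Writing to `node.value` directly, bypassing `set_value`, simulates a memory fault: the stored checksum no longer matches and `verify` on the root returns `False`.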