Three cache-aided error-recovery algorithms for use in shared-memory multiprocessor systems are presented. They rely on hardware and specially designed cache memory for all their soft error management operations and can be easily incorporated into existing cache-coherence protocols. An example illustrating their use in a multiprocessor system employing Dragon as its cache-coherence protocol is given, and the results of a tradeoff analysis are presented.
{"title":"Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems","authors":"R. E. Ahmed, R. Frazier, P. Marinos","doi":"10.1109/FTCS.1990.89338","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89338","url":null,"abstract":"Three cache-aided error-recovery algorithms for use in shared-memory multiprocessor systems are presented. They rely on hardware and specially designed cache memory for all their soft error management operations and can be easily incorporated into existing cache-coherence protocols. An example illustrating their use in a multiprocessor system employing Dragon as its cache-coherence protocol is given, and the results of a tradeoff analysis are presented.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116627579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A three-valued (0, 1, and 1/2) neural network, which is an extension of the binary Hopfield model, is proposed, and it is shown that the test generation problem can be solved by the three-valued model more effectively than by the binary model. In the three-valued model, the energy function of networks, hyperplanes of neurons, and update rules of neuron states are extended so that the third value, 1/2, can be treated satisfactorily. It is proved that the proposed three-valued model always converges. To escape from local minima, an extension of Boltzmann machines, in which the update rules are modified by introducing probabilities of neuron states, is presented.
{"title":"Three-valued neural networks for test generation","authors":"H. Fujiwara","doi":"10.1109/FTCS.1990.89336","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89336","url":null,"abstract":"A three-valued (0, 1, and 1/2) neural network, which is an extension of the binary Hopfield model, is proposed, and it is shown that the test generation problem can be solved by the three-valued model more effectively than by the binary model. In the three-valued model, the energy function of networks, hyperplanes of neurons, and update rules of neuron states are extended so that the third value, 1/2, can be treated satisfactorily. It is proved that the proposed three-valued model always converges. To escape from local minima, an extension of Boltzmann machines, in which the update rules are modified by introducing probabilities of neuron states, is presented.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130001770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A technique for achieving fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the results. In addition, this program leaves behind a trail of data, called a certification trail. In the second phase, another program is run, and it solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared, and if they agree, the results are accepted as correct; otherwise, an error is indicated. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed.
{"title":"Using certification trails to achieve software fault tolerance","authors":"G. Sullivan, G. Masson","doi":"10.1109/FTCS.1990.89397","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89397","url":null,"abstract":"A technique for achieving fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the results. In addition, this program leaves behind a trail of data, called a certification trail. In the second phase, another program is run, and it solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared, and if they agree, the results are accepted as correct; otherwise, an error is indicated. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114215560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
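The three phases described in the certification-trail abstract above can be illustrated with a small sketch. Sorting is used here purely as an assumed example problem, and the function names (sort_with_trail, sort_using_trail, certified_sort) are hypothetical, not taken from the paper. The trail is the sorting permutation; the second phase applies it in linear time and validates it, so a corrupted trail is reported rather than trusted.

```python
def sort_with_trail(items):
    """Phase 1: solve the problem and leave a certification trail.

    The trail here is the permutation of indices that sorts the input.
    """
    trail = sorted(range(len(items)), key=lambda i: items[i])
    result = [items[i] for i in trail]
    return result, trail


def sort_using_trail(items, trail):
    """Phase 2: solve the problem again with a simpler, linear-time program.

    The trail is never trusted blindly: it must be a permutation of the
    input indices and must produce a non-decreasing output.
    """
    n = len(items)
    if len(trail) != n:
        raise ValueError("trail has wrong length")
    seen = [False] * n
    result = []
    for idx in trail:
        if not isinstance(idx, int) or not 0 <= idx < n or seen[idx]:
            raise ValueError("trail is not a permutation of the indices")
        seen[idx] = True
        result.append(items[idx])
    for a, b in zip(result, result[1:]):
        if a > b:
            raise ValueError("trail does not sort the input")
    return result


def certified_sort(items):
    """Phase 3: compare both results; accept only if they agree."""
    first, trail = sort_with_trail(items)
    second = sort_using_trail(items, trail)
    if first != second:
        raise RuntimeError("results disagree: error indicated")
    return first
```

In this sketch the second phase avoids comparison sorting entirely, which mirrors the abstract's point that the trail lets a less complex program re-solve the problem more quickly.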
Identification, characterization, and construction of fault patterns that are catastrophic for linear systolic arrays are discussed. It is shown that for a given link configuration in the array, it is possible to identify all PE (processing element) catastrophic fault patterns. The requirement on the minimum number of faults in a fault pattern and its spectrum (spread out) for it to be catastrophic is shown to be a function of the length of the longest bypass link available, and not of the total number of bypass links. The paper also gives bounds on the width of a catastrophic fault spectrum.
{"title":"Fault-intolerance of reconfigurable systolic arrays","authors":"A. Nayak, N. Santoro, Richard Tan","doi":"10.1109/FTCS.1990.89367","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89367","url":null,"abstract":"Identification, characterization, and construction of fault patterns that are catastrophic for linear systolic arrays are discussed. It is shown that for a given link configuration in the array, it is possible to identify all PE (processing element) catastrophic fault patterns. The requirement on the minimum number of faults in a fault pattern and its spectrum (spread out) for it to be catastrophic is shown to be a function of the length of the longest bypass link available, and not of the total number of bypass links. The paper also gives bounds on the width of a catastrophic fault spectrum.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":" 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132124511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
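A rough way to see why the longest bypass link matters for catastrophic fault patterns is a reachability check on a simplified model. The sketch below assumes a linear array of n PEs with unidirectional links: a link of length g connects position p to position p + g, a virtual input sits at position -1, and a virtual output is reached by any link that runs past the last PE. This model and the name is_catastrophic are assumptions for illustration, not the paper's formal definitions.

```python
def is_catastrophic(n, link_lengths, faulty):
    """Return True if no fault-free path leads from input to output.

    n            -- number of PEs, at positions 0 .. n-1
    link_lengths -- lengths of available links (1 = regular, >1 = bypass)
    faulty       -- positions of faulty PEs
    Assumed model: unidirectional links, input at -1, output at >= n.
    """
    faulty = set(faulty)
    reachable = {-1}
    frontier = [-1]
    while frontier:
        p = frontier.pop()
        for g in link_lengths:
            q = p + g
            if q >= n:
                return False  # a link reaches the output
            if q not in faulty and q not in reachable:
                reachable.add(q)
                frontier.append(q)
    return True  # the output was never reached
```

Under these assumptions, with links of length 1 and 3, three consecutive faulty PEs cut the array, while two consecutive faults are bypassed by the length-3 link; adding more links of length 3 or shorter would not change that outcome.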
A class of combinational circuits, called (k,K)-circuits, is presented, and a polynomial-time algorithm to detect any single or multiple stuck fault in such circuits is introduced. The (k,K)-circuits are a generalization of H. Fujiwara's (1988) K-bounded circuits. The fault detection problem is formulated as an energy minimization problem using the bidirectional neural net model proposed earlier. A minimizing point of the energy function corresponds to a test. A polynomial-time algorithm is presented here to solve the single and multiple fault-detection problem for the (k,K)-circuits by recursively eliminating variables in the energy function.
{"title":"Polynomial time solvable fault detection problems","authors":"S. Chakradhar, V. Agrawal, M. Bushnell","doi":"10.1109/FTCS.1990.89335","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89335","url":null,"abstract":"A class of combinational circuits, called the (k,K)-circuits is presented, and a polynomial-time algorithm to detect any single or multiple stuckfault in such circuits is introduced. The (k,K)-circuits are a generalization of H. Fujiwara's (1988) K-bounded circuits. The fault detection problem is formulated as an energy minimization problem using the bidirectional neural net model proposed earlier. A minimizing point of the energy function corresponds to a test. A polynomial-time algorithm is presented here to solve the single and multiple fault-detection problem for the (k,K)-circuits by recursively eliminating variables in the energy function.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130501622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To investigate the effectiveness of serializable back-to-back testing and other issues in multiversion software systems, an experiment was performed. The authors discuss the use of multiple implementations for fault prevention throughout development, particularly during the testing phase. The specifications chosen were written in languages that meet industrial standards. The application is a communication protocol based on the Open Systems Interconnection (OSI) layered model adopted by the International Organization for Standardization (ISO) in 1979. The OSI layered model is introduced, the generation of appropriate test cases is discussed, and the testing environment is presented. The serializable back-to-back testing paradigm is presented in detail, along with testing results.
{"title":"Techniques for building dependable distributed systems: multi-version software testing","authors":"John P. J. Kelly, T. McVittie, S. C. Murphy","doi":"10.1109/FTCS.1990.89394","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89394","url":null,"abstract":"To investigate the effectiveness of serializable back-to-back testing and other issues in multiversion software systems, an experiment was performed. The authors discuss the use of multiple implementations for fault prevention throughout development, particularly during the testing phase. The specifications chosen were written in languages that meet industrial standards. The application is a communication protocol based on the Open Systems Interconnection (OSI) layered model adopted by the International Organization for Standardization (ISO) in 1979. The OSI layered model is introduced, the generation of appropriate test cases is discussed, and the testing environment is presented. The serializable back-to-back testing paradigm is presented in detail, along with testing results.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114410539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Three kinds of faults are considered: stuck-at faults, bridging faults, and crosspoint faults. A new way of repairing bridging faults is introduced. It is shown that the problem of finding a minimum cover is NP-complete but that a special case of this problem can be formulated as a 2-SAT problem, which can be solved in polynomial time. The problem of finding a feasible cover for RPLAs (reconfigurable programmable logic arrays) with bridging faults alone is shown to be NP-complete. A necessary and sufficient condition on the number of spares for the existence of a feasible cover and an algorithm for finding a minimum feasible cover are presented.
{"title":"Fault covers in reconfigurable PLAs","authors":"N. Hasan, C. Liu","doi":"10.1109/FTCS.1990.89352","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89352","url":null,"abstract":"Three kinds of faults are considered: stuck-at faults, bridging faults, and crosspoint faults. A new way of repairing bridging faults is introduced. It is shown that the problem of finding a minimum cover is NP-complete but that a special case of this problem can be formulated as a 2-SAT problem, which can be solved in polynomial time. The problem of finding a feasible cover for RPLAs (reconfigurable programmable logic arrays) with bridging faults alone is shown to be NP-complete. A necessary and sufficient condition on the number of spares for the existence of a feasible cover and an algorithm for finding a minimum feasible cover are presented.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123838307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Nicola, Marvin K. Nakayama, P. Heidelberger, A. Goyal
An approach to simulating models of highly dependable systems with general failure and repair time distributions is described. The approach combines importance sampling with event rescheduling in order to obtain variance reduction in such rare event simulations. The approach is general in nature and allows effective simulation of a variety of features commonly arising in dependability modeling. For example, it is shown how the technique can be applied to systems with periodic maintenance. The effects on the steady-state availability of the maintenance period and of different failure time distributions are explored. Some of the trade-offs involved in the design of specific rescheduling rules are described, and their potential effectiveness in simulations of systems with nonexponential failure and repair time distributions is demonstrated. It is found that an effective method for selecting the rescheduling distribution is to keep the probability of a failure transition in the range between 0.1 and 0.5.
{"title":"Fast simulation of dependability models with general failure, repair and maintenance processes","authors":"V. Nicola, Marvin K. Nakayama, P. Heidelberger, A. Goyal","doi":"10.1109/FTCS.1990.89387","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89387","url":null,"abstract":"An approach to simulating models of highly dependable systems with general failure and repair time distributions is described. The approach combines importance sampling with event rescheduling in order to obtain variance reduction in such rare event simulations. The approach is general in nature and allows effective simulation of a variety of features commonly arising in dependability modeling. For example, it is shown how the technique can be applied to systems with periodic maintenance. The effects on the steady-state availability of the maintenance period and of different failure time distributions are explored. Some of the trade-offs involved in the design of specific rescheduling rules are described, and their potential effectiveness in simulations of systems with nonexponential failure and repair time distributions are demonstrated. It is found that an effective method for selecting the rescheduling distribution is to keep the probability of a failure transition in the range between 0.1 and 0.5.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"749 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123866814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The design of two reconfiguration strategies for hypercube multicomputer architectures under failures is discussed. The first scheme uses spare processors attached to certain processors in the hypercube by means of a novel embedding technique. The second approach places spare processors between specific links in the hypercube. Both schemes involve the mapping of logical links of a virtual hypercube onto a set of physical links in the final reconfigured hypercube and hence suffer some performance degradation.
{"title":"Strategies for reconfiguring hypercubes under faults","authors":"P. Banerjee","doi":"10.1109/FTCS.1990.89368","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89368","url":null,"abstract":"The design of two reconfiguration strategies for hypercube multicomputer architectures under failures is discussed. The first scheme uses spare processors attached to certain processors in the hypercube by means of a novel embedding technique. The second approach places spare processors between specific links in the hypercube. Both schemes involve the mapping of logical links of a virtual hypercube onto a set of physical links in the final reconfigured hypercube and hence suffer some performance degradation.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123436447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors present a distributed table-filling algorithm for point-to-point routing in a degraded hypercube system. This algorithm finds the shortest existing path from each source to each destination in the faulty hypercube and fills the routing tables so that messages are routed along these paths. A novel scheme for broadcast routing with tables is proposed, and the algorithm required to fill the broadcast tables, given the point-to-point routing tables, is presented. In addition, the modifications necessary to make these algorithms ensure deadlock-free routing are given. A quantitative and qualitative comparison of previously proposed reroute strategies with table routing, where the tables are filled by the authors' algorithms, is presented.
{"title":"Distributed algorithms for shortest-path, deadlock-free routing and broadcasting in arbitrarily faulty hypercubes","authors":"M. Peercy, P. Banerjee","doi":"10.1109/FTCS.1990.89369","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89369","url":null,"abstract":"The authors present a distributed table-filling algorithm for point-to-point routing in a degraded hypercube system. This algorithm finds the shortest length existing path from each source to each destination in the faulty hypercube and fills the routing tables so that messages are routed along these paths. A novel scheme for broadcast routing with tables is proposed, and the algorithm required to fill the broadcast tables, given the point-to-point routing tables, is presented. In addition, the modifications necessary to make these algorithms ensure deadlock-free routing are given. A quantitative and equalitative comparison of previously proposed reroute strategies with table routing, where the tables are filled by the authors' algorithms, are presented.<<ETX>>","PeriodicalId":174189,"journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123739399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
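The idea of filling point-to-point routing tables with shortest existing paths in a faulty hypercube can be sketched with a plain breadth-first search. This is a centralized illustration under stated assumptions, not the authors' distributed algorithm: only node faults are modeled (link faults, broadcast tables, and deadlock avoidance are omitted), nodes are integers whose binary digits are hypercube coordinates, and fill_routing_table is a hypothetical name.

```python
from collections import deque


def fill_routing_table(dim, faulty_nodes, dest):
    """Compute next hops toward `dest` in a dim-dimensional hypercube.

    Returns (next_hop, dist): next_hop[node] is the neighbor to forward
    to on a shortest fault-free path, dist[node] is that path's length.
    Nodes that cannot reach `dest` are absent from both dictionaries.
    """
    faulty = set(faulty_nodes)
    if dest in faulty:
        return {}, {}
    dist = {dest: 0}
    next_hop = {}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for bit in range(dim):
            neighbor = node ^ (1 << bit)  # flip one coordinate
            if neighbor in faulty or neighbor in dist:
                continue
            dist[neighbor] = dist[node] + 1
            next_hop[neighbor] = node  # first discovery is a shortest path
            queue.append(neighbor)
    return next_hop, dist
```

For example, in a 3-cube with nodes 1 and 2 faulty, node 0 can still reach node 3, but only over a path of length 4 rather than the fault-free Hamming distance of 2. Running the search once per destination yields a full table for every source.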