Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243580
P. Veríssimo, Luís E. T. Rodrigues
The authors present a clock synchronization algorithm, a posteriori agreement, based on a new variant of the well-known convergence nonaveraging technique. By exploiting an obvious characteristic of broadcast networks, the algorithm largely reduces the effect of message delivery delay variance. As a consequence, the precision achieved by the algorithm is drastically improved. Accuracy preservation is near optimal. The solution does not require dedicated hardware.
{"title":"A posteriori agreement for fault-tolerant clock synchronization on broadcast networks","authors":"P. Veríssimo, Luís E. T. Rodrigues","doi":"10.1109/FTCS.1992.243580","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243580","url":null,"abstract":"The authors present a clock synchronization algorithm, a posteriori agreement, based on a new variant of the well-known convergence nonaveraging technique. Exploiting an obvious characteristic of broadcast networks, largely reduces the effect of message delivery delay variance. In consequence, the precision achieved by the algorithm is drastically improved. Accuracy preservation is near to optimal. The solution does not require the use of dedicated hardware.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122883476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243594
Yoichi Koyanagi, Y. Tohma
The authors discuss the influence of stuck-at faults in neural networks for solving optimization problems. They use a Hopfield model of a neural network, applying it to a five-city traveling salesman problem. The asymmetric nature of the network's fault tolerance against stuck-at-zero and stuck-at-one faults is revealed. A method to alleviate this asymmetry and greatly enhance the fault tolerance is proposed.
{"title":"Fault tolerant neural networks in optimization problems","authors":"Yoichi Koyanagi, Y. Tohma","doi":"10.1109/FTCS.1992.243594","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243594","url":null,"abstract":"The authors discuss the influence of stuck-at faults in neural networks for solving optimization problems. They use a Hopfield model of a neural network, applying it to the traveling salesman problem of five cities. The asymmetric nature of fault tolerance of the network against stuck-at-zero and stuck-at-one faults is revealed. A method to alleviate this asymmetry and enhance the fault tolerance greatly is proposed.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122060038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243603
C. Raghavendra, P. Yang, S. Tien
In the n-dimensional hypercube, Q_n, for large n, faults can occur with relatively high probability. How to use the inherent redundancy present in the hypercube to obtain fault tolerance is discussed, along with computing in faulty hypercubes. The authors study the fault tolerance independently present in hypercubes by defining and using the concept of free dimensions. Briefly, in Q_n, a dimension is said to be free if no pair of nodes across the dimension link are both faulty. Efficient algorithms are presented for finding free dimensions, given a set of faulty nodes, and it is shown that at least n-f+1 free dimensions exist with f ≤ n faulty nodes. Free dimensions can be used to partition Q_n into subcubes such that each subcube contains at most one fault. Such a partitioning helps in achieving fault tolerance via emulation, embedding, and reconfiguration. It also helps in designing efficient routing and broadcasting algorithms in faulty hypercubes.
{"title":"Free dimensions-an effective approach to achieving fault tolerance in hypercube","authors":"C. Raghavendra, P. Yang, S. Tien","doi":"10.1109/FTCS.1992.243603","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243603","url":null,"abstract":"In the n-dimensional hypercube, Q/sub n/, for large n, faults can occur with relatively high probability. How to use the inherent redundancy present in the hypercube to obtain fault tolerance is discussed, along with computing in faulty hypercubes. The authors study the fault tolerance independently present in hypercubes by defining and using the concept of free dimensions. Briefly, in Q/sub n/, a dimension is said to be free if no pair of nodes across the dimension link are both faulty. Efficient algorithms are presented for finding free dimensions, given a set of faulty nodes, and it is shown that at least n-f+1 free dimensions exist with f<or=n faulty nodes. Free dimensions can be used to partition Q/sub n/ into subcubes such that each subcube contains at most one fault. Such a partitioning helps in achieving fault tolerance via emulation, embedding, and reconfiguration. It also helps in designing efficient routing and broadcasting algorithms in faulty hypercubes.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130563477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
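The free-dimension definition in the abstract above can be illustrated with a minimal sketch. It assumes that the nodes of Q_n are labeled with n-bit integers and that the neighbor of a node across dimension d is obtained by flipping bit d. The function name `free_dimensions` and the brute-force scan are illustrative choices, not the efficient algorithms presented in the paper.

```python
from typing import Iterable, List


def free_dimensions(n: int, faulty: Iterable[int]) -> List[int]:
    """Return the free dimensions of Q_n for a given set of faulty nodes.

    Assumption (not from the paper): nodes are n-bit integers, and the
    neighbor of node v across dimension d is v with bit d flipped.
    A dimension d is free when no faulty node has a faulty neighbor
    across d.
    """
    faulty_set = set(faulty)
    free = []
    for d in range(n):
        mask = 1 << d
        # Dimension d is not free if some faulty node's neighbor across d
        # is also faulty.
        if not any((v ^ mask) in faulty_set for v in faulty_set):
            free.append(d)
    return free


if __name__ == "__main__":
    # Hypothetical example: Q_4 with faulty nodes 0000, 0001 and 0110.
    n, faulty = 4, [0b0000, 0b0001, 0b0110]
    dims = free_dimensions(n, faulty)
    print(dims)  # [1, 2, 3]
    # The abstract's bound: at least n - f + 1 free dimensions for f <= n.
    assert len(dims) >= n - len(faulty) + 1
```

In this example only dimension 0 joins two faulty nodes (0000 and 0001), so three dimensions are free, which is consistent with the stated lower bound of n-f+1 = 2.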
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243613
Y. Amir, D. Dolev, S. Kramer, D. Malkhi
The authors describe Transis, a communication subsystem for high availability. Transis is a transport layer that supports reliable multicast services. The main novelty is in the efficient implementation using broadcast. The basis of Transis is automatic maintenance of dynamic membership. The membership algorithm is symmetrical, operates within the regular flow of messages, and overcomes partitions and remerging. The higher layer provides various multicast services for sets of processes.
{"title":"Transis: a communication subsystem for high availability","authors":"Y. Amir, D. Dolev, S. Kramer, D. Malkhi","doi":"10.1109/FTCS.1992.243613","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243613","url":null,"abstract":"The authors describe Transis, a communication subsystem for high availability. Transis is a transport layer that supports reliable multicast services. The main novelty is in the efficient implementation using broadcast. The basis of Transis is automatic maintenance of dynamic membership. The membership algorithm is symmetrical, operates within the regular flow of messages, and overcomes partitions and remerging. The higher layer provides various multicast services for sets of processes.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121743054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243586
M. Sullivan, R. Chillarege
An analysis of software defects reported at customer sites in two large IBM database management products, DB2 and IMS, is presented. The analysis considers several different error classification systems and compares the results to those of an earlier study of field defects in IBM's MVS operating system. The authors compare the error type, defect type, and error trigger distributions of the DB2, IMS, and MVS products; show that there may exist an asymptotic behavior in the error type distribution as a function of a defect type; and discuss the undefined state errors that dominate the error type distribution.
{"title":"A comparison of software defects in database management systems and operating systems","authors":"M. Sullivan, R. Chillarege","doi":"10.1109/FTCS.1992.243586","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243586","url":null,"abstract":"An analysis of software defects reported at customer sites in two large IBM database management products, DB2 and IMS, is presented. The analysis considers several different error classification systems and compares the results to those of an earlier study of field defects in IBM's MVS operating system. The authors compare the error type, defect type, and error trigger distributions of the DB2, IMS, and MVS products; show that there may exist an asymptotic behavior in the error type distribution as a function of a defect type; and discuss the undefined state errors that dominate the error type distribution.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"34 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131372609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243601
M. S. Alam, R. Melhem
The authors consider a class of modular multiprocessor architectures in which spares are added to each module to cover for faulty nodes within that module, thus forming a fault tolerant basic block (FTBB). The goal is to preserve the logical adjacency between active nodes by means of a routing algorithm which delivers messages successfully to their destinations. Two-phase routing strategies are introduced that route messages first to their destination FTBB, and then to the destination nodes within the destination FTBB. This strategy may be applied to a variety of architectures including binary hypercubes and 3-D tori. In the presence of f faults in these systems, it is shown that the worst-case length of the message route is max(σ + f, (K+1)σ) + M, where σ is the shortest path in the absence of faults, and M and K are the numbers of primary nodes and spare nodes in an FTBB, respectively. The average routing overhead is much lower than the worst-case overhead.
{"title":"Routing in modular fault tolerant multiprocessor systems","authors":"M. S. Alam, R. Melhem","doi":"10.1109/FTCS.1992.243601","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243601","url":null,"abstract":"The authors consider a class of modular multiprocessor architectures in which spares are added to each module to cover for faulty nodes within that module, thus forming a fault tolerant basic block (FTBB). The goal is to preserve the logical adjacency between active nodes by means of a routing algorithm which delivers messages successfully to their destinations. Two phase routing strategies are introduced that route messages first to their destination FTBB, and then to the destination nodes within the destination FTBB. This strategy may be applied to a variety of architectures including binary hypercubes and 3-D tori. In the presence of f faults in these systems. It is shown that the worst case length of the message route is max( sigma +f, (K+1) sigma )+M, where sigma is the shortest path in the absence of faults, and M and K are the numbers of primary nodes and spare nodes in a FTBB, respectively. The average routing overhead is much lower than the worst case overhead.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"2009 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131904718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
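The worst-case bound quoted in the abstract above, max(sigma + f, (K+1) sigma) + M, can be evaluated directly. The sketch below only restates that expression; the function name `worst_case_route_length` and the parameter values in the example are made up for illustration and do not come from the paper.

```python
def worst_case_route_length(sigma: int, f: int, K: int, M: int) -> int:
    """Evaluate max(sigma + f, (K + 1) * sigma) + M from the abstract.

    sigma: shortest path length in the absence of faults
    f:     number of faults in the system
    K:     number of spare nodes in an FTBB
    M:     number of primary nodes in an FTBB
    """
    if min(sigma, f, K, M) < 0:
        raise ValueError("all parameters must be non-negative")
    return max(sigma + f, (K + 1) * sigma) + M


if __name__ == "__main__":
    # Hypothetical values: the (K + 1) * sigma term dominates here.
    print(worst_case_route_length(sigma=4, f=3, K=1, M=4))  # 12
    # Hypothetical values: the sigma + f term dominates here.
    print(worst_case_route_length(sigma=2, f=5, K=1, M=4))  # 11
```

The two calls show which term of the maximum dominates: with few faults relative to the path length the (K+1) sigma term sets the bound, while with many faults the sigma + f term does.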
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243563
K. Huang, V. Agarwal, L. LaForge
A novel diagnosis scheme is proposed for wafer testing, in which the test access port of each die is utilized to perform comparison tests on its neighbors. A probabilistic diagnosis algorithm is presented, which correctly identifies almost all dies, even when the probability of failure of a die is larger than 0.5. The algorithm is shown to be particularly suitable for constant degree structures, such as rectangular and octagonal grids. The algorithm is designed for wafer scale structures, where the boundary dies do not have a complete regular structure. The algorithm also allows for the fault coverage of the tests to be imperfect. In addition, diagnosis is done locally. Both the test time and the diagnosis time are invariant with respect to the number of dies on the wafer. The algorithm can also tolerate some systematic errors. The dies are tested in parallel with this approach.
{"title":"Wafer testing with pairwise comparisons","authors":"K. Huang, V. Agarwal, L. LaForge","doi":"10.1109/FTCS.1992.243563","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243563","url":null,"abstract":"A novel diagnosis scheme is proposed for wafer testing, in which the test access port of each die is utilized to perform comparison tests on its neighbors. A probabilistic diagnosis algorithm is presented, which correctly identifies almost all dies, even when the probability of failure of a die is larger than 0.5. The algorithm is shown to be particularly suitable for constant degree structures, such as rectangular and octagonal grids. The algorithm is designed for wafer scale structures, where the boundary dies do not have a complete regular structure. The algorithm also allows for the fault coverage of the tests to be imperfect. In addition, diagnosis is done locally. Both the test time and the diagnosis time are invariant with respect to the number of dies on the wafer. The algorithm can also tolerate some systematic errors. The dies are tested in parallel with this approach.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133623394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243616
P. Jalote
In a distributed computation being performed by a network of communicating processes, failure of a process due to the failure of its host node can cause the entire computation to be aborted. The author proposes a scheme to make a distributed program resilient to the failure of one of its constituent processes. The distributed computation is completed despite the failure of a process. The scheme is for CSP programs and allows nondeterminism within a process. In CSP, the process name is used in input/output commands. Since synchronous communication is used, if a process specified in the input/output command of a process P does not execute a matching output/input command, P might get blocked. In the proposed scheme, if a process fails, another process starts executing on a backup node from the last checkpoint (CP) of the failed process. Programmed exception handling is used to ensure proper recovery and fault tolerance.
{"title":"Dynamic reconfiguration of CSP programs for fault tolerance","authors":"P. Jalote","doi":"10.1109/FTCS.1992.243616","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243616","url":null,"abstract":"In a distributed computation being performed by a network of communicating processes, failure of a process due to the failure of its host node can cause the entire computation to be aborted. The author proposes a scheme to make a distributed program resilient to the failure of one of its constituent processes. The distributed computation is completed despite the failure of a process. The scheme is for CSP programs and allows nondeterminism within a process. In CSP, the process name is used in input/output commands. Since synchronous communication is used, if a process specified in the input/output command of a process P does not execute a matching output/input command, P might get blocked. In the proposed scheme, if a process fails, another process starts executing on a backup node from the last checkpoint (CP) of the failed process. Programmed exception handling is used to ensure proper recovery and fault tolerance.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131281693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243562
D. Powell
A method is proposed for the formal analysis of failure mode assumptions and for the evaluation of the dependability of systems whose design correctness is conditioned on the validity of such assumptions. Formal definitions are given for the types of errors that can affect items of service delivered by a system or component. Failure mode assumptions are then formalized as assertions on the types of errors that a component may induce in its enclosing system. The concept of assumption coverage is introduced to relate the notion of partially-ordered assumption assertions to the quantification of system dependability. Assumption coverage is shown to be extremely important in systems requiring very high dependability. It is also shown that the need to increase system redundancy to accommodate more severe modes of component failure can sometimes result in a decrease in dependability.
{"title":"Failure mode assumptions and assumption coverage","authors":"D. Powell","doi":"10.1109/FTCS.1992.243562","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243562","url":null,"abstract":"A method is proposed for the formal analysis of failure mode assumptions and for the evaluation of the dependability of systems whose design correctness is conditioned on the validity of such assumptions. Formal definitions are given for the types of errors that can affect items of service delivered by a system or component. Failure node assumptions are then formalized as assertions on the types of errors that a component may induce in its enclosing system. The concept of assumption coverage is introduced to relate the notion of partially-ordered assumption assertions to the quantification of system dependability. Assumption coverage is shown to be extremely important in systems requiring very high dependability. It is also shown that the need to increase system redundancy to accommodate more severe modes of component failure can sometimes result in a decrease in dependability.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114691611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1992-07-08 DOI: 10.1109/FTCS.1992.243614
N. J. Alewine, Shyh-Kwei Chen, C. Li, W. Fuchs, Wen-mei W. Hwu
A compiler-assisted approach to implementing multiple instruction retry has recently been developed by C.-C. J. Li et al. (1991). The authors extend compiler-assisted multiple instruction retry to include a broad class of code execution failures. Five benchmarks were used to measure the performance penalty of hazard resolution. Results indicate that the enhanced pure software approach can produce performance penalties consistent with existing hardware techniques. A combined compiler/hardware resolution strategy is also described and evaluated. Experimental results indicate a lower performance penalty than with either a totally hardware or totally software approach.
{"title":"Branch recovery with compiler-assisted multiple instruction retry","authors":"N. J. Alewine, Shyh-Kwei Chen, C. Li, W. Fuchs, Wen-mei W. Hwu","doi":"10.1109/FTCS.1992.243614","DOIUrl":"https://doi.org/10.1109/FTCS.1992.243614","url":null,"abstract":"A compiler-assisted approach to implementing multiple instruction retry has recently been developed by C.-C. J Li et al. (1991). They extend compiler-assisted multiple instruction retry to include a broad class of code execution failures. Five benchmarks were used to measure the performance penalty of hazard resolution. Results indicate that the enhanced pure software approach can produce performance penalties consistent with existing hardware techniques. A combined compiler/hardware resolution strategy is also described and was evaluated. Experimental results indicate a lower performance penalty than with either a totally hardware or totally software approach.<<ETX>>","PeriodicalId":360985,"journal":{"name":"[1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing","volume":"408 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124333652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}