Characterization and design of sequentially t-diagnosable systems
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105635
Shi-ze Huang, Jie Xu, Tinghuai Chen
In the system-level diagnosis area, F.P. Preparata, G. Metze, and R.T. Chien (1967) first presented a formal graph-theoretic model and introduced the concept of sequentially t-diagnosable systems. A system S is called sequentially t-diagnosable if, given any complete collection of test results, at least one faulty unit in S can be identified, provided the number of faulty units does not exceed t. However, until very recently, developing a characterization theorem of sequentially t-diagnosable systems for the PMC model was still an important open problem. The authors resolve this problem by presenting the first complete characterization. A canonical class of systems, D_{1,k} systems, is discussed, and a valuable result on their sequential t-diagnosability is obtained.
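For intuition, here is a brute-force sketch (an illustration under stated assumptions, not the paper's characterization): it builds the classical D_{1,k} test assignment, in which unit i tests the next k units modulo n, and checks the quoted definition directly by enumerating PMC-consistent fault sets for every syndrome. It is exponential and only practical for toy sizes.

```python
import itertools

def d1k_tests(n, k):
    """Testing graph of a D_{1,k} design: unit i tests units i+1, ..., i+k (mod n)."""
    return [(i, (i + j) % n) for i in range(n) for j in range(1, k + 1)]

def consistent_sets(n, tests, s, t):
    """All fault sets of size <= t consistent with syndrome s under the PMC model:
    every test by a fault-free unit must report the tested unit's true status
    (1 = faulty); test results produced by faulty units are unconstrained."""
    return [set(F) for r in range(t + 1)
            for F in itertools.combinations(range(n), r)
            if all(s[(u, v)] == (v in F) for (u, v) in tests if u not in F)]

def sequentially_t_diagnosable(n, tests, t):
    """Brute-force check of the quoted definition: for every syndrome that some
    non-empty fault set of size <= t can produce, at least one unit must belong
    to every consistent explanation, or no faulty unit is surely identifiable."""
    for bits in itertools.product((0, 1), repeat=len(tests)):
        s = dict(zip(tests, bits))
        cons = consistent_sets(n, tests, s, t)
        nonempty = [F for F in cons if F]
        if not nonempty:
            continue  # syndrome unreachable, or explainable only by "no faults"
        if set() in cons or not set.intersection(*nonempty):
            return False
    return True

# Expected True: D_{1,k} with n >= 2k+1 is even one-step k-diagnosable.
print(sequentially_t_diagnosable(5, d1k_tests(5, 2), 2))
```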
{"title":"Characterization and design of sequentially t-diagnosable systems","authors":"Shi-ze Huang, Jie Xu, Tinghuai Chen","doi":"10.1109/FTCS.1989.105635","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105635","url":null,"abstract":"In the system-level diagnosis area, F.P. Preparata, G. Metze, and R.T. Chien (1967) first presented a formal graph-theoretic model and introduced the concept of sequentially t-diagnosable systems. A system S is called sequentially t-diagnosable if, given any complete collection of test results, at least one faulty unit in S can be identified, provided the number of faulty units does not exceed t. However, until very recently, developing a characterization theorem of sequentially t-diagnosable systems for the PMC model was still an important, open problem. The authors resolve this problem by presenting the first complete characterization. A canonical class of systems, D/sub 1,k/ systems, is discussed, and a valuable result on the sequential t-diagnosability is obtained.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126675125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workload redistribution for fault-tolerance in a hard real-time distributed computing system
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105594
S. Balaji, L. Jenkins, L. Patnaik, P. S. Goel
In a hard real-time distributed computing system (HRTDCS), all the tasks are required to meet their associated deadlines; a task not meeting its deadline leads to a catastrophic failure of the system. The authors consider an HRTDCS that executes both periodic and aperiodic tasks associated with timing, precedence, and resource constraints. The fault-tolerance capability in such a system is achieved through the use of time redundancy. The problem of workload redistribution for fault tolerance in an HRTDCS is studied. A graph model to represent the system workload is developed. Three performance measures for the analysis of an HRTDCS are defined. A nonpreemptive scheduling algorithm is proposed to distribute the workload of the operational nodes of the HRTDCS in the presence of both hardware and task failures. This task allocation strategy is applied to a practical system, namely, the HRTDCS onboard a spacecraft. The performance measures obtained for a typical system workload indicate that the algorithm is quite suitable for an HRTDCS with regard to uniform workload distribution.
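The abstract does not reproduce the authors' algorithm; as a hedged illustration of nonpreemptive, deadline-driven redistribution of the kind described (the greedy policy, task names, and node model below are my assumptions, not the paper's method):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    exec_time: float
    deadline: float

def allocate(tasks, node_ready):
    """Greedy nonpreemptive allocation: earliest-deadline task first, each task
    placed on the node that frees up soonest; a task that cannot meet its
    deadline anywhere is reported as missed."""
    schedule, missed = [], []
    for task in sorted(tasks, key=lambda t: t.deadline):
        node = min(node_ready, key=node_ready.get)
        finish = node_ready[node] + task.exec_time
        if finish <= task.deadline:
            node_ready[node] = finish
            schedule.append((task.name, node, finish))
        else:
            missed.append(task.name)  # in an HRTDCS this is a catastrophic miss
    return schedule, missed

# Example: tasks of a failed node redistributed over two surviving nodes.
nodes = {"N1": 0.0, "N2": 0.0}  # node -> time at which it next becomes free
tasks = [Task("attitude", 2, 5), Task("telemetry", 1, 3), Task("thermal", 2, 9)]
print(allocate(tasks, nodes))
```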
{"title":"Workload redistribution for fault-tolerance in a hard real-time distributed computing system","authors":"S. Balaji, L. Jenkins, L. Patnaik, P. S. Goel","doi":"10.1109/FTCS.1989.105594","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105594","url":null,"abstract":"In a hard real-time distributed computing system (HRTDCS), all the tasks are required to meet their associated deadlines; a task not meeting its deadline leads to a catastrophic failure of the system. The authors consider an HRTDCS that executes both periodic and aperiodic tasks associated with timing, precedence, and resource constraints. The fault-tolerance capability in such a system is achieved through the use of time redundancy. The problem of workload redistribution for fault tolerance in an HRTDCS is studied. A graph model to represent the system workload is developed. Three performance measures for the analysis of an HRTDCS are defined. A nonpreemptive scheduling algorithm is proposed to distribute the workload of the operational nodes of the HRTDCS in the presence of both hardware and task failures. This task allocation strategy is applied to a practical system, namely, the HRTDCS onboard a spacecraft. The performance measures obtained for a typical system workload indicate that the algorithm is quite suitable for an HRTDCS with regard to uniform workload distribution.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116764056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new approach of test confidence estimation
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105584
M. Jacomino, R. David
Two measures of test confidence in tested circuits are presented. One takes into account all circuits tested and appears to be a novel measure that is of interest to circuit manufacturers. The other measure, which has already been introduced, takes into account only those circuits that have passed the test and is of interest to the circuit user. Both measures are functions of the same variable, called faulty circuit coverage, which quantifies the confidence in the test sequence. This variable is rather difficult to compute. Therefore, a novel approach to approximating the faulty circuit coverage, based on a partition of the prescribed set of faults, is proposed.
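A worked sketch of the two viewpoints (the formulas below are standard textbook relations used as assumed stand-ins for the paper's exact measures; Y is process yield and C the faulty circuit coverage, i.e. the probability that a faulty circuit fails the test, and good circuits are assumed to always pass):

```python
def manufacturer_confidence(yield_, coverage):
    """Fraction of all tested circuits that are correctly classified:
    good circuits that pass plus faulty circuits that are caught."""
    return yield_ + (1 - yield_) * coverage

def user_confidence(yield_, coverage):
    """P(circuit is good | circuit passed the test), by Bayes' rule."""
    passed = yield_ + (1 - yield_) * (1 - coverage)
    return yield_ / passed

print(manufacturer_confidence(0.6, 0.95))  # 0.98
print(user_confidence(0.6, 0.95))          # ~0.968
```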
{"title":"A new approach of test confidence estimation","authors":"M. Jacomino, R. David","doi":"10.1109/FTCS.1989.105584","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105584","url":null,"abstract":"Two measures of test confidence in tested circuits are presented. One takes into account all circuits tested and appears to be a novel measure that is of interest to circuit manufacturers. The other measure, which has already been introduced, takes into account only those circuits that have passed the test and is of interest to the circuit user. Both measures are functions of the same variable, called faulty circuit coverage, which quantifies the confidence in the test sequence. This variable is rather difficult to compute. Therefore a novel approach to approximate the faulty circuit coverage, based on a partition of the prescribed set of faults, is proposed.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132788661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability analysis and comparison of two fail-op/fail-op/fail-safe architectures
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105637
Arun Kumar Somani, T. R. Sarnaik
Two different fault-tolerant architectural concepts for a computer node to be used in a distributed embedded environment have been developed to meet the requirement that the system can sustain at least two independent, nonsimultaneous hardware failures and remain operational. The architectures are distinguished by the organization of their fault-tolerance algorithm hardware. An analysis is made of these two architectures, and several issues in the reliability analysis of such complex architectures are addressed. Techniques are developed to reduce the complexity of the reliability model. The interrelationship between the number of retries and system reliability for different average transient lifetimes is also analyzed.
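As a hedged illustration of the retry/transient trade-off that such an analysis examines (a minimal model of my own, assuming exponentially distributed transient durations; not the authors' reliability model):

```python
import math

def retry_success(n_retries, retry_interval, mean_transient):
    """Probability that a transient fault has died out by the time the last of
    n_retries retries (spaced retry_interval apart) is attempted, assuming an
    exponentially distributed transient lifetime with the given mean."""
    return 1.0 - math.exp(-n_retries * retry_interval / mean_transient)

for n in (1, 2, 4, 8):
    print(n, round(retry_success(n, retry_interval=1.0, mean_transient=2.0), 3))
```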
{"title":"Reliability analysis and comparison of two fail-op/fail-op/fail-safe architectures","authors":"Arun Kumar Somani, T. R. Sarnaik","doi":"10.1109/FTCS.1989.105637","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105637","url":null,"abstract":"Two different fault-tolerant architectural concepts for a computer node to be used in a distributed embedded environment have been developed to meet the requirements that the system can sustain at least two independent, nonsimulation hardware failures and remain operational. The architectures are distinguished by the organization of their fault-tolerant algorithm hardware. An analysis is made of these two architectures, and several issues on the reliability analysis of such complex architectures are addressed. Techniques are developed to reduce the complexity of the reliability model. An analysis of the interrelationship between the number of retries and their effect upon system reliability for different average transient lifetimes has also been performed.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133337342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of error detection schemes using fault injection by heavy-ion radiation
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105590
U. Gunneflo, J. Karlsson, J. Torin
Several concurrent error detection schemes suitable for a watchdog processor were evaluated by fault injection. Soft errors were induced into a MC6809E microprocessor by heavy-ion radiation from a Californium-252 source. Recordings of error behavior were used to characterize the errors as well as to determine coverage and latency for the various error detection schemes. The error recordings were used as input to programs that simulate the error detection schemes. The schemes evaluated detected up to 79% of all errors within 85 bus cycles. Fifty-eight percent of the errors caused execution to diverge permanently from the correct program. The best schemes detected 99% of these errors. Eighteen percent of the errors affected only data, and the coverage of these errors was at most 38%.
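The evaluation pipeline suggests a simple replay structure; a sketch of computing coverage and detection latency from recorded traces (the trace format and the detection predicate are hypothetical stand-ins for the recorded MC6809E error behavior and the simulated schemes):

```python
def evaluate(traces, detects):
    """Replay recorded error traces through a detection scheme.
    traces: list of error traces, each a list of per-bus-cycle observations.
    detects: predicate returning True on the cycle the scheme would flag."""
    caught, latencies = 0, []
    for trace in traces:
        for cycle, obs in enumerate(trace):
            if detects(obs):
                caught += 1
                latencies.append(cycle)  # detection latency in bus cycles
                break
    coverage = caught / len(traces)
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return coverage, mean_latency

# Traces would come from the heavy-ion injection recordings; here a stub:
traces = [[{"bus_err": False}, {"bus_err": True}], [{"bus_err": False}]]
print(evaluate(traces, detects=lambda obs: obs["bus_err"]))  # (0.5, 1.0)
```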
{"title":"Evaluation of error detection schemes using fault injection by heavy-ion radiation","authors":"U. Gunneflo, J. Karlsson, J. Torin","doi":"10.1109/FTCS.1989.105590","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105590","url":null,"abstract":"Several concurrent error detection schemes suitable for a watch-dog processor were evaluated by fault injection. Soft errors were induced into a MC6809E microprocessor by heavy-ion radiation from a Californium-252 source. Recordings of error behavior were used to characterize the errors as well as to determine coverage and latency for the various error detection schemes. The error recordings were used as input to programs that simulate the error detection schemes. The schemes evaluated detected up to 79% of all errors within 85 bus cycles. Fifty-eight percent of the errors caused execution to diverge permanently from the correct program. The best schemes detected 99% of these errors. Eighteen percent of the errors affected only data, and the coverage of these errors was at most 38%.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"7 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130490090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pseudo-exhaustive test and segmentation: formal definitions and extended fault coverage results
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105582
J. Udell, E. McCluskey
Formal definitions are presented for segments and segmentations. Under these definitions, the partitionings of a circuit are a subset of the segmentations of that circuit. The fault coverage of an exhaustive test of a segment is then examined. Multiple-output segments, which have not previously been considered in the literature, are shown to present special difficulties, resulting in the definition of a novel type of segment test set. These results are used to present a formal definition for a pseudoexhaustive test using a segmentation. This definition guarantees detection of all detectable faults within segments. Consistency with previous definitions is maintained where practical.
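To make the exhaustive-test-of-a-segment notion concrete, a minimal sketch (the segment here is an arbitrary Boolean function; a real segmentation would be extracted from the circuit netlist):

```python
from itertools import product

def exhaustive_segment_test(segment_fn, n_inputs, reference_fn):
    """Apply all 2^n input patterns to a segment and compare against a known-good
    reference; returns the patterns on which the segment misbehaves."""
    return [bits for bits in product((0, 1), repeat=n_inputs)
            if segment_fn(*bits) != reference_fn(*bits)]

# Example: a 3-input segment with a stuck-at fault on one input.
good = lambda a, b, c: (a & b) | c
bad = lambda a, b, c: (a & 0) | c   # input b stuck-at-0
print(exhaustive_segment_test(bad, 3, good))  # [(1, 1, 0)]
```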
{"title":"Pseudo-exhaustive test and segmentation: formal definitions and extended fault coverage results","authors":"J. Udell, E. McCluskey","doi":"10.1109/FTCS.1989.105582","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105582","url":null,"abstract":"Formal definitions are presented for segments and segmentations. Under these definitions, the partitionings of a circuit are a subset of the segmentations of that circuit. The fault coverage of an exhaustive test of a segment is then examined. Multiple-output segments, which have not previously been considered in the literature, are shown to present special difficulties, resulting in the definition of a novel type of segment test set. These results are used to present a formal definition for a pseudoexhaustive test using a segmentation. This definition guarantees detection of all detectable faults within segments. Consistency with previous definitions is maintained where practical.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117145248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Defects and reliability analysis of large software systems: field experience
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105573
Y. Levendel
The contribution of software to the reliability of large distributed systems is addressed. The author analyzes and models the software development process and presents field experience for these large distributed systems. Defect removal is shown to be the bottleneck in achieving the appropriate quality level before system deployment in the field. The author presents a model that relates generic field introduction to the residual defect level and allows reliability prediction, since system reliability is related to the residual defect level.
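The abstract does not give the model's form; as a generic stand-in only (my assumption, not Levendel's model), a failure intensity proportional to the residual defect count yields the simplest such reliability prediction, with N_res the residual defect level and phi the average per-defect failure intensity in the field:

```latex
\lambda = \phi\, N_{\mathrm{res}},
\qquad
R(t) = e^{-\lambda t} = e^{-\phi\, N_{\mathrm{res}}\, t}
```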
{"title":"Defects and reliability analysis of large software systems: field experience","authors":"Y. Levendel","doi":"10.1109/FTCS.1989.105573","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105573","url":null,"abstract":"The contribution of software to the reliability of large distributed systems is addressed. The author analyzes and models the software development process and presents field experience for these large distributed systems. Defect removal is shown to be the bottleneck in achieving the appropriate quality level before system deployment in the field. The author presents a model that relates generic field introduction to the residual defect level and allows reliability prediction since system reliability is related to the residual defect level.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117152211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On self-diagnosable multiprocessor systems: diagnosis by the comparison approach
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105543
A. Sengupta, A. Dahbura
An analysis is made of the diagnosability and diagnosis problems for a model of a self-diagnosable multiprocessor system where processors compare the results of tasks performed by other processors in the system. A set of criteria is given for determining whether the faulty processors in the system can be diagnosed on the basis of the comparisons, and a polynomial-time algorithm is presented to identify the faulty units of such a system on the basis of the comparison results when the system is known to be diagnosable.
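A minimal sketch of the comparison syndrome such models reason over (the semantics below, including how a faulty comparator behaves, are simplifying assumptions rather than the paper's exact model):

```python
def comparison_syndrome(comparisons, faulty):
    """comparisons: list of (comparator, i, j), where processor 'comparator'
    compares the results of a task run by processors i and j. A fault-free
    comparator reports a mismatch (1) exactly when the results differ, which in
    the worst case happens whenever at least one of i, j is faulty; a faulty
    comparator's report is unreliable (pessimistically set to 'agree' here)."""
    syndrome = {}
    for (k, i, j) in comparisons:
        if k in faulty:
            syndrome[(k, i, j)] = 0       # arbitrary; worst case for diagnosis
        else:
            syndrome[(k, i, j)] = int(i in faulty or j in faulty)
    return syndrome

# Example: processor 0 compares 1 vs 2, processor 3 compares 0 vs 1.
print(comparison_syndrome([(0, 1, 2), (3, 0, 1)], faulty={2}))
```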
{"title":"On self-diagnosable multiprocessor systems: diagnosis by the comparison approach","authors":"A. Sengupta, A. Dahbura","doi":"10.1109/FTCS.1989.105543","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105543","url":null,"abstract":"An analysis is made of the diagnosability and diagnosis problems for a model of a self-diagnosable multiprocessor system where processors compare the results of tasks performed by other processors in the system. A set of criteria is given for determining whether the faulty processors in the system can be diagnosed on the basis of the comparisons, and a polynomial-time algorithm is presented to identify the faulty units of such a system on the basis of the comparison results when the system is known to be diagnosable.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127121913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Probabilistic diagnosis of multiprocessor systems with arbitrary connectivity
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105636
D. Fussell, S. Rangarajan
The authors present probabilistic fault diagnosis algorithms and a comparison-based fault model for homogeneous systems in which the probability of correct diagnosis approaches one when the number of tests conducted on each processor grows slightly faster than log N. For a comparison-based model, this means that each processor has to compare its results on test jobs with a constant number of other processors, where the number of test jobs grows slightly faster than log N. These algorithms do not require the neighborhood of processors to grow and thus can be used on systems with arbitrary processor graphs, provided the in-degree of each processor exceeds a specified value, which in most practical situations is two. Diagnosis decisions are made in a distributed fashion, and the asymptotic performance of the algorithms is considered.
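A hedged sketch of the flavor of such a scheme (the c*log N job count, constant-size neighbor sets, and majority rule follow the abstract's description; the error model and threshold are my assumptions, not the authors' algorithm):

```python
import math, random

def diagnose(n, neighbors, is_faulty, c=4, q=0.7):
    """Majority-vote probabilistic diagnosis: each processor compares the
    results of ~c*log(n) test jobs with a constant-size neighbor set and is
    labeled faulty when a majority of its comparisons disagree. Assumed error
    model: two fault-free processors always agree; any comparison involving a
    faulty processor disagrees with probability q."""
    jobs = max(1, round(c * math.log(n)))
    labels = {}
    for u in range(n):
        disagree = total = 0
        for _ in range(jobs):
            for v in neighbors[u]:
                total += 1
                if is_faulty[u] or is_faulty[v]:
                    disagree += random.random() < q
        labels[u] = disagree > total / 2   # decided locally, per processor
    return labels

# Example: 8 processors in a ring (in-degree two), processors 2 and 5 faulty.
n = 8
neighbors = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}
is_faulty = {u: u in {2, 5} for u in range(n)}
print(diagnose(n, neighbors, is_faulty))
```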
{"title":"Probabilistic diagnosis of multiprocessor systems with arbitrary connectivity","authors":"D. Fussell, S. Rangarajan","doi":"10.1109/FTCS.1989.105636","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105636","url":null,"abstract":"Presents probabilistic fault diagnosis algorithms and a comparison-based fault model for homogeneous systems where the probability of correct diagnosis approaches one when the number of tests conducted on each processor grows slightly faster than log N. For a comparison-based model, this means that each processor has to compare its result on test jobs with a constant number of other processors where the number of test jobs grows slightly faster than log N. These algorithms do not require the neighborhood of processors to grow and thus could be used on systems with arbitrary processor graphs with the in-degree of each processor being greater than a specified value, which in most practical situations is two. Also, diagnosis decisions are made in a distributed fashion. The asymptotic performance of the algorithm is considered.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121714812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ultrahigh reliability estimates for systems exhibiting globally time-dependent failure processes
Pub Date: 1989-06-21  DOI: 10.1109/FTCS.1989.105559
R. Geist, M. Smotherman, Michael Brown
A long-standing conjecture, that application of the instantaneous coverage technique to the time-dependent failure rate case also provides conservative reliability estimates, is resolved negatively. In particular, two examples are provided which show that even monotonic failure rates can lead to overly optimistic estimates. An alternative extension of the instantaneous coverage technique, consistent with the constant-rate approach, is then offered. The novel approach is shown to provide conservative estimates in the time-dependent case, provided fault-handling and recovery time distributions can be described by step functions.
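For reference, the standard relation being extended (a textbook identity, not a result of the paper): with a time-dependent failure rate lambda(t), mission reliability is the exponentiated cumulative hazard, which reduces to the familiar exponential form in the constant-rate case the original technique assumed:

```latex
R(t) = \exp\!\Big(-\!\int_0^t \lambda(u)\,du\Big)
\quad\longrightarrow\quad
R(t) = e^{-\lambda t} \quad \text{when } \lambda(u) \equiv \lambda
```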
{"title":"Ultrahigh reliability estimates for systems exhibiting globally time-dependent failure processes","authors":"R. Geist, M. Smotherman, Michael Brown","doi":"10.1109/FTCS.1989.105559","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105559","url":null,"abstract":"A long-standing conjecture, that application of the instantaneous coverage technique to the time-dependent failure rate case also provides conservative reliability estimates, is resolved negatively. In particular, two examples are provided which show that even monotonic failure rates can lead to overly optimistic estimates. An alternative extension of the instantaneous coverage technique, consistent with the constant-rate approach, is then offered. The novel approach is shown to provide conservative estimates in the time-dependent case, provided fault-handling and recovery time distributions can be described by step functions.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115026977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}