D. Simon, C. Hourtolle, H. Biondi, J. Bernelas, P. Duverneuil, S. Gallet, P. Vielcanet, S. D. Viguerie, F. Gsell, J. Chelotti
The aim of the experiment described was to implement and assess fault-tolerant software within an industrial framework. Another significant aspect was to adapt the classical software engineering life cycle to this type of project. Two complementary techniques are considered: fault avoidance, through the use of a higher-level language and a strict development process; and fault tolerance, through exception handling and techniques based on design diversity, such as N-version programming and recovery blocks. Starting from the specification of an existing spacecraft orbit and attitude control system, a 3-version software system was developed, coded in Ada, and assessed in a fault-tolerant experimental testbed. The authors describe the experiment development and the main study results (on development efforts, observed diversity, and methodology aspects).
{"title":"A software fault tolerance experiment for space applications","authors":"D. Simon, C. Hourtolle, H. Biondi, J. Bernelas, P. Duverneuil, S. Gallet, P. Vielcanet, S. D. Viguerie, F. Gsell, J. Chelotti","doi":"10.1109/FTCS.1990.89363","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89363","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
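As an illustration of the design-diversity idea in the abstract above, the following is a minimal sketch of N-version execution with a majority voter. It is written in Python for brevity rather than Ada, and it is not the experiment's implementation: the function `n_version_execute` and the three toy versions are hypothetical. It assumes that outputs can be compared for exact equality; control software producing floating-point outputs would more likely need a tolerance-based comparison.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on x and return the output a strict majority agrees on.

    A version that raises an exception simply casts no vote.
    Raises RuntimeError when no output is shared by more than half of the versions.
    """
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            continue  # a crashed version contributes nothing to the vote

    if outputs:
        value, count = Counter(outputs).most_common(1)[0]
        # Majority is measured against all versions, not only the survivors.
        if 2 * count > len(versions):
            return value
    raise RuntimeError("no majority agreement among versions")


# Three hypothetical, independently written versions of the same function.
def v1(x):
    return x * x


def v2(x):
    return x ** 2


def v3(x):
    return x * x + 1  # deliberately faulty version


result = n_version_execute([v1, v2, v3], 4)
print(result)
```

With two correct versions out of three, the faulty output is outvoted. If all three versions disagree, the voter raises an error instead of guessing.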
The authors present an implementation model for the programmer-transparent coordination (PTC) scheme that fits well with local-area-network (LAN) based systems equipped with broadcasting channels. The model is a significant improvement over the earlier formulated implementation guidelines for developing LAN-based fault-tolerant distributed computer systems (DCSs). The model uses a highly decentralized broadcasting-based approach to the execution of PTC functions. The result is a significant reduction in the PTC-related message traffic, and the extent of reduction could be drastic in many application environments. Another major element of the model is a three-layer software structure in which distributed cooperating application processes and PTC-related operating system components are incorporated in modular forms amenable to cost-effective concurrent processing.
{"title":"A highly decentralized implementation model for the programmer-transparent coordination (PTC) scheme for cooperative recovery","authors":"K. Kim, J. You","doi":"10.1109/FTCS.1990.89376","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89376","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
An analytical technique for the availability evaluation of multiprocessors using a multistage interconnection network (MIN) is presented. The MIN represents a Butterfly-type connection with a 4×4 switching element (SE). The novelty of this approach is that the complexity of constructing a single-level exact Markov chain (MC) is not required. By use of structural decomposition, the system is divided into three subsystems: processors, memories, and the MIN. Two simple MCs are solved by using a software package, called HARP, to find the probability of i working processing elements (PEs) and j working memory modules (MMs) at time t. A second level of decomposition is then used to find the approximate number of SEs (x) required for connecting the i PEs and j MMs. A third MC is then solved to find the probability that the MIN will provide the necessary communication. The model has been validated through simulation for up to a 256-node configuration, the maximum size available for a commercial MIN-connected multiprocessor.
{"title":"Availability evaluation of MIN-connected multiprocessors using decomposition technique","authors":"C. Das, L. Tien, L. Bhuyan","doi":"10.1109/FTCS.1990.89353","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89353","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
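The decomposition idea in the abstract above can be illustrated with a much simpler stand-in than the paper's Markov chains. The sketch below assumes independent units with constant failure rates and no repair, so the number of working PEs or MMs at time t is binomially distributed, and it treats the MIN as a single fixed probability. All function names, rates, counts, and the MIN probability are hypothetical placeholders. In the paper the MIN term depends on i and j and the subsystem probabilities come from HARP, neither of which this sketch reproduces.

```python
import math


def unit_reliability(failure_rate, t):
    """Probability that one unit with a constant failure rate still works at time t (no repair)."""
    return math.exp(-failure_rate * t)


def prob_exactly(n, i, p):
    """Probability that exactly i of n independent units work, each working with probability p."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)


def prob_at_least(n, k, p):
    """Probability that at least k of n independent units work."""
    return sum(prob_exactly(n, i, p) for i in range(k, n + 1))


# Hypothetical numbers, chosen only to make the example concrete.
t = 1000.0                        # mission time in hours
p_pe = unit_reliability(1e-4, t)  # per-processor survival probability
p_mm = unit_reliability(5e-5, t)  # per-memory-module survival probability
p_min_ok = 0.99                   # placeholder for the MIN subsystem

# Product form: the subsystems are evaluated separately and then combined.
system = prob_at_least(16, 12, p_pe) * prob_at_least(16, 12, p_mm) * p_min_ok
print(f"approximate probability the system is operational at t={t}: {system:.4f}")
```

The point of the product form is that three small models replace one large joint model, at the cost of ignoring dependencies between the subsystems.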
The error models which have appeared in the literature are described and compared. The comparison includes an informal discussion and comparison of detectability and correctability results obtainable with the various models. The ideal comparison basis would be errors produced by real faults in real systems. No such data are available, and an experiment to obtain such data would be extremely costly. One particular case can be used: the errors resulting from crashes (partially completed updates of storage structures) are easily determined and are used as the final basis of comparison.
{"title":"Error models for robust storage structures","authors":"David J. Taylor","doi":"10.1109/FTCS.1990.89396","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89396","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
The author presents an approach to the consistent diagnosis of error monitoring observations in a distributed fault-tolerant computing system, even when the faulty source produces arbitrary errors. He describes the online algorithm used in the multicomputer architecture for fault tolerance (MAFT) to diagnose faulty system elements. By the use of syndrome information which categorizes detected errors as either symmetric or asymmetric, bounds for correct diagnosis can be deduced. Finally, an interactive consistency algorithm is employed to guarantee consistent diagnosis in a distributed environment and to provide online verification of all diagnostic units.
{"title":"Identifying the cause of detected errors","authors":"C. Walter","doi":"10.1109/FTCS.1990.89365","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89365","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
Effects of gate-level faults on program behavior are described and used as a basis for fault models at the program level. A simulation model of the IBM RT PC was developed and injected with 18,900 gate-level transient faults. A comparison of the system state of good and faulted runs was made to observe internal propagation of errors, while memory traffic and program flow comparisons detected errors in program behavior. Results show several distinct classes of program-level error behavior, including program flow changes, incorrect memory bus traffic, and undetected but corrupted program state. Additionally, the dependencies of fault location, injection time, and workload on error detection coverage are reported. For the IBM RT PC, the error detection latency was shown to follow a Weibull distribution dependent on the error detection mechanism and the two selected workloads. These results aid in the understanding of the effects of gate-level faults and allow for the generation and validation of new fault models, fault injection methods, and error detection mechanisms.
{"title":"Effects of transient gate-level faults on program behavior","authors":"E. W. Czeck, D. Siewiorek","doi":"10.1109/FTCS.1990.89371","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89371","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
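The abstract above reports that error detection latency followed a Weibull distribution. As a reminder of what that means in practice, the sketch below evaluates the standard two-parameter Weibull cumulative distribution function, F(t) = 1 - exp(-(t/scale)^shape), which gives the fraction of errors detected within latency t. The `shape` and `scale` values used here are made-up placeholders, not the parameters fitted in the paper.

```python
import math


def weibull_cdf(t, shape, scale):
    """Two-parameter Weibull CDF: probability that the latency is at most t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


# Hypothetical parameters; latency measured in arbitrary time units.
shape, scale = 0.7, 100.0
for t in (10.0, 100.0, 1000.0):
    print(f"fraction of errors detected within {t:g} units: {weibull_cdf(t, shape, scale):.3f}")
```

A shape parameter below 1 corresponds to a detection rate that falls over time: many errors are caught quickly, and the remainder linger.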
A modeling approach to investigating the interdependencies between memory faults and system performance is presented. Describing the program behavior by an independent reference model, the author develops a fault occurrence model that depends on workload characteristics such as task sojourn times, the number of page references, and the number of page faults. Determining the probability that a fault is detected at a page test, the author quantifies the workload required for fault handling. Using a queuing network in a stationary analysis, he evaluates the average performance decrease caused by memory faults. The interdependencies between performance and reliability quantities are described by a set of nonlinear equations. An iterative method for evaluating the model is given. The results of some experiments demonstrate that the performance decrease caused by memory errors depends on system workload and operating system characteristics.
{"title":"On the modeling of workload dependent memory faults","authors":"J. Dunkel","doi":"10.1109/FTCS.1990.89388","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89388","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
The authors present an approach aimed at modeling and evaluating the reliability and availability of systems from the knowledge of the reliability growth of their components. First, system behavior is characterized with respect to reliability and availability. The hyperexponential model for reliability and availability growth modeling is introduced and applied to multicomponent systems. The possibility of accounting for future reliability growth when performing evaluations during the design of the system is considered.
{"title":"The transformation approach to the modeling and evaluation of the reliability and availability growth","authors":"J. Laprie, C. Béounes, M. Kaâniche, K. Kanoun","doi":"10.1109/FTCS.1990.89390","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89390","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
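To give a feel for why a hyperexponential form suits reliability growth, the sketch below computes the hazard rate of a two-phase hyperexponential distribution, which is a mixture of two exponentials. This is a general property of that distribution; the abstract does not give the paper's parameterization, so treating this as the paper's model is an assumption. The names `omega`, `rate_high`, and `rate_low` and their values are hypothetical.

```python
import math


def hyperexp_hazard(t, omega, rate_high, rate_low):
    """Hazard rate at time t of a mixture of two exponentials.

    omega is the weight of the component with rate rate_high; (1 - omega)
    is the weight of the component with rate rate_low.
    """
    a = omega * math.exp(-rate_high * t)
    b = (1.0 - omega) * math.exp(-rate_low * t)
    return (rate_high * a + rate_low * b) / (a + b)


# Hypothetical parameters (failures per hour).
omega, rate_high, rate_low = 0.5, 1e-2, 1e-3
for t in (0.0, 100.0, 1000.0, 10000.0):
    print(f"t={t:>8g} h  hazard={hyperexp_hazard(t, omega, rate_high, rate_low):.6f}")
```

The hazard rate starts at the weighted average of the two rates and decreases toward the lower one, which is the qualitative shape expected of a system whose reliability grows and then stabilizes.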
It is proved that there exist allocations that are optimal with respect to reliability. A simple transformation rule that derives an optimal allocation of replicated systems from an allocation of a given nonreplicated system is presented. This transformation preserves performance optimizing properties of the original allocation. Generally, replication gives a large number of processor links. A second transformation rule generates a replicated system with authenticated messages. The reliability of this system is also optimal, with, however, significantly fewer communication links.
{"title":"Static allocation of process replicas in fault tolerant computing systems","authors":"L. J. M. Neiuwenhuis","doi":"10.1109/FTCS.1990.89345","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89345","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}
A class of n-unit multiprocessor systems with O(n log n) interconnecting links is constructed, and a distributed probabilistic fault diagnosis algorithm whose probability of correctness converges to 1 as n → ∞ is proposed. For small probability of unit failure, a distributed diagnosis whose probability also converges to 1 as the size of the system grows is proposed for the hypercube. On the other hand, it is proved that if a class of systems has fewer than kn log n links for a small constant k, the probability of correctness of every fault diagnosis converges to 0 as n → ∞. By combining the probabilistic and the distributed approach, the authors' model of fault diagnosis removes the major drawbacks of the PMC (Preparata-Metze-Chien) model: the assumption of tests with complete fault coverage and the assumption of a fault-free central monitoring unit capable of performing diagnosis.
{"title":"Distributed probabilistic fault diagnosis for multiprocessor systems","authors":"P. Berman, A. Pelc","doi":"10.1109/FTCS.1990.89383","DOIUrl":"https://doi.org/10.1109/FTCS.1990.89383","journal":{"name":"[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium"},"publicationDate":"1990-06-26","publicationTypes":"Journal Article"}