Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105565
K. Echtle
A novel class of agreement protocols suitable for replicated nondeterministic processes is introduced. A reduction in the number of messages and early stopping are achieved by taking distance decisions not after, but during protocol execution. Metric comparison of results is not restricted to numerical applications. Unlike median selection, it covers multidimensional spaces and helps to solve typical problems of distributed systems, e.g., global scheduling, synchronization, sequence agreement, reconfiguration, and elimination of time skew. A so-called pendulum protocol is described in detail.
{"title":"Distance agreement protocols","authors":"K. Echtle","doi":"10.1109/FTCS.1989.105565","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105565","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
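The idea of comparing results by distance rather than by median can be illustrated with a small sketch. This is not the pendulum protocol, whose details the abstract does not give, and unlike the protocols described it decides only after all results are collected. The function name `select_by_distance` and the tolerance `eps` are assumptions made for illustration.

```python
from math import dist
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


def select_by_distance(results: Sequence[Point], eps: float) -> Optional[Point]:
    """Return the first result that lies within eps of a strict majority
    of all results (itself included), or None if no such result exists.

    Hypothetical helper: Euclidean distance stands in for whatever metric
    an application would define on its result space.
    """
    needed = len(results) // 2 + 1  # strict majority
    for candidate in results:
        close = sum(1 for other in results if dist(candidate, other) <= eps)
        if close >= needed:
            return candidate
    return None


# Three replicas report a two-dimensional result; one is far off.
replica_results = [(1.0, 2.0), (1.1, 2.0), (9.0, 9.0)]
chosen = select_by_distance(replica_results, eps=0.5)
```

Because the comparison uses only a distance function, the same sketch works for any number of dimensions, which a component-wise median does not guarantee.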
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105621
S. Davidson, Insup Lee, V. Wolfe
In a large class of hard-real-time control applications, components execute concurrently on distributed nodes and must coordinate, under timing constraints, to perform the control task. As such, they perform a type of atomic commitment. In traditional atomic commitment there are no timing constraints; agreement is eventual. The authors present a definition of timed atomic commitment (TAC) which requires the processes to be functionally consistent, but allows the outcome to include an exceptional state, indicating that faults have caused timing constraints to be violated. The authors also present a high-level language construct that facilitates the use of TAC in distributed real-time programming and discuss its behavior when faults occur.
{"title":"Language constructs for timed atomic commitment","authors":"S. Davidson, Insup Lee, V. Wolfe","doi":"10.1109/FTCS.1989.105621","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105621","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
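The three-way outcome of timed atomic commitment (commit, abort, or an exceptional state when timing constraints are violated) can be sketched as a simple decision rule. This is one possible reading of the abstract, not the authors' language construct: the centralized decision, the function `timed_commit`, and the representation of replies as `(vote, arrival_time)` pairs are all assumptions.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple


class Outcome(Enum):
    COMMIT = "commit"
    ABORT = "abort"
    EXCEPTION = "exception"  # timing constraint violated


# (vote, arrival_time): vote is True when the participant is ready to
# commit; arrival_time is None when no reply was received at all.
Reply = Tuple[bool, Optional[float]]


def timed_commit(replies: Iterable[Reply], deadline: float) -> Outcome:
    """Decide the outcome from simulated replies (hypothetical rule).

    Any missing or late reply yields EXCEPTION; otherwise the outcome is
    COMMIT only if every participant voted to commit.
    """
    replies = list(replies)
    if any(t is None or t > deadline for _, t in replies):
        return Outcome.EXCEPTION
    if all(vote for vote, _ in replies):
        return Outcome.COMMIT
    return Outcome.ABORT
```

With a deadline of 5.0, replies `[(True, 1.0), (True, 2.0)]` give `Outcome.COMMIT`, while a reply that arrives at time 6.0 or never arrives gives `Outcome.EXCEPTION`. In traditional atomic commitment the third outcome does not exist, because the protocol simply waits.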
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105615
N. Saxena, E. McCluskey
A control-flow checking method is proposed. Extended-precision checksum-based control-flow checking is shown to have low error detection latency compared to previously proposed methods. Analytical measures are derived to demonstrate the effectiveness of using extended-precision checksums for control-flow checking. The error detection latency in the extended-precision checksum-based control-flow checking remains relatively constant for both single and multiple sequence errors. In the case of signature-based methods, error detection latency increases linearly with the number of sequence errors. A watchdog assist architecture for control-flow checking in programs is defined. Unlike previously proposed control-flow checking methods, this watchdog assist architecture is well suited for multiprocessor, multiprogramming, and cache-based environments. The Hewlett-Packard Precision Architecture is used as an example to demonstrate the feasibility of watchdog assists.
{"title":"Control-flow checking using watchdog assists and extended-precision checksums","authors":"N. Saxena, E. McCluskey","doi":"10.1109/FTCS.1989.105615","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105615","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
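A toy sketch may help show why extra checksum bits matter for control-flow checking. It assumes that "extended precision" means keeping more sum bits than the word width, and it uses an 8-bit word and made-up instruction words purely for illustration; the names `checksum` and `control_flow_ok` are hypothetical and the watchdog assist hardware is not modeled.

```python
from typing import Sequence

WORD_BITS = 8  # deliberately small word size so the aliasing is easy to see


def checksum(words: Sequence[int], extra_bits: int = 0) -> int:
    """Sum the instruction words, keeping WORD_BITS + extra_bits bits."""
    mask = (1 << (WORD_BITS + extra_bits)) - 1
    return sum(words) & mask


def control_flow_ok(executed: Sequence[int], reference: int,
                    extra_bits: int = 0) -> bool:
    """Compare the run-time checksum of the executed instruction words
    with a reference checksum computed ahead of time."""
    return checksum(executed, extra_bits) == reference


# Made-up instruction words for one block, and a faulty run that skips
# the first two instructions (a sequence error).
expected_path = [0x80, 0x80, 0x01]
faulty_path = [0x01]

# Plain 8-bit checksum: 0x80 + 0x80 overflows to zero, so the error aliases.
missed = control_flow_ok(faulty_path, checksum(expected_path))

# With 8 extra bits the carry is retained and the error is detected.
caught = not control_flow_ok(faulty_path, checksum(expected_path, 8), 8)
```

Here `missed` and `caught` are both `True`: the word-width checksum accepts the faulty path, while the wider checksum rejects it.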
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105544
D. Blough, G. Sullivan, G. Masson
The authors present a general approach to fault diagnosis that is widely applicable and requires only a limited number of connections among units. Each unit in the system forms a private opinion on the status of each of its neighboring units based on duplication of jobs and comparison of job results over time. A diagnosis algorithm that consists of simply taking a majority vote among the neighbors of a unit to determine the status of that unit is then executed. The performance of this simple majority-vote diagnosis algorithm is analyzed using a probabilistic model for the faults in the system. It is shown that with high probability, for systems composed of n units, the algorithm will correctly identify the status of all units when each unit is connected to O(log n) other units. It is also shown that the algorithm works with high probability in a class of systems in which the average number of neighbors of a unit is constant. The results indicate that fault diagnosis can in fact be achieved quite simply in multiprocessor systems containing a low to moderate number of testing conditions.
{"title":"Fault diagnosis for sparsely interconnected multiprocessor systems","authors":"D. Blough, G. Sullivan, G. Masson","doi":"10.1109/FTCS.1989.105544","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105544","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
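The majority-vote step described in this abstract is simple enough to sketch directly. The data layout (`opinions[u][v]` is `True` when unit `u` considers neighbor `v` faulty), the function name `diagnose`, and the choice to resolve ties as fault-free are assumptions; how opinions are formed from job comparisons is not modeled.

```python
from typing import Dict, Hashable

# opinions[u][v] is True when unit u believes its neighbour v is faulty.
Opinions = Dict[Hashable, Dict[Hashable, bool]]


def diagnose(opinions: Opinions) -> Dict[Hashable, bool]:
    """Return, for every unit with at least one tester, True if a strict
    majority of the neighbours holding an opinion on it call it faulty."""
    votes: Dict[Hashable, list] = {}  # unit -> [faulty_votes, total_votes]
    for views in opinions.values():
        for unit, says_faulty in views.items():
            tally = votes.setdefault(unit, [0, 0])
            tally[0] += int(says_faulty)
            tally[1] += 1
    # Ties count as fault-free (an assumption of this sketch).
    return {unit: 2 * faulty > total
            for unit, (faulty, total) in votes.items()}


# Four fully connected units; "d" is faulty and reports arbitrary opinions.
opinions = {
    "a": {"b": False, "c": False, "d": True},
    "b": {"a": False, "c": False, "d": True},
    "c": {"a": False, "b": False, "d": True},
    "d": {"a": True, "b": False, "c": True},
}
status = diagnose(opinions)
```

In this example `status` marks only `"d"` as faulty: its false accusations against `"a"` and `"c"` are outvoted by the two fault-free neighbors of each.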
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105562
J. F. Meyer, K. Muralidhar, W. Sanders
The authors present the results of a detailed performability evaluation of a network using the IEEE 802.4 protocol. In particular, a 30-station IEEE 802.4 token bus network operating in a hostile factory environment is evaluated using stochastic activity networks. Stochastic activity networks, a generalization of stochastic Petri nets, provide a convenient representation for computer networks and are formal enough to permit solution by both analysis and simulation. The evaluation results show (1) that stochastic activity networks are an appropriate model type for evaluating the performability of local-area networks, and (2) that the protocol is extremely tolerant to transient faults such as token losses and noise bursts under moderate network loads.
{"title":"Performability of a token bus network under transient fault conditions","authors":"J. F. Meyer, K. Muralidhar, W. Sanders","doi":"10.1109/FTCS.1989.105562","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105562","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105631
C. Das, Jong Kim
An analytical model is presented for computing the availability of an n-dimensional hypercube. The model computes the probability of j connected working nodes in a hypercube by multiplying two probabilistic terms. The first term is the probability of x connected nodes (x ≥ j) working out of 2^n fully connected nodes. This is obtained from the numerical solution of the well-known machine repairman model, modified to capture imperfect coverage and imprecise repair. The second term, which is the probability of having j connected nodes in a hypercube, is computed from an approximate model of the hypercube. The approximate model, in turn, is based on a decomposition principle, where an n-cube connectivity is computed from a two-cube base model using a recursive equation. The availability model studied in this paper is known as task-based availability, where a system remains operational as long as a task can be executed on the system. Analytical results from n-dimensional cubes are given for various task requirements. The model is validated by comparing the analytical results with those from simulation.
{"title":"An analytical model for computing hypercube availability","authors":"C. Das, Jong Kim","doi":"10.1109/FTCS.1989.105631","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105631","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
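For orientation, the classical machine repairman model that the first term builds on can be solved in a few lines. The sketch below assumes a single repair facility, exponential failure and repair times, and perfect coverage; it does not include the imperfect-coverage and imprecise-repair modifications mentioned in the abstract, nor the hypercube connectivity term. All function and parameter names are hypothetical.

```python
from typing import List


def repairman_steady_state(n_machines: int, failure_rate: float,
                           repair_rate: float) -> List[float]:
    """Steady-state probabilities pi[k] that exactly k machines are down.

    Birth-death chain: k -> k+1 at rate (n_machines - k) * failure_rate,
    k+1 -> k at rate repair_rate (one repair facility).
    """
    rho = failure_rate / repair_rate
    weights = [1.0]
    for k in range(n_machines):
        weights.append(weights[-1] * (n_machines - k) * rho)
    total = sum(weights)
    return [w / total for w in weights]


def prob_at_least_working(n_machines: int, j: int, failure_rate: float,
                          repair_rate: float) -> float:
    """Probability that at least j of the n_machines are working."""
    pi = repairman_steady_state(n_machines, failure_rate, repair_rate)
    return sum(pi[k] for k in range(n_machines - j + 1))


# For an n-dimensional hypercube one would pass n_machines = 2 ** n.
p = prob_at_least_working(n_machines=2, j=1, failure_rate=1.0, repair_rate=1.0)
```

With two machines and equal failure and repair rates, the unnormalized weights are 1, 2, 2, so `p` is 0.6.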
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105588
A. Olson, K. Shin
The authors develop a routing scheme in two steps for a wrapped hexagonal mesh, called HARTS (hexagonal architecture for real-time systems), which ensures the delivery of every message as long as there is a path between its source and destination. The scheme can also detect the nonexistence of a path between a pair of nodes in a finite amount of time. Moreover, the scheme requires each node in HARTS to know only the state (faulty or not) of each of its own links. The performance of the simple routing scheme is simulated for three- and five-dimensional H-meshes while the physical distribution of faulty components is varied. It is shown that a shortest path between the source and the destination of each message is taken with a high probability, and a path, if one exists, is usually found very quickly.
{"title":"Message routing in HARTS with faulty components","authors":"A. Olson, K. Shin","doi":"10.1109/FTCS.1989.105588","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105588","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105632
Vikram V. Karmarkar, J. G. Kuhl
A fail-softness evaluation methodology is presented which is suitable for quantifying the graceful degradation characteristics of local computer networks (LCN) using multiple buses. The approach quantifies degradation of performance due to failure over any given application lifetime and also yields a single figure of merit that can be used for comparison of alternative multiple-bus LCN architectures with specific reliability/cost constraints. The analysis technique models both network service failures and configuration-related delay characteristics. Existing notions of performability analysis and bandwidth availability are used in the modeling process to derive a combined performance/reliability measure. The fail-softness analysis is used to compare several alternative multiple-bus architectures, which use different demand-assignment multiple-access (DAMA) methods. A class of integrated access methodologies that use a single shared token to arbitrate access to all buses is shown to exhibit generally superior performance/reliability characteristics as compared to other alternatives, such as those which use an independent DAMA protocol for each bus.
{"title":"Fail-softness evaluation in multiple-bus local computer networks","authors":"Vikram V. Karmarkar, J. G. Kuhl","doi":"10.1109/FTCS.1989.105632","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105632","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105592
R. Chillarege, N. Bowen
Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
{"title":"Understanding large system failures-a fault injection experiment","authors":"R. Chillarege, N. Bowen","doi":"10.1109/FTCS.1989.105592","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105592","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
Pub Date: 1989-06-21 DOI: 10.1109/FTCS.1989.105578
James M. Purtilo, P. Jalote
A description is given of a system that allows versions to be coded in different programming languages. The system supports both the recovery block scheme and the N-version programming method. It permits fault tolerance to be used for specified modules that could be embedded in a larger program. The system also allows the different versions to be executed on different machines. It has been implemented in C on DEC Vaxes and Sun 3 workstations and operates in a network of Unix-based machines.
{"title":"A system for supporting multi-language versions for software fault tolerance","authors":"James M. Purtilo, P. Jalote","doi":"10.1109/FTCS.1989.105578","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105578","journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers"},"publicationDate":"1989-06-21"}
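The two schemes named in this abstract, recovery blocks and N-version programming, follow well-known patterns that can be sketched in a single process. This is only an illustration of the patterns: the described system runs versions written in different languages on different machines, which is not modeled here, and the helper names `recovery_block` and `n_version` are hypothetical.

```python
from collections import Counter
from math import isqrt
from typing import Any, Callable, Sequence


def recovery_block(alternates: Sequence[Callable[..., Any]],
                   acceptance_test: Callable[[Any], bool], *args: Any) -> Any:
    """Try each alternate in order; return the first result that passes
    the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions: Sequence[Callable[..., Any]], *args: Any) -> Any:
    """Run every version and return the value produced by a strict
    majority of them. Results must be hashable for the vote."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashing version simply casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > len(versions):
            return value
    raise RuntimeError("no majority among versions")


# Three "versions" of an integer square root; the last one is buggy.
def sqrt_a(x: int) -> int:
    return isqrt(x)


def sqrt_b(x: int) -> int:
    return int(x ** 0.5)


def sqrt_buggy(x: int) -> int:
    return x // 2


voted = n_version([sqrt_a, sqrt_b, sqrt_buggy], 16)
recovered = recovery_block([sqrt_buggy, sqrt_a], lambda r: r * r == 16, 16)
```

Both `voted` and `recovered` are 4: the vote outnumbers the buggy version, and the recovery block rejects the buggy primary through the acceptance test before falling back to the alternate.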