Using physical and simulated fault injection to evaluate error detection mechanisms
C. Constantinescu
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816228
Effective error detection is paramount for building highly dependable computing systems. A new methodology, based on physical and simulated fault injection, is developed for evaluating error detection mechanisms. Our approach consists of two steps. First, transient faults are physically injected at the IC pin level of a prototype server. Experiments are carried out in a three-dimensional space of events, with the location, time of occurrence, and duration of each fault selected at random. Improved detection circuitry is devised to decrease signal sensitivity to transients. Second, simulated fault injection is performed to assess the effectiveness of the new detection mechanisms without resorting to expensive silicon implementations. Physical fault injection experiments carried out on the server and simulated fault injection performed on a protocol checker are presented. Detection effectiveness is measured by the error detection coverage, defined as the conditional probability that an error is detected given that an error occurs. Fault injection reveals that the coverage probability is a function of fault duration. The protocol checker significantly improves error detection; however, further research is required to increase the detection coverage of errors induced by short transient faults.
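To make the coverage metric concrete, here is a minimal sketch of the arithmetic behind such a campaign (the probabilities, durations, and the helper run_campaign are invented for illustration and are not the authors' tooling): faults are injected at random, injections that cause no error are discarded, and coverage is estimated separately for each fault duration.

```python
import random
from collections import defaultdict

def run_campaign(n_injections=10_000, durations=(50, 100, 500, 1000)):
    """Estimate detection coverage per fault duration from simulated injections.

    Coverage is the conditional probability P(detected | error occurred),
    so injections that cause no error are excluded from the denominator.
    """
    stats = defaultdict(lambda: {"errors": 0, "detected": 0})
    for _ in range(n_injections):
        duration_ns = random.choice(durations)   # fault duration in ns (assumed values)
        # Invented behaviour: longer transients are more likely both to cause
        # an error and to be caught by the detection circuitry.
        caused_error = random.random() < min(1.0, duration_ns / 800)
        if not caused_error:
            continue
        detected = random.random() < min(0.99, 0.4 + duration_ns / 2000)
        stats[duration_ns]["errors"] += 1
        stats[duration_ns]["detected"] += int(detected)
    return {d: s["detected"] / s["errors"]
            for d, s in sorted(stats.items()) if s["errors"]}

if __name__ == "__main__":
    for duration, coverage in run_campaign().items():
        print(f"{duration:>5} ns transient: estimated coverage = {coverage:.3f}")
```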
{"title":"Using physical and simulated fault injection to evaluate error detection mechanisms","authors":"C. Constantinescu","doi":"10.1109/PRDC.1999.816228","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816228","url":null,"abstract":"Effective error detection is paramount for building highly dependable computing systems. A new methodology, based on physical and simulated fault injection, is developed for evaluating error detection mechanisms. Our approach consists of two steps. First, transient faults are physically injected at the IC pin level of a prototype server. Experiments are carried our in a three dimensional space of events, the location, time of occurrence and duration of the fault being randomly selected. Improved detection circuitry is devised for decreasing signal sensitivity to transients. Second, simulated fault injection is performed to asses the effectiveness of the new detection mechanisms, without using expensive silicon implementations. Physical fault injection experiments, carried out on the server, and simulated fault injection, performed on protocol checker, are presented. Detection effectiveness is measured by the error detection coverage, defined as the conditional probability that an error is detected given that an error occurs. Fault injection reveals that coverage probability is a function of fault duration. The protocol checker significantly improves error detection. Although, further research is required to increase detection coverage of the errors induced by short transient faults.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127729930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An architecture-based software reliability model
Wen-li Wang, Ye Wu, Mei-Hwa Chen
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816223
We present an analytical model for estimating architecture-based software reliability from the reliability of each component, the operational profile, and the architecture of the software. Our approach uses Markov chain properties and architecture-view-to-state-view transformations to perform reliability analysis on heterogeneous software architectures. We demonstrate how this analytical model can be used to estimate the reliability of a heterogeneous architecture consisting of batch-sequential/pipeline, call-and-return, parallel/pipe-filter, and fault-tolerance styles. In addition, we conduct an experiment on a system built from three architectural styles to validate this heterogeneous software reliability model.
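A minimal sketch of a Markov-chain reliability computation in this spirit is given below (a Cheung-style model; the component reliabilities and transition probabilities are assumed for illustration, not taken from the paper). Heterogeneous styles would enter through the structure of the transition matrix.

```python
import numpy as np

# Operational profile: transition probabilities between components (assumed).
P = np.array([
    [0.0, 0.6, 0.4],   # component 1 hands control to 2 or 3
    [0.0, 0.0, 1.0],   # component 2 hands control to 3
    [0.0, 0.0, 0.0],   # component 3 is terminal
])
R = np.array([0.99, 0.98, 0.995])   # per-component reliabilities (assumed)

def system_reliability(P, R, start=0, final=2):
    """Probability of a correct execution from `start` to `final`."""
    Q = np.diag(R) @ P                     # each transition counts only if the
                                           # source component executed correctly
    S = np.linalg.inv(np.eye(len(R)) - Q)  # expected correct visits: (I - Q)^-1
    return S[start, final] * R[final]      # reach the final component, then run it

print(f"estimated system reliability: {system_reliability(P, R):.4f}")
```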
{"title":"An architecture-based software reliability model","authors":"Wen-li Wang, Ye Wu, Mei-Hwa Chen","doi":"10.1109/PRDC.1999.816223","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816223","url":null,"abstract":"We present an analytical model for estimating architecture-based software reliability, according to the reliability of each component, the operational profile, and the architecture of software. Our approach is based on Markov chain properties and architecture view to state view transformations to perform reliability analysis on heterogeneous software architectures. We demonstrate how this analytical model can be utilized to estimate the reliability of a heterogeneous architecture consisting of batch-sequential/pipeline, call-and-return, parallel/pipe-filters, and fault tolerance styles. In addition, we conduct an experiment on a system embedded with three architectural styles to validate this heterogeneous software reliability model.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115626530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of message logging protocols for NOWs with MPI
Shahnaz Afroz, H. Youn, Dongman Lee
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816236
Among the various systems developed for parallel and distributed computing, networks of workstations (NOWs) based on the Message Passing Interface (MPI) have been recognized as an efficient platform. In this paper, we implement and compare two important message logging protocols, pessimistic and optimistic, for a NOW employing MPI. An experiment reveals that the total execution time is not significantly affected by the number of failures, while the performance of the optimistic protocol is more influenced by the number of failures than the pessimistic protocol is. Also, the former is more efficient than the latter for a reasonable number of failure points.
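The trade-off can be illustrated with a back-of-the-envelope cost model (every constant below is an assumption chosen for illustration, not a measurement from the paper): pessimistic logging pays a small synchronous cost on every message, while optimistic logging is cheaper per message but loses more work whenever a failure forces a rollback.

```python
def total_time(base_time, n_messages, n_failures, protocol):
    """Crude execution-time model for the two logging protocols (assumed costs)."""
    if protocol == "pessimistic":
        log_overhead_per_msg = 0.002     # synchronous log before delivery (s)
        rollback_per_failure = 0.5       # restart close to the failure point (s)
    elif protocol == "optimistic":
        log_overhead_per_msg = 0.0005    # asynchronous logging is cheaper (s)
        rollback_per_failure = 3.0       # may roll back further, losing work (s)
    else:
        raise ValueError(protocol)
    return (base_time
            + n_messages * log_overhead_per_msg
            + n_failures * rollback_per_failure)

for failures in (0, 2, 5, 10):
    p = total_time(100.0, 20_000, failures, "pessimistic")
    o = total_time(100.0, 20_000, failures, "optimistic")
    print(f"{failures:>2} failures: pessimistic {p:6.1f} s, optimistic {o:6.1f} s")
```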
{"title":"Performance of message logging protocols for NOWs with MPI","authors":"Shahnaz Afroz, H. Youn, Dongman Lee","doi":"10.1109/PRDC.1999.816236","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816236","url":null,"abstract":"Among the various systems developed for parallel and distributed computing, networks of workstations (NOWs) based on the Message Passing Interface (MPI) have been recognized as an efficient platform. In this paper, we implement and compare two important message logging protocols, pessimistic and optimistic, for a NOW employing MPI. An experiment reveals that the total execution time is not significantly affected by the number of failures, while the performance of the optimistic protocol is more influenced by the number of failures than the pessimistic protocol is. Also, the former is more efficient than the latter for a reasonable number of failure points.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130750349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing-resource allocation for redundant software systems
Bo Yang, M. Xie
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816215
For many safety-critical systems, redundancy is the only acceptable method for achieving high operational reliability, as individual modules can hardly be certified to have reached that level. When limited resources are available for testing a redundant software system, it is important to allocate the testing time efficiently so that the maximum reliability of the complete system is achieved. In this paper, this problem is investigated in detail. A general formulation is presented, and a specific case is used to illustrate the procedure. The case where individual module reliability requirements are given is also considered.
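A hedged sketch of the kind of formulation involved follows; the exponential reliability-growth model, the parameter values, and the greedy heuristic are assumptions made for illustration, not the paper's method. Each module i is assumed to fail with probability a_i * exp(-b_i * t_i) after t_i units of testing, and the parallel (redundant) system fails only if every module fails.

```python
import math

def system_reliability(alloc, a, b):
    """Redundant system: it fails only if every module fails."""
    unreliability = 1.0
    for t, ai, bi in zip(alloc, a, b):
        unreliability *= ai * math.exp(-bi * t)   # module failure probability
    return 1.0 - unreliability

def greedy_allocate(total_time, a, b, step=1.0):
    """Give each slice of testing time to the module whose extra testing
    raises system reliability the most (a simple heuristic, not optimal)."""
    alloc = [0.0] * len(a)
    spent = 0.0
    while spent < total_time:
        base = system_reliability(alloc, a, b)
        gains = []
        for i in range(len(a)):
            trial = alloc[:]
            trial[i] += step
            gains.append(system_reliability(trial, a, b) - base)
        alloc[gains.index(max(gains))] += step
        spent += step
    return alloc

# Assumed initial failure probabilities a_i and growth rates b_i for 3 modules.
a, b = [0.4, 0.3, 0.5], [0.05, 0.08, 0.03]
alloc = greedy_allocate(100.0, a, b)
print("testing-time allocation:", [round(t, 1) for t in alloc])
print("system reliability:", round(system_reliability(alloc, a, b), 6))
```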
{"title":"Testing-resource allocation for redundant software systems","authors":"Bo Yang, M. Xie","doi":"10.1109/PRDC.1999.816215","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816215","url":null,"abstract":"For many safety critical systems, redundancy is the only acceptable method to achieve high operational reliability as individual modules can hardly be certified to have reached that level. When limited resources are available in the testing of a redundant software system, it is important to allocate the testing-time efficiently so that the maximum reliability of the complete system is achieved. In this paper, this problem is investigated in detail. A general formulation is presented and a specific case is used to illustrate the procedure. The case where individual module reliability requirements are given is also considered.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132076369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A fault-tolerant data communication setup to improve reliability and performance for Internet based distributed applications
Allan K. Y. Wong, T. Dillon
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816238
The proposed fault-tolerant data communication setup has two main features: a consecutive transmission scheme that improves the reliability of message transmission, and an adaptive buffer management scheme that prevents message losses due to buffer overflow. These two features together reduce message retransmissions and produce better channel reliability and system performance. Simulation data confirm that the adaptive buffer management scheme is indeed an effective reliability measure to prevent data overflow.
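The buffer-management idea can be sketched as a toy class (the thresholds, growth factor, and class name are invented for illustration, not the authors' implementation): the receive buffer grows when occupancy approaches capacity, so a burst of arrivals is absorbed rather than dropped and retransmitted.

```python
class AdaptiveBuffer:
    """Receive buffer that enlarges itself before it overflows (toy model)."""

    def __init__(self, capacity=64, high_watermark=0.8, growth_factor=2.0):
        self.capacity = capacity
        self.high_watermark = high_watermark
        self.growth_factor = growth_factor
        self.queue = []

    def enqueue(self, msg):
        # Grow when occupancy crosses the high watermark, so the burst is
        # absorbed instead of causing overflow.
        if len(self.queue) >= self.high_watermark * self.capacity:
            self.capacity = int(self.capacity * self.growth_factor)
        if len(self.queue) >= self.capacity:
            return False          # overflow: the message would be lost
        self.queue.append(msg)
        return True

    def dequeue(self):
        return self.queue.pop(0) if self.queue else None

buf = AdaptiveBuffer()
dropped = sum(not buf.enqueue(i) for i in range(500))   # a 500-message burst
print("messages dropped during the burst:", dropped)     # 0 with these settings
```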
{"title":"A fault-tolerant data communication setup to improve reliability and performance for Internet based distributed applications","authors":"Allan K. Y. Wong, T. Dillon","doi":"10.1109/PRDC.1999.816238","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816238","url":null,"abstract":"The proposed fault-tolerant data communication setup has two main features: a consecutive transmission scheme that improves the reliability of message transmission, and an adaptive buffer management scheme that prevents message losses due to buffer overflow. These two features together reduce message retransmissions and produce better channel reliability and system performance. Simulation data confirm that the adaptive buffer management scheme is indeed an effective reliability measure to prevent data overflow.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125089740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Networked Windows NT system field failure data analysis
Jun Xu, Z. Kalbarczyk, R. Iyer
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816227
This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from the NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%); (2) recovery from application software failures is usually quick; (3) in many cases, more than one reboot is required to recover from a failure; (4) the average availability of an individual server is over 99%; (5) there is a strong indication of error dependency or error propagation across the network; (6) most reboots (58%) are unclassified, indicating the need for better logging techniques; and (7) maintenance and configuration contribute 24% of system downtime.
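Availability and downtime-breakdown figures of this kind follow directly from the reboot records. The fragment below shows the arithmetic on a few made-up log entries (the events, causes, and measurement window are illustrative, not the paper's data).

```python
from datetime import datetime, timedelta

# (outage start, recovery end, cause) -- illustrative entries only
events = [
    (datetime(1999, 1, 10, 3, 0),  datetime(1999, 1, 10, 3, 45), "system software"),
    (datetime(1999, 2, 2, 14, 0),  datetime(1999, 2, 2, 16, 0),  "hardware"),
    (datetime(1999, 3, 15, 1, 0),  datetime(1999, 3, 15, 1, 20), "maintenance"),
]
period = timedelta(days=120)                      # four-month measurement window

downtime = sum((end - start for start, end, _ in events), timedelta())
availability = 1 - downtime / period
print(f"availability: {availability:.5f}")        # well above 0.99 here

by_cause = {}
for start, end, cause in events:
    by_cause[cause] = by_cause.get(cause, timedelta()) + (end - start)
for cause, d in by_cause.items():
    print(f"{cause:>16}: {d / downtime:.0%} of downtime")
```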
{"title":"Networked Windows NT system field failure data analysis","authors":"Jun Xu, Z. Kalbarczyk, R. Iyer","doi":"10.1109/PRDC.1999.816227","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816227","url":null,"abstract":"This paper presents a measurement-based dependability study of a Networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contains only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures are usually quick, (3) in many cases, more than one reboots are required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128039400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new placement algorithm dedicated to parallel computers: bases and application
F. Clermidy, T. Collette, M. Nicolaidis
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816235
One way to improve reliability in parallel computers is to add spare processors and interconnections to the functional structure so that faulty processors can be replaced while preserving the network structure. This approach is called structural fault tolerance (SFT). Highly integrated parallel computers are one way to implement a parallel structure: the hardware is then composed of many elementary blocks, such as ASICs or multi-chip modules (MCMs), each containing many processors. We show that former SFT methods fail to combine the different features, constraints, and requirements of such structures. This paper therefore introduces a new reconfiguration approach dedicated to highly integrated parallel computers.
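For readers unfamiliar with structural fault tolerance, the toy function below shows the basic effect a reconfiguration provides: logical processors are remapped onto physical ones around a faulty element using one spare per row. This is a deliberately simplified row-shifting scheme, not the placement algorithm proposed in the paper.

```python
def reconfigure_row(row_width, faulty_index):
    """Map logical processors 0..row_width-1 onto a physical row that has one
    spare at the end, skipping the faulty physical element."""
    physical = [p for p in range(row_width + 1) if p != faulty_index]
    return {logical: physical[logical] for logical in range(row_width)}

# With 4 logical processors, a spare at position 4, and a fault at position 2:
print(reconfigure_row(4, faulty_index=2))   # {0: 0, 1: 1, 2: 3, 3: 4}
```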
{"title":"A new placement algorithm dedicated to parallel computers: bases and application","authors":"F. Clermidy, T. Collette, M. Nicolaidis","doi":"10.1109/PRDC.1999.816235","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816235","url":null,"abstract":"One way to improve reliability in parallel computers consists of adding supplementary processors and interconnections to the functional structure in order to replace faulty processors with respect to the network structure. This approach is named structural fault tolerance (SFT). Very integrated parallel computers are one way to implement a parallel structure. The material structure is then composed of many elementary blocks, such as ASICs or multi-chip modules (MCMs), each containing many processors. We show that former SFT methods fail in combining the different features, constraints and requirements of such structures. Thus, this paper introduces a new reconfiguration approach that is dedicated to very integrated parallel computers.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132547669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measurement and modeling of burst packet losses in Internet end-to-end communications
M. Arai, Atsushi Chiba, K. Iwasaki
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816237
We have measured the packet loss ratio, its time dependency, and the frequency of burst packet losses in Internet end-to-end communications. To do this, we developed a tool that sends and receives UDP (User Datagram Protocol) packets. Our measurements showed that long burst losses are more likely when the packet loss ratio is high. We then examined two models for calculating the burst packet loss, an independent loss model and a Markov-chain model, to see whether they explain the packet loss characteristics we measured. They did not, so we developed a sine model, in which the packet loss probability depends on the time of day. Theoretical analysis and simulations showed that this model explains the characteristics of the burst packet losses that we measured.
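The two classical models the authors examined, and the time-of-day dependence behind their sine model, can be sketched as follows (all parameters are assumptions chosen for illustration): independent losses draw each packet's fate separately, a two-state Markov (Gilbert-style) chain produces longer bursts at the same average loss ratio, and the sine variant modulates the loss probability over the day.

```python
import math
import random

def independent_losses(n, p):
    """Each packet is lost independently with probability p."""
    return [random.random() < p for _ in range(n)]

def gilbert_losses(n, p_good_to_bad=0.01, p_bad_to_good=0.3):
    """Two-state Markov model: packets are lost while in the bad state."""
    bad, trace = False, []
    for _ in range(n):
        if bad:
            bad = random.random() >= p_bad_to_good   # stay bad with prob 1 - p_bad_to_good
        else:
            bad = random.random() < p_good_to_bad
        trace.append(bad)
    return trace

def sine_losses(n, base=0.02, amplitude=0.015, period_s=86_400, packets_per_s=100):
    """Loss probability varies sinusoidally with (simulated) time of day."""
    trace = []
    for i in range(n):
        t = i / packets_per_s
        p = base + amplitude * math.sin(2 * math.pi * t / period_s)
        trace.append(random.random() < p)
    return trace

def mean_burst_length(trace):
    """Average length of consecutive-loss runs in a loss trace."""
    bursts, run = [], 0
    for lost in trace:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0

for name, trace in [("independent", independent_losses(100_000, 0.032)),
                    ("Markov", gilbert_losses(100_000)),
                    ("sine", sine_losses(100_000))]:
    print(f"{name:>11}: loss ratio {sum(trace) / len(trace):.3f}, "
          f"mean burst length {mean_burst_length(trace):.2f}")
```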
{"title":"Measurement and modeling of burst packet losses in Internet end-to-end communications","authors":"M. Arai, Atsushi Chiba, K. Iwasaki","doi":"10.1109/PRDC.1999.816237","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816237","url":null,"abstract":"We have measured the packet loss ratio, its time dependency, and the frequency of burst packet losses in Internet end-to-end communications. To do this, we developed a tool that sends and receives UDP (User Datagram Protocol) packets. Our measurements showed that long burst losses are more likely when the packet loss ratio is high. We then examined two models for calculating the burst packet loss, an independent loss model and a Markov-chain model, to see whether they explain the packet loss characteristics we measured. They did not, so we developed a sine model, in which the packet loss probability depends on the time of day. Theoretical analysis and simulations showed that this model explains the characteristics of the burst packet losses that we measured.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122037809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining methods for the analysis of a fault-tolerant system
Hui Shi, J. Peleska, M. Kouvaras
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816222
This paper presents experiences gained from the verification of a large-scale, real-world embedded system by means of formal methods. This industrial verification project was performed for a fault-tolerant system designed and implemented by DaimlerChrysler Aerospace for the International Space Station (ISS). The verification addressed various aspects of system correctness, such as deadlock and livelock analysis and correct protocol implementation. The approach is based on CSP specifications and uses the model-checking tool FDR, combining methods for development as well as for analysis. It is illustrated by examples and results obtained during the verification of the Byzantine agreement protocol implementation, where a combination of different abstraction methods is required.
{"title":"Combining methods for the analysis of a fault-tolerant system","authors":"Hui Shi, J. Peleska, M. Kouvaras","doi":"10.1109/PRDC.1999.816222","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816222","url":null,"abstract":"This paper presents experiences gained from the verification of a large-scale real-world embedded system by means of formal methods. This industrial verification project was performed for a fault-tolerant system designed and implemented by DaimlerChrysler Aerospace for the International Space Station ISS. The verification involved various aspects of system correctness, like deadlock and livelock analysis, correct protocol implementation, etc. The approach is based on CSP specifications and uses the model-checking tool FDR. It is realized by combining methods for the development as well as for the analysis. It is illustrated by examples and results obtained during the verification of the Byzantine agreement protocol implementation, where the combination of different abstraction methods is required.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125583670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing dependability via parameterized refinement
E. Troubitsyna
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816221
A probabilistic extension of the refinement calculus has been successfully applied to the design of safety-critical systems. The approach rests on a firm mathematical foundation within which reasoning about the correctness and behavior of the system under construction is carried out. The framework also allows us to obtain a quantitative assessment of the attributes of system dependability. We present an extension of our main design technique, refinement: the so-called parameterized refinement. The purpose of the extension is to create a technique that facilitates refining a system in such a way that the dependability of the implementation is maximal. We focus mostly on the reliability aspect. Parameterized refinement addresses the problem of how to build more reliable systems by incorporating statistical information about the controlled environment and the reliabilities of system components into the development process. We illustrate this with a case study: the development of a state monitoring system.
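As a purely illustrative example of the kind of quantitative assessment such a framework can support (the quantities below are assumptions, not taken from the paper): if the monitored environment enters a hazardous state with probability $p$ per cycle, the sensor reports it correctly with probability $r_s$, and the controller then reacts correctly with probability $r_a$, the per-cycle probability of correct behaviour is

$$P_{\mathrm{correct}} = (1 - p) + p \, r_s \, r_a ,$$

and a parameterized refinement would choose implementation parameters that maximize this quantity under the given environment statistics.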
{"title":"Enhancing dependability via parameterized refinement","authors":"E. Troubitsyna","doi":"10.1109/PRDC.1999.816221","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816221","url":null,"abstract":"A probabilistic extension of the refinement calculus has been successfully applied in the design of safety-critical systems. The approach is based on a firm mathematical foundation within which the reasoning about correctness and behavior of the system under construction is carried out. The framework allows us also to obtain a quantitative assessment of the attributes of system dependability. We present an extension of our main design technique-refinement-the so-called parameterized refinement. The purpose of the extension is to create a technique which facilitates refinement of a system in such a way that the dependability of the implementation would be maximal. We mostly focus on the reliability aspect. The parameterized refinement resolves the problem of how to build more reliable systems by incorporating statistical information about a controlled environment and reliabilities of system components in the development process. We illustrate this by a case study-the development of a state monitoring system.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129480343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}