Benchmarking the dependability of Windows NT4, 2000 and XP
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311938
A. Kalakech, K. Kanoun, Y. Crouzet, J. Arlat
The aim of this paper is to compare the dependability of three operating systems (Windows NT4, Windows 2000 and Windows XP) with respect to erroneous behavior of the application layer. The results show a similar behavior of the three OSs with respect to robustness and a noticeable difference in OS reaction and restart times. They also show that the application state (mainly the hang and abort states) significantly impacts the restart time for the three OSs.
{"title":"Benchmarking the dependability of Windows NT4, 2000 and XP","authors":"A. Kalakech, K. Kanoun, Y. Crouzet, J. Arlat","doi":"10.1109/DSN.2004.1311938","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311938","url":null,"abstract":"The aim of this paper is to compare the dependability of three operating systems (Windows NT4, Windows 2000 and Windows XP) with respect to erroneous behavior of the application layer. The results show a similar behavior of the three OSs with respect to robustness and a noticeable difference in OS reaction and restart times. They also show that the application state (mainly the hang and abort states) significantly impacts the restart time for the three OSs.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114745424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QoS of timeout-based self-tuned failure detectors: the effects of the communication delay predictor and the safety margin
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311946
Raul Ceretta Nunes, Ingrid Jansch-Pôrto
Unreliable failure detectors have been an important abstraction for building dependable distributed applications on asynchronous distributed systems subject to faults. Their implementations are commonly based on timeouts to ensure algorithm termination. However, for systems built on the Internet, this timeout value is hard to estimate due to traffic variations. Thus, different types of predictors have been used to model this behavior and predict delays. To increase the quality of service (QoS), self-tuned failure detectors dynamically adapt their timeouts to the observed communication delay behavior plus a safety margin. In this paper, we evaluate the QoS of a failure detector for different combinations of communication delay predictors and safety margins. As the results show, to improve the QoS one must consider the predictor and the safety margin as a pair, rather than each one separately. Furthermore, both performance and accuracy requirements should be taken into account when choosing a suitable combination.
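As a rough illustration of the timeout adaptation evaluated here, the sketch below pairs one possible delay predictor (a Jacobson-style exponentially weighted moving average) with a variation-based safety margin. The class name, parameters, and this particular predictor/margin choice are illustrative assumptions, not the specific combinations studied in the paper.

```python
class SelfTunedFailureDetector:
    """Toy timeout-based failure detector: timeout = predicted delay + safety margin.

    The predictor is a Jacobson-style EWMA of heartbeat delays; the safety
    margin is a multiple of the observed delay variation. Both are just one
    possible predictor/margin pair among those such a study could compare.
    """

    def __init__(self, alpha=0.125, beta=0.25, k=4.0, initial_delay=0.1):
        self.alpha = alpha                  # smoothing factor for the mean delay
        self.beta = beta                    # smoothing factor for the delay variation
        self.k = k                          # safety-margin multiplier
        self.srtt = initial_delay           # smoothed delay estimate
        self.rttvar = initial_delay / 2.0   # smoothed delay variation

    def observe(self, delay):
        """Feed a measured heartbeat delay into the predictor."""
        self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - delay)
        self.srtt = (1 - self.alpha) * self.srtt + self.alpha * delay

    def timeout(self):
        """Current timeout: predicted delay plus safety margin."""
        return self.srtt + self.k * self.rttvar

    def suspect(self, elapsed_since_last_heartbeat):
        """Suspect the monitored process if its heartbeat is overdue."""
        return elapsed_since_last_heartbeat > self.timeout()


# Example: measured delays drift upward, so the timeout adapts instead of staying fixed.
fd = SelfTunedFailureDetector()
for d in [0.10, 0.12, 0.30, 0.28, 0.35]:
    fd.observe(d)
print(round(fd.timeout(), 3), fd.suspect(1.0))
```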
{"title":"QoS of timeout-based self-tuned failure detectors: the effects of the communication delay predictor and the safety margin","authors":"Raul Ceretta Nunes, Ingrid Jansch-Pôrto","doi":"10.1109/DSN.2004.1311946","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311946","url":null,"abstract":"Unreliable failure detectors have been an important abstraction to build dependable distributed applications over asynchronous distributed systems subject to faults. Their implementations are commonly based on timeouts to ensure algorithm termination. However, for systems built on the Internet, it is hard to estimate this time value due to traffic variations. Thus, different types of predictors have been used to model this behavior and make predictions of delays. In order to increase the quality of service (QoS), self-tuned failure detectors dynamically adapt their timeouts to the communication delay behavior added of a safety margin. In this paper, we evaluate the QoS of a failure detector for different combinations of communication delay predictors and safety margins. As the results show, to improve the QoS, one must consider the relation between the pair predictor/margin, instead of each one separately. Furthermore, performance and accuracy requirements should be considered for a suitable relationship.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133631634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A bi-criteria scheduling heuristic for distributed embedded systems under reliability and real-time constraints
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311904
I. Assayad, A. Girault, Hamoudi Kalla
Multi-criteria scheduling problems, which involve optimizing more than one criterion, are attracting growing interest. In this paper, we present a new bi-criteria heuristic for scheduling data-flow graphs of operations onto parallel heterogeneous architectures according to two criteria: first, minimizing the schedule length, and second, maximizing the system reliability. Reliability is defined as the probability that none of the system components fails while processing. The proposed algorithm is a list scheduling heuristic based on a bi-criteria compromise function that assigns priorities to the operations to be scheduled and chooses the subset of processors on which they should be scheduled. It uses active replication of operations to improve reliability. If the system reliability or schedule length requirements are not met, a parameter of the compromise function can be changed and the algorithm re-executed. This process is iterated until both requirements are met.
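The snippet below is only a hedged sketch of how a scalarized compromise score could trade schedule length against reliability when choosing a processor subset for an operation. The function form, the theta parameter, and the candidate numbers are assumptions for illustration, not the authors' actual compromise function or heuristic.

```python
import math

def compromise(length_increase, failure_prob, theta):
    """Scalarize the two criteria into one score (lower is better).

    theta in [0, 1] weights schedule length against unreliability; re-running
    the scheduler with a different theta is how the two requirements get
    re-balanced. -log(1 - p) is one simple unreliability term; illustrative only.
    """
    return theta * length_increase + (1 - theta) * (-math.log(1 - failure_prob))

def pick_processor_subset(candidates, theta):
    """candidates: (processor_subset, length_increase, failure_prob) tuples,
    e.g. subsets an operation could be actively replicated on."""
    return min(candidates, key=lambda c: compromise(c[1], c[2], theta))

# Replicating on {P1, P2} costs a bit more time but fails far less often than P1 alone.
candidates = [
    (("P1",), 1.0, 0.05),
    (("P1", "P2"), 1.1, 0.0025),
]
print(pick_processor_subset(candidates, theta=0.2)[0])   # favors reliability: ('P1', 'P2')
print(pick_processor_subset(candidates, theta=0.95)[0])  # favors schedule length: ('P1',)
```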
{"title":"A bi-criteria scheduling heuristic for distributed embedded systems under reliability and real-time constraints","authors":"I. Assayad, A. Girault, Hamoudi Kalla","doi":"10.1109/DSN.2004.1311904","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311904","url":null,"abstract":"Multi-criteria scheduling problems, involving optimization of more than one criterion, are subject to a growing interest. In this paper, we present a new bi-criteria scheduling heuristic for scheduling data-flow graphs of operations onto parallel heterogeneous architectures according to two criteria: first the minimization of the schedule length, and second the maximization of the system reliability. Reliability is defined as the probability that none of the system components will fail while processing. The proposed algorithm is a list scheduling heuristics, based on a bi-criteria compromise function that introduces priority between the operations to be scheduled, and that chooses on what subset of processors they should be scheduled. It uses the active replication of operations to improve the reliability. If the system reliability or the schedule length requirements are not met, then a parameter of the compromise function can be changed and the algorithm re-executed. This process is iterated until both requirements are met.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134138078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model checking a fault-tolerant startup algorithm: from design exploration to exhaustive fault simulation
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311889
W. Steiner, J. Rushby, M. Sorea, H. Pfeifer
The increasing performance of modern model-checking tools offers high potential for the computer-aided design of fault-tolerant algorithms. Instead of relying on human imagination to generate taxing failure scenarios to probe a fault-tolerant algorithm during development, we define the fault behavior of a faulty process at its interfaces to the remaining system and use model checking to automatically examine all possible failure scenarios. We call this approach "exhaustive fault simulation". In this paper we illustrate exhaustive fault simulation using a new startup algorithm for the time-triggered architecture (TTA) and show that this approach is fast enough to be deployed in the design loop. We use the SAL toolset from SRI for our experiments and describe an approach to modeling and analyzing fault-tolerant algorithms that exploits the capabilities of tools such as this.
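The paper carries out this exploration symbolically with SRI's SAL toolset on the TTA startup algorithm; the toy sketch below only illustrates the underlying idea of exhaustive fault simulation by brute force, namely constraining the faulty component solely at its interface and then checking every sequence of values it could emit against an invariant. The function names and the toy voting system are illustrative assumptions, not the authors' SAL models.

```python
from itertools import product

def exhaustive_fault_simulation(step, init_state, faulty_outputs, rounds, invariant):
    """Enumerate every sequence of values a faulty component could emit at its
    interface over `rounds` steps, run the otherwise deterministic system against
    each sequence, and return the scenarios that violate the invariant. A symbolic
    model checker does this exploration far more efficiently; the explicit loop
    here only illustrates what 'all possible failure scenarios' means."""
    violations = []
    for scenario in product(faulty_outputs, repeat=rounds):
        state = init_state
        for value in scenario:
            state = step(state, value)
        if not invariant(state):
            violations.append(scenario)
    return violations


# Toy system: three nodes vote each round; the two correct nodes always vote 1,
# the faulty node votes arbitrarily. Invariant: the majority of all votes is 1.
def step(state, faulty_vote):
    ones, total = state
    return (ones + 2 + faulty_vote, total + 3)   # two correct votes of 1 plus the faulty vote

bad = exhaustive_fault_simulation(step, (0, 0), faulty_outputs=[0, 1],
                                  rounds=3, invariant=lambda s: s[0] * 2 > s[1])
print(bad)   # [] -- no fault scenario breaks the majority with a single faulty voter
```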
{"title":"Model checking a fault-tolerant startup algorithm: from design exploration to exhaustive fault simulation","authors":"W. Steiner, J. Rushby, M. Sorea, H. Pfeifer","doi":"10.1109/DSN.2004.1311889","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311889","url":null,"abstract":"The increasing performance of modern model-checking tools offers high potential for the computer-aided design of fault-tolerant algorithms. Instead of relying on human imagination to generate taxing failure scenarios to probe a fault-tolerant algorithm during development, we define the fault behavior of a faulty process at its interfaces to the remaining system and use model checking to automatically examine all possible failure scenarios. We call this approach \"exhaustive fault simulation\". In this paper we illustrate exhaustive fault simulation using a new startup algorithm for the time-triggered architecture (TTA) and show that this approach is fast enough to be deployed in the design loop. We use the SAL toolset from SRI for our experiments and describe an approach to modeling and analyzing fault-tolerant algorithms that exploits the capabilities of tools such as this.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"398 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131758398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal object state transfer - recovery policies for fault tolerant distributed systems
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311947
P. Katsaros, C. Lazos
Recent developments in the field of object-based fault tolerance and the advent of the first OMG FT-CORBA compliant middleware raise new requirements for the design process of distributed fault-tolerant systems. In this work, we introduce a simulation-based design approach built on the optimum effectiveness of the compared fault tolerance schemes. Each scheme is defined as a set of fault tolerance properties for the objects that compose the system; its optimum effectiveness is determined by the tightest effective checkpoint intervals for the passively replicated objects. Our approach allows mixing miscellaneous fault tolerance policies, in contrast to published analytic models, which are best suited to the evaluation of single-server process replication schemes. Special emphasis has been given to the accuracy of the generated estimates, using an appropriate simulation output analysis procedure. We provide showcase results and compare two characteristic warm passive replication schemes: one with periodic and one with load-dependent object state checkpoints. Finally, a trade-off analysis is applied to determine appropriate checkpoint properties with respect to a specified design goal.
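As a hedged illustration of the checkpoint-interval trade-off that such a simulation-based comparison explores, the sketch below estimates the combined cost of periodic state checkpoints and of the work a backup must replay after a failover. The cost model, parameter values, and function names are assumptions, and the model is far simpler than the FT-CORBA warm passive replication schemes compared in the paper.

```python
import random

def simulate(interval, ckpt_cost, mtbf, horizon, seed=0):
    """Rough estimate of total overhead (periodic checkpoints plus work the
    backup must replay after failovers) for a passively replicated object."""
    rng = random.Random(seed)
    t, overhead, last_ckpt = 0.0, 0.0, 0.0
    next_failure = rng.expovariate(1.0 / mtbf)
    while t < horizon:
        t += interval
        while next_failure < t:
            overhead += next_failure - last_ckpt   # lost work replayed on the backup
            last_ckpt = next_failure
            next_failure += rng.expovariate(1.0 / mtbf)
        overhead += ckpt_cost                      # periodic state transfer to the backup
        last_ckpt = t
    return overhead

# Sweep the interval: short intervals waste time checkpointing, long ones lose more work.
for interval in (0.5, 2.0, 8.0, 32.0):
    print(interval, round(simulate(interval, ckpt_cost=0.1, mtbf=100.0, horizon=10_000.0), 1))
```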
{"title":"Optimal object state transfer - recovery policies for fault tolerant distributed systems","authors":"P. Katsaros, C. Lazos","doi":"10.1109/DSN.2004.1311947","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311947","url":null,"abstract":"Recent developments in the field of object-based fault tolerance and the advent of the first OMG FT-CORBA compliant middleware raise new requirements for the design process of distributed fault-tolerant systems. In this work, we introduce a simulation-based design approach based on the optimum effectiveness of the compared fault tolerance schemes. Each scheme is defined as a set of fault tolerance properties for the objects that compose the system. Its optimum effectiveness is determined by the tightest effective checkpoint intervals, for the passively replicated objects. Our approach allows mixing miscellaneous fault tolerance policies, as opposed to the published analytic models, which are best suited in the evaluation of single-server process replication schemes. Special emphasis has been given to the accuracy of the generated estimates using an appropriate simulation output analysis procedure. We provide showcase results and compare two characteristic warm passive replication schemes: one with periodic and another one with load-dependent object state checkpoints. Finally, a trade-off analysis is applied, for determining appropriate checkpoint properties, in respect to a specified design goal.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"97 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129998517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Min-max checkpoint placement under incomplete failure information
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311943
T. Ozaki, T. Dohi, H. Okamura, N. Kaio
In this paper we consider two kinds of sequential checkpoint placement problems, with infinite and finite time horizons. For these problems, we apply approximation methods based on the variational principle and develop computational algorithms to derive the optimal checkpoint sequence approximately. Next, we focus on the situation where knowledge of system failures is incomplete, i.e., the system failure time distribution is unknown. We develop min-max checkpoint placement methods to determine the optimal checkpoint sequence under this uncertainty about the failure time distribution. In numerical examples, we quantitatively investigate the min-max checkpoint placement methods and discuss their potential applicability in practice.
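A minimal sketch of the min-max idea follows, under the assumption that the unknown failure-time distribution is replaced by a small set of candidate models: each candidate checkpoint sequence is scored by its worst-case expected cost over those models, and the sequence with the smallest worst case is kept. The Monte-Carlo cost evaluation and the numbers are illustrative, not the paper's variational computation.

```python
import random

def expected_cost(checkpoints, sample_failure, ckpt_cost, n=20_000, seed=1):
    """Monte-Carlo estimate of checkpointing overhead plus rollback (lost work
    since the last checkpoint) for a single failure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        f = sample_failure(rng)
        done = [c for c in checkpoints if c <= f]
        total += len(done) * ckpt_cost + (f - (done[-1] if done else 0.0))
    return total / n

def min_max_placement(candidate_sequences, failure_models, ckpt_cost):
    """Pick the sequence whose worst-case expected cost over all candidate
    failure models is smallest (the true distribution is assumed unknown)."""
    return min(candidate_sequences,
               key=lambda seq: max(expected_cost(seq, m, ckpt_cost)
                                   for m in failure_models))

# Two equally plausible failure models, since the true one is unknown.
models = [lambda r: r.expovariate(1 / 50.0), lambda r: r.uniform(0.0, 200.0)]
candidates = [
    [i * 5.0 for i in range(1, 41)],    # dense, evenly spaced checkpoints
    [i * 20.0 for i in range(1, 11)],   # sparse, evenly spaced checkpoints
]
best = min_max_placement(candidates, models, ckpt_cost=1.0)
print(len(best), best[:3])
```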
{"title":"Min-max checkpoint placement under incomplete failure information","authors":"T. Ozaki, T. Dohi, H. Okamura, N. Kaio","doi":"10.1109/DSN.2004.1311943","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311943","url":null,"abstract":"In this paper we consider two kinds of sequential checkpoint placement problems with infinite/finite time horizon. For these problems, we apply the approximation methods based on the variational principle and develop the computation algorithms to derive the optimal checkpoint sequence approximately. Next, we focus on the situation where the knowledge on system failure is incomplete, i.e. the system failure time distribution is unknown. We develop the so-called min-max checkpoint placement methods to determine the optimal checkpoint sequence under the uncertain circumstance in terms of the system failure time distribution. In numerical examples, we investigate quantitatively the min-max checkpoint placement methods, and refer to their potential applicability in practice.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132036766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Markov reward model for reliable synchronous dataflow system design
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311952
Vinu Vijay Kumar, Rashi Verma, J. Lach, J. Dugan
The design of quality digital systems depends on models that accurately evaluate various options in the design space against a set of prioritized metrics. While individual models for evaluating area, performance, reliability, power, etc. are well established, models combining multiple metrics are less mature. This paper introduces a formal methodology for comprehensively analyzing performance, area and reliability in the design of synchronous dataflow systems using a novel Markov Reward Model. A Markov chain system reliability model is constructed for various design options in the presence of possible component failures, and high-level synthesis techniques are used to associate performance and area rewards with each state in the chain. The cumulative reward for a chain is then used to evaluate the corresponding design option with respect to the metrics of interest. Application of the model to a benchmark DSP circuit provides insights into reliable synchronous dataflow system design.
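A hedged sketch of the reward computation itself: a discrete-time Markov chain over failure configurations, a per-state reward (e.g., normalized throughput obtained from high-level synthesis of each configuration), and the expected reward accumulated over a mission. The transition probabilities and reward values below are made-up placeholders, not results from the paper.

```python
import numpy as np

# States: 0 = all units working, 1 = one unit failed (degraded), 2 = system down.
P = np.array([[0.98, 0.02, 0.00],     # per-step transition probabilities (illustrative)
              [0.00, 0.95, 0.05],
              [0.00, 0.00, 1.00]])
reward = np.array([1.00, 0.60, 0.00])  # e.g. normalized throughput of each configuration,
                                       # as a high-level synthesis estimate might provide

def cumulative_expected_reward(P, reward, initial, steps):
    """Expected accumulated reward of a discrete-time Markov reward model."""
    dist, total = np.asarray(initial, dtype=float), 0.0
    for _ in range(steps):
        total += dist @ reward   # reward earned while occupying each state
        dist = dist @ P          # then take one failure/transition step
    return total

# Compare design options by their cumulative reward over a fixed mission length.
print(cumulative_expected_reward(P, reward, initial=[1.0, 0.0, 0.0], steps=200))
```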
{"title":"A Markov reward model for reliable synchronous dataflow system design","authors":"Vinu Vijay Kumar, Rashi Verma, J. Lach, J. Dugan","doi":"10.1109/DSN.2004.1311952","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311952","url":null,"abstract":"The design of quality digital systems depends on models that accurately evaluate various options in the design space against a set of prioritized metrics. While individual models for evaluating area, performance, reliability, power, etc. are well established, models combining multiple metrics are less mature. This paper introduces a formal methodology for comprehensively analyzing performance, area and reliability in the design of synchronous dataflow systems using a novel Markov Reward Model. A Markov chain system reliability model is constructed for various design options in the presence of possible component failures, and high-level synthesis techniques are used to associate performance and area rewards with each state in the chain. The cumulative reward for a chain is then used to evaluate the corresponding design option with respect to the metrics of interest. Application of the model to a benchmark DSP circuit provides insights into reliable synchronous dataflow system design.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132560822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Delivering packets during the routing convergence latency interval through highly connected detours
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311919
E. P. Duarte, Rogério Santini, Jaime Cohen
After a fault occurs and the network topology changes, routing protocols exhibit a convergence latency during which all routers update their tables. During this interval, which in the Internet has been shown to last up to minutes, packets may be lost before reaching their destinations. To allow nodes to keep communicating during the convergence latency interval, we propose the use of alternative routes called detours. In this work we introduce new criteria for selecting detours based on network connectivity. Detours are chosen without knowing which node or link is faulty; highly connected components offer a larger number of distinct paths, increasing the probability that a detour works correctly. Experimental results were obtained by simulation on random Internet-like graphs generated with the Waxman method. The results show that the fault coverage obtained by using the single best detour is up to 90%; when the three best detours are considered, fault coverage reaches up to 98%.
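Below is a small sketch of connectivity-based detour ranking, assuming the networkx library and using edge connectivity (the number of edge-disjoint paths) to the destination as one simple connectivity score. The paper's exact selection criteria and its Waxman-generated topologies are not reproduced here; the toy graph and function name are illustrative.

```python
import networkx as nx

def rank_detours(G, source, dest, next_hop):
    """Rank the source's neighbors (other than the suspect next hop) as detours
    toward `dest`, preferring highly connected candidates: more edge-disjoint
    paths to the destination means more ways around the unknown faulty element."""
    candidates = [n for n in G.neighbors(source) if n not in (next_hop, dest)]
    scored = [(nx.edge_connectivity(G, n, dest), n) for n in candidates]
    return [n for score, n in sorted(scored, reverse=True)]

# Toy topology; the route s -> a -> d stops working, so s looks for a detour.
G = nx.Graph([("s", "a"), ("s", "b"), ("s", "c"), ("a", "d"),
              ("b", "d"), ("b", "e"), ("c", "e"), ("e", "d")])
print(rank_detours(G, source="s", dest="d", next_hop="a"))   # ['b', 'c']
```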
{"title":"Delivering packets during the routing convergence latency interval through highly connected detours","authors":"E. P. Duarte, Rogério Santini, Jaime Cohen","doi":"10.1109/DSN.2004.1311919","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311919","url":null,"abstract":"Routing protocols present a convergence latency for all routers to update their tables after a fault occurs and the network topology changes. During this time interval, which in the Internet has been shown to be of up to minutes, packets may be lost before reaching their destinations. In order to allow nodes to continue communicating during the convergence latency interval, we propose the use of alternative routes called detours. In this work we introduce new criteria for selecting detours based on network connectivity. Detours are chosen without the knowledge of which node or link is faulty. Highly connected components present a larger number of distinct paths, thus increasing the probability that the detour will work correctly. Experimental results were obtained with simulation on random Internet-like graphs generated with the Waxman method. Results show that the fault coverage obtained through the usage of the best detour is up to 90%. When the three best detours are considered, the fault coverage is up to 98%.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133734577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experience with evaluating human-assisted recovery processes
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311910
Aaron B. Brown, Leonard Chung, William Kakes, C. Ling, D. Patterson
We describe an approach to quantitatively evaluating human-assisted failure-recovery tools and processes in the environment of modern Internet- and enterprise-class server systems. Our approach can quantify the dependability impact of a single recovery system, and also enables comparisons between different recovery approaches. The approach combines aspects of dependability benchmarking with human user studies, incorporating human participants in the system evaluations yet still producing typical dependability-related metrics as results. We illustrate our methodology via a case study of a system-wide undo/redo recovery tool for e-mail services; our approach is able to expose the dependability benefits of the tool as well as point out areas where its behavior could use improvement.
{"title":"Experience with evaluating human-assisted recovery processes","authors":"Aaron B. Brown, Leonard Chung, William Kakes, C. Ling, D. Patterson","doi":"10.1109/DSN.2004.1311910","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311910","url":null,"abstract":"We describe an approach to quantitatively evaluating human-assisted failure-recovery tools and processes in the environment of modern Internetand enterprise-class server systems. Our approach can quantify the dependability impact of a single recovery system, and also enables comparisons between different recovery approaches. The approach combines aspects of dependability benchmarking with human user studies, incorporating human participants in the system evaluations yet still producing typical dependability-related metrics as results. We illustrate our methodology via a case study of a system-wide undo/redo recovery tool for e-mail services; our approach is able to expose the dependability benefits of the tool as well as point out areas where its behavior could use improvement.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126660240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Customizing dependability attributes for mobile service platforms
Pub Date: 2004-06-28 | DOI: 10.1109/DSN.2004.1311932
Jun He, M. Hiltunen, R. Schlichting
Mobile service platforms are used to facilitate access to enterprise services such as email, product inventory, or design drawing databases by a wide range of mobile devices using a variety of access protocols. This paper presents a quality of service (QoS) architecture that allows flexible combinations of dependability attributes such as reliability, timeliness, and security to be enforced on a per service request basis. In addition to components that implement the underlying dependability techniques, the architecture includes policy components that evaluate a request's requirements and dynamically determine an appropriate execution strategy. The architecture has been integrated into an experimental version of iMobile, a mobile service platform being developed at AT&T. This paper describes the design and implementation of the architecture, and gives initial experimental results for the iMobile prototype.
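As a hedged sketch of what such a policy component might look like, the snippet below maps a request's dependability attributes to an execution strategy chosen at request time. The attribute names, strategy labels, and data types are illustrative assumptions, not iMobile's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    service: str
    attributes: dict = field(default_factory=dict)   # e.g. {"reliability": "high"}

def choose_strategy(req):
    """Map a request's dependability attributes to an execution strategy.
    Attribute names and strategy labels are illustrative, not iMobile's."""
    strategy = []
    if req.attributes.get("reliability") == "high":
        strategy.append("replicate-request")          # forward to multiple gateways
    elif req.attributes.get("reliability") == "medium":
        strategy.append("retry-on-timeout")
    if "deadline_ms" in req.attributes:               # timeliness requirement
        strategy.append(f"abort-after-{req.attributes['deadline_ms']}ms")
    if req.attributes.get("confidential"):            # security requirement
        strategy.append("encrypt-payload")
    return strategy or ["best-effort"]

print(choose_strategy(Request("email", {"reliability": "high", "deadline_ms": 500})))
```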
{"title":"Customizing dependability attributes for mobile service platforms","authors":"Jun He, M. Hiltunen, R. Schlichting","doi":"10.1109/DSN.2004.1311932","DOIUrl":"https://doi.org/10.1109/DSN.2004.1311932","url":null,"abstract":"Mobile service platforms are used to facilitate access to enterprise services such as email, product inventory, or design drawing databases by a wide range of mobile devices using a variety of access protocols. This paper presents a quality of service (QoS) architecture that allows flexible combinations of dependability attributes such as reliability, timeliness, and security to be enforced on a per service request basis. In addition to components that implement the underlying dependability techniques, the architecture includes policy components that evaluate a request's requirements and dynamically determine an appropriate execution strategy. The architecture has been integrated into an experimental version of iMobile, a mobile service platform being developed at AT&T. This paper describes the design and implementation of the architecture, and gives initial experimental results for the iMobile prototype.","PeriodicalId":436323,"journal":{"name":"International Conference on Dependable Systems and Networks, 2004","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116219317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}