Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466962
Title: Design fault tolerance in operating systems based on a standardization project
Akio Watanabe, K. Sakamura
We are exploring an MLDD (Multi-Layered Design Diversity) architecture that applies the natural design diversity arising from the TRON standardization project to an application program layer, an operating system layer, and a hardware layer. We have devised a backward error recovery mechanism for the operating system layer and, to implement it, have developed a mechanism that automatically exchanges diverse operating system implementations. The paper presents an error-check generation method for the operating system layer. In this method, called SBACCG (Specification-Based Adaptive Consistency Checks Generation), one set of consistency checks is derived from a formal specification and then adapted to each implementation. We experimentally evaluated the effectiveness of our backward error recovery mechanism using the error checks generated through SBACCG.
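The idea of deriving one set of checks from a formal specification and adapting it to each diverse implementation can be sketched as follows; this is a minimal illustration with hypothetical names and a toy semaphore spec, not the authors' SBACCG tooling.

```python
from collections import namedtuple

# Spec-level view of semaphore state; all names here are illustrative
# assumptions, not SBACCG's actual notation.
SpecState = namedtuple("SpecState", "count caller_blocked")

def check_wait_postcondition(pre, post):
    """One consistency check derived from a formal postcondition of
    wait(): decrement the counter when it is positive, else block."""
    if pre.count > 0:
        return post.count == pre.count - 1 and not post.caller_blocked
    return post.count == pre.count and post.caller_blocked

# Per-implementation adapters map each diverse internal representation
# onto the specification's terms, so one spec-derived check serves both.
def view_impl_a(s):   # implementation A keeps a plain counter
    return SpecState(s["count"], s["caller_state"] == "BLOCKED")

def view_impl_b(s):   # implementation B keeps free slots, offset by one
    return SpecState(s["free_slots"] - 1, s["caller_id"] in s["wait_queue"])

pre = view_impl_a({"count": 1, "caller_state": "READY"})
post = view_impl_a({"count": 0, "caller_state": "READY"})
print(check_wait_postcondition(pre, post))   # True: implementation A is consistent
```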
{"title":"Design fault tolerance in operating systems based on a standardization project","authors":"Akio Watanabe, K. Sakamura","doi":"10.1109/FTCS.1995.466962","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466962","url":null,"abstract":"We are exploring an MLDD (Multi-Layered Design Diversity) architecture that applies natural design diversity to an application program layer, an operating system layer, and a hardware layer based on the TRON standardization project. We have devised a backward error recovery mechanism for the operating system layer, and to implement it, we have developed a mechanism that automatically exchanges diverse operating system implementations. The paper presents an error-check generation method for the operating system layer. In this method, which is called SBACCG (Specification-Based Adaptive Consistency Checks Generation), one set of consistency checks is derived from a formal specification, and the checks are adapted to each implementation. We experimentally evaluated the effectiveness of our backward error recovery mechanism that uses the error checks generated through SBACCG.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115706006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466985
Title: Process allocation for load distribution in fault-tolerant multicomputers
Jong Kim, Heejo Lee, Sunggu Lee
In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load both before and after faults begin to degrade the performance of the system. To tolerate a single fault, each process (the primary process) is duplicated, i.e. given a backup process. The backup process executes on a different processor from the primary, checkpointing the primary process and recovering it if the primary fails due to the occurrence of a fault. We first formalize the load-balancing process allocation problem and show that it is NP-hard. Next, we propose a new heuristic process allocation method and analyze its performance. Simulations compare the proposed method with a process allocation method that does not take into account the different load characteristics of primary and backup processes. While both methods perform well before a fault occurs in a primary process, only the proposed method maintains a balanced load after such a fault.
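As an illustration of this problem setting, a greedy heuristic in the same spirit might weight primary and backup loads differently and keep the two copies on distinct processors. The sketch below uses assumed load weights and is not the paper's algorithm.

```python
# Assumed relative costs: a primary runs at full load, a backup only
# absorbs checkpointing traffic until its primary fails.
PRIMARY_LOAD = 1.0
BACKUP_LOAD = 0.2

def allocate(num_procs, processes):
    load = [0.0] * num_procs
    placement = {}
    for p in processes:
        # place the primary on the least-loaded processor
        prim = min(range(num_procs), key=lambda i: load[i])
        load[prim] += PRIMARY_LOAD
        # place the backup on the least-loaded *other* processor, so a
        # single processor fault cannot take out both copies
        back = min((i for i in range(num_procs) if i != prim),
                   key=lambda i: load[i])
        load[back] += BACKUP_LOAD
        placement[p] = (prim, back)
    return placement, load

print(allocate(4, ["p1", "p2", "p3", "p4", "p5"]))
```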
{"title":"Process allocation for load distribution in fault-tolerant multicomputers","authors":"Jong Kim, Heejo Lee, Sunggu Lee","doi":"10.1109/FTCS.1995.466985","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466985","url":null,"abstract":"In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load before as well as after faults start to degrade the performance of the system. In order to be able to tolerate a single fault, each process (primary process) is duplicated (i.e. has a backup process). The backup process executes on a different processor from the primary, checkpointing the primary process and recovering the process if the primary process fails due to the occurrence of a fault. In this paper, we first formalize the problem of load-balancing process allocation and show that it is an NP-hard problem. Next, we propose a new heuristic process allocation method and analyze the performance of the proposed allocation method. Simulations are used to compare the proposed method with a process allocation method that does not take into account the different load characteristics of the primary and backup processes. While both methods perform well before the occurrence of a fault in a primary process, only the proposed method maintains a balanced load after the occurrence of such a fault.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121389380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466951
Title: Design verification of a super-scalar RISC processor
Babu Turumella, Aiman Kabakibo, Manjunath Bogadi, Karakunakara Menon, Shaleah Thusoo, Long Nguyen, N. Saxena, Michael Chow
The paper provides an overview of the design verification methodology used in the development of HaL's Sparc64 processor, an activity spanning approximately two and a half years of design development. Objectives and challenges are discussed and the verification methodology is described, including monitoring mechanisms that give high observability of internal design states, novel features that increase simulation speed, and tools for automatic result checking. Also presented, for the first time, is an analysis of the design defects discovered during the verification process; such an analysis is useful in augmenting verification programs to target common design defects.
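To make "automatic result checking" concrete, one common pattern replays each retired instruction on a golden reference model and flags the first mismatch. The sketch below illustrates that general pattern with a toy two-instruction ISA and hypothetical names; it is not HaL's tooling.

```python
class GoldenModel:
    """Trivial architectural reference model with eight registers."""
    def __init__(self):
        self.regs = [0] * 8
    def step(self, op, dst, a, b):
        if op == "add":
            self.regs[dst] = self.regs[a] + self.regs[b]
        elif op == "sub":
            self.regs[dst] = self.regs[a] - self.regs[b]
        return self.regs[dst]

def check_trace(dut_trace):
    """Compare each simulated result against the golden model."""
    ref = GoldenModel()
    for n, (op, dst, a, b, dut_result) in enumerate(dut_trace):
        if ref.step(op, dst, a, b) != dut_result:
            return f"mismatch at instruction {n}"
    return "trace OK"

# Registers start at 0, so the second result below (7) is a planted defect.
print(check_trace([("add", 1, 0, 0, 0), ("sub", 2, 1, 0, 7)]))
```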
{"title":"Design verification of a super-scalar RISC processor","authors":"Babu Turumella, Aiman Kabakibo, Manjunath Bogadi, Karakunakara Menon, Shaleah Thusoo, Long Nguyen, N. Saxena, Michael Chow","doi":"10.1109/FTCS.1995.466951","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466951","url":null,"abstract":"The paper provides an overview of the design verification methodology for HaL's Sparc64 processor development. This activity covered approximately two and a half years of design development time. Objectives and challenges are discussed and the verification methodology is described. Monitoring mechanisms that give high observability to internal design states, novel features that increase the simulation speed, and tools for automatic result checking are described. Also presented for the first time, is an analysis of the design defects discovered during the verification process. Such an analysis is useful in augmenting verification programs to target common design defects.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"24 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113964849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466963
Title: Completely asynchronous optimistic recovery with minimal rollbacks
Sean W. Smith, David B. Johnson, J. D. Tygar
Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols have either required synchronization during recovery or permitted a failure at one process to trigger an exponential number of process rollbacks. We present an optimistic rollback recovery protocol that provides completely asynchronous recovery while guaranteeing that each process rolls back at most once in response to any failure. The protocol is based on comparing timestamp vectors across multiple levels of partial order time.
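The core dependency test can be sketched with ordinary vector timestamps; this is a simplified illustration, since the paper's protocol maintains such vectors at multiple levels of partial order time.

```python
def dominates(v, w):
    """True if vector timestamp v happened before (or equals) w,
    i.e. v is componentwise <= w."""
    return all(vi <= wi for vi, wi in zip(v, w))

def must_roll_back(state_vector, rolled_back_vector):
    """A surviving state depends on a rolled-back state exactly when
    the rolled-back state's timestamp dominates into its own."""
    return dominates(rolled_back_vector, state_vector)

# Example: a state timestamped (2, 5, 1) causally depends on the lost
# state timestamped (1, 4, 0), so it must roll back too.
print(must_roll_back((2, 5, 1), (1, 4, 0)))   # True
```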
{"title":"Completely asynchronous optimistic recovery with minimal rollbacks","authors":"Sean W. Smith, David B. Johnson, J. D. Tygar","doi":"10.1109/FTCS.1995.466963","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466963","url":null,"abstract":"Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. We present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130212682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466956
Title: Evaluation of software dependability based on stability test data
D. Tang, M. Hecht
The paper discusses a measurement-based approach to the dependability evaluation of fault-tolerant, real-time software systems, based on failure data collected from stability tests of an air traffic control system under development. Several dependability analysis techniques are illustrated with the data: parameter estimation; availability modeling of the software at the task level; application of the estimated parameters and model evaluation to assessing availability, identifying key problem areas, and predicting the test duration required to achieve desired availability levels; and quantification of the relationships between software size, number of faults, and failure rate for a software unit. Although most of the discussion focuses on a typical subsystem, the Sector Suite, the methodology is applicable to the other subsystems and to the system as a whole. The study demonstrates a promising approach to measuring and assessing software availability during the development phase, a capability increasingly demanded by the management of projects developing large, critical systems.
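For instance, the standard steady-state availability formula ties the estimated parameters together; the numbers below are invented for illustration, not the paper's data.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Failure rate estimated from stability-test data (failures per test
# hour), combined with a measured mean recovery time for the task.
failures, test_hours = 12, 3000.0
mtbf = test_hours / failures          # ~250 h between failures
mttr = 0.05                           # 3 min to recover the failed task
print(f"estimated availability: {availability(mtbf, mttr):.6f}")
```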
{"title":"Evaluation of software dependability based on stability test data","authors":"D. Tang, M. Hecht","doi":"10.1109/FTCS.1995.466956","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466956","url":null,"abstract":"The paper discusses a measurement-based approach to dependability evaluation of fault-tolerant, real-time software systems based on failure data collected from stability tests of an air traffic control system under development. Several dependability analysis techniques are illustrated with the data: parameter estimation, availability modeling of software from the task level, applications of the parameter estimation and model evaluation in assessing availability, identifying key problem areas, and predicting required test duration for achieving desired levels of availability and quantification of relationships between software size, the number of faults, and failure rate for a software unit. Although most discussion is focused on a typical subsystem, Sector Suite, the discussed methodology is applicable to other subsystems and the system. The study demonstrates a promising approach to measuring and assessing software availability during the development phase, which has been increasingly demanded by the project management of developing large, critical systems.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466976
Title: Implicit signature checking
J. Ohlsson, M. Rimén
Proposes a control flow checking method that assigns a unique initial signature to each basic block in a program by using the block's start address. With this strategy, implicit signature checking points are obtained at the beginning of each basic block, resulting in a short error detection latency (2-5 instructions). Justifying signatures are embedded at each branch instruction, and a watchdog timer is used to detect the absence of a signature checking point. The method does not require building a program flow graph, and it handles jumps to destinations that are not fixed at compile/link time, e.g. subroutine calls through function pointers in the C language. The paper includes a generalized description of the control flow checking method, as well as a description and evaluation of an implementation of the method.
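The mechanism can be sketched as follows, under an assumed signature encoding (low address bits) rather than the paper's exact scheme: the justifying word at each branch adjusts the running signature so that it matches the destination block's address-derived signature.

```python
def block_signature(start_addr):
    # assumed encoding: the implicit signature is the low address bits
    return start_addr & 0xFFFF

def take_branch(running_sig, justifier, dest_addr):
    """Apply the justifying signature embedded at the branch; any
    corrupted control transfer leaves a mismatch at the destination."""
    running_sig ^= justifier
    if running_sig != block_signature(dest_addr):
        raise RuntimeError("control-flow error detected")
    return running_sig

# The compiler would emit justifier = sig_at_branch ^ sig(destination):
src_sig, dest = 0xBEEF, 0x4000
justifier = src_sig ^ block_signature(dest)
take_branch(src_sig, justifier, dest)   # passes; a corrupted jump would not
```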
{"title":"Implicit signature checking","authors":"J. Ohlsson, M. Rimén","doi":"10.1109/FTCS.1995.466976","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466976","url":null,"abstract":"Proposes a control flow checking method that assigns unique initial signatures to each basic block in a program by using the block's start address. Using this strategy, implicit signature checking points are obtained at the beginning of each basic block, which results in a short error detection latency (2-5 instructions). Justifying signatures are embedded at each branch instruction, and a watchdog timer is used to detect the absence of a signature checking point. The method does not require the building of a program flow graph and it handles jumps to destinations that are not fixed at compile/link-time, e.g. subroutine calls using function pointers in the C language. This paper includes a generalized description of the control flow checking method, as well as a description and evaluation of an implementation of the method.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116004082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466953
Title: Why optimistic message logging has not been used in telecommunications systems
Yennun Huang, Yi-Min Wang
Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach that places more emphasis on failure-free overhead than on recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach instead, because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.
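The pessimism amounts to forcing each message to stable storage before it is delivered, which is what keeps recovery fast and local to the failed process. A minimal sketch with hypothetical names:

```python
import json, os

def deliver(log_path, msg, handler):
    """Pessimistic (synchronous) logging: pay the I/O cost on every
    message so no other process ever needs to roll back."""
    with open(log_path, "a") as log:
        log.write(json.dumps(msg) + "\n")
        log.flush()
        os.fsync(log.fileno())   # message is on stable storage first
    handler(msg)                  # only then is it processed

def recover(log_path, handler):
    # Replay the log in order to rebuild the pre-crash state, without
    # coordinating with (or rolling back) any other process.
    with open(log_path) as log:
        for line in log:
            handler(json.loads(line))

received = []
deliver("proc7.log", {"seq": 1, "body": "alloc trunk"}, received.append)
```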
{"title":"Why optimistic message logging has not been used in telecommunications systems","authors":"Yennun Huang, Yi-Min Wang","doi":"10.1109/FTCS.1995.466953","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466953","url":null,"abstract":"Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116812608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466989
Title: ARMOR: analyzer for reducing module operational risk
Michael R. Lyu, Jinsong S. Yu, E. Keramidas, S. Dalal
ARMOR (Analyzer for Reducing Module Operational Risk) is a software risk analysis tool that automatically identifies the operational risks of software program modules. ARMOR takes data directly from the project, failure, and program development databases, establishes risk models according to several risk analysis schemes, determines the risks of software programs, and displays various statistical quantities for project management and engineering decisions. Its enhanced user interface greatly simplifies risk modeling procedures and shortens the learning time. The tool can perform the following tasks during project development, testing, and operation: establish promising risk models for the project under evaluation; measure the risks of software programs within the project; identify the sources of risk and indicate how to improve software programs to reduce their risk levels; and determine the validity of risk models from field data.
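One simple form such a module risk model could take is shown below, purely for illustration; ARMOR's actual schemes are not detailed here, and the metrics and weights are assumptions.

```python
def risk_score(kloc, changes, past_faults, w=(0.4, 0.3, 0.3)):
    """Weighted, normalized risk in [0, 1] from size, churn, and
    fault history (illustrative caps and weights)."""
    feats = (min(kloc / 10.0, 1.0),
             min(changes / 50.0, 1.0),
             min(past_faults / 20.0, 1.0))
    return sum(wi * fi for wi, fi in zip(w, feats))

# Hypothetical module metrics: (KLOC, change requests, past faults).
modules = {"parser": (8.2, 41, 12), "ui": (3.1, 9, 1), "sched": (12.5, 60, 18)}
ranked = sorted(modules, key=lambda m: risk_score(*modules[m]), reverse=True)
print(ranked)   # modules in descending order of operational risk
```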
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466954
Title: The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity
H. Kantz, C. Koza
Since the beginning of the century, Alcatel Austria has been the main supplier of railway signalling products in Austria. In 1985, Alcatel Austria began developing the electronic interlocking system ELEKTRA. To meet the stringent safety requirements of railway interlocking applications, a two-channel system based on design diversity was developed; high availability and reliability are achieved by using actively triplicated redundancy with on-line recovery. The first system was put into operation in 1989; about 15 railway interlocking systems are now in operation, and further installations are ongoing. The paper presents the fault tolerance mechanisms used for design faults as well as physical faults, and discusses the experience gained with these concepts.
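The generic mechanism behind active triplication is 2-out-of-3 majority voting with on-line recovery of a disagreeing replica; the sketch below illustrates that generic pattern only, not ELEKTRA's voter.

```python
from collections import Counter

def vote(replies):
    """Return the majority value of three replica outputs and the
    indices of any disagreeing replicas (to be resynchronized on-line
    rather than shut down)."""
    (value, count), = Counter(replies).most_common(1)
    if count < 2:
        raise RuntimeError("no majority: system must fail safe")
    dissent = [i for i, r in enumerate(replies) if r != value]
    return value, dissent

print(vote(["clear", "clear", "stop"]))   # ('clear', [2])
```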
{"title":"The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity","authors":"H. Kantz, C. Koza","doi":"10.1109/FTCS.1995.466954","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466954","url":null,"abstract":"Since the beginning of the century, Alcatel Austria has been the main supplier of railway signalling products in Austria. In 1985, Alcatel Austria began developing the electronic interlocking system ELEKTRA. In order to meet the stringent safety requirements for railway interlocking applications, a two channel system based on design diversity has been developed. High availability and reliability are achieved by using actively triplicated redundancy with on-line recovery. In 1989, the first system was put into operation. About 15 railway interlocking systems are in operation and further installations are ongoing. The paper presents the fault tolerance mechanisms used for design faults as well as physical faults. The experience gained with these concepts is also discussed.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127552659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466969
Title: Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory
Andreas G. Savva, T. Nanya
The bulk-synchronous parallel model (BSPM) was proposed as a bridging model for parallel computation by Valiant (1990). Using randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. Using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through memory duplication, which ensures global memory integrity and speeds up reconfiguration, and a global reconfiguration method that restores the logical properties of the system after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.
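The duplication idea can be sketched as hashed placement of each logical address on two distinct processors, so that reads remain serviceable after a single fail-stop fault. This is an illustration under assumed hash choices, not the paper's construction.

```python
import hashlib

def owners(addr, num_procs):
    """Randomised placement: hash each logical address to a primary
    processor and an independent, always-distinct backup processor."""
    h = hashlib.sha256(str(addr).encode()).digest()
    primary = int.from_bytes(h[:4], "big") % num_procs
    backup = int.from_bytes(h[4:8], "big") % (num_procs - 1)
    if backup >= primary:
        backup += 1          # keep the two copies on distinct processors
    return primary, backup

def read_owner(addr, num_procs, failed=None):
    p, b = owners(addr, num_procs)
    return b if p == failed else p   # the backup serves reads after a fault

p, b = owners(42, 16)
print(p, b, read_owner(42, 16, failed=p))
```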
{"title":"Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory","authors":"Andreas G. Savva, T. Nanya","doi":"10.1109/FTCS.1995.466969","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466969","url":null,"abstract":"The bulk-synchronous parallel model (BSPM) was proposed as a bridging model for parallel computation by Valiant (1990). By using randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. By using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through: memory duplication to ensure global memory integrity, and to speed up the reconfiguration; a global reconfiguration method that restores the logical properties of the system, after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132711519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}