
Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers: Latest Papers

A fault-tolerant protocol for location directory maintenance in mobile networks
S. Rangarajan, K. Ratnam, A. Dahbura
In this paper, we present a fault-tolerant protocol for maintaining location directories in mobile networks. The protocol tolerates base station failures and keeps location information consistent for mobile hosts that switch off and reappear arbitrarily in another part of the network. Further, the protocol tolerates the corruption of a logical time stamp, which is part of any protocol that must distinguish new location information from old when a location directory is updated. We formally show that the protocol maintains consistent location information and does not overwrite new location information with old location information. The protocol can be hierarchically organized to reduce the message overhead incurred by location directory updates.
DOI: https://doi.org/10.1109/FTCS.1995.466986 | Published: 1995-06-27
Citations: 25
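The rule that new location information must never be overwritten by old information can be illustrated with a small sketch. It is not the protocol from the paper: the class and field names are hypothetical, time stamps are assumed to be plain integers, and neither base station failures nor time stamp corruption (both of which the paper addresses) are modelled.

```python
from dataclasses import dataclass


@dataclass
class LocationEntry:
    base_station: str  # hypothetical identifier of the serving base station
    timestamp: int     # logical time stamp carried by the update (assumed integer)


class LocationDirectory:
    """Toy directory that keeps, per mobile host, the entry with the newest time stamp."""

    def __init__(self):
        self._entries = {}

    def update(self, host, base_station, timestamp):
        """Apply an update only if it is strictly newer than the stored entry."""
        current = self._entries.get(host)
        if current is not None and timestamp <= current.timestamp:
            return False  # stale or duplicate update: keep the newer information
        self._entries[host] = LocationEntry(base_station, timestamp)
        return True

    def lookup(self, host):
        """Return the last known base station for the host, or None if unknown."""
        entry = self._entries.get(host)
        return None if entry is None else entry.base_station
```

An update that arrives late with a smaller time stamp is rejected, so a delayed message cannot move a host back to a location it has already left.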
Measuring robustness of a fault tolerant aerospace system
Christopher P. Dingman, Joe Marshall, D. Siewiorek
In commercial literature, the meaning of the term "fault tolerant" has become vague. We describe a system used to measure the robustness of a fault-tolerant aerospace system developed at IBM, present the data collected during the project, and report conclusions and areas for future work.
DOI: https://doi.org/10.1109/FTCS.1995.466945 | Published: 1995-06-27
Citations: 27
Fault tolerance in safety critical automotive applications: cost of agreement as a limiting factor
S. Poledna
The high availability and safety requirements for automotive electronics are currently addressed almost exclusively by application-specific engineering solutions to fault tolerance rather than by systematic approaches, which are ruled out because of cost. The reason is that a systematic approach to fault tolerance requires replication of components and communication between the replicated components to achieve agreement despite nondeterminism. While replicated components are becoming more readily available as different control units are connected by a multiplex bus, it is shown that the cost of agreement on sensor inputs will become the limiting factor for systematic approaches to fault tolerance. For that reason a new agreement algorithm is introduced which considers the problem of agreement and sensor inputs in an integrated fashion. This algorithm takes advantage of a priori knowledge of the maximum deviation of replicated sensor inputs. Optimality of this algorithm is shown with respect to the minimum number of bits for agreement. This algorithm allows broader application of systematic fault tolerance to automotive applications. The result of this work will be used for a prototype implementation of a safety-critical automotive application.
DOI: https://doi.org/10.1109/FTCS.1995.466996 | Published: 1995-06-27
Citations: 16
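To see why a known bound on sensor deviation can reduce the number of bits exchanged, consider the following sketch. It is an illustration of the general idea only, not the algorithm from the paper. It assumes integer readings and that any two replicated readings differ by at most `max_deviation`; all function names are hypothetical. Under that assumption a replica can send just the residue of its reading modulo `2 * max_deviation + 1`, and the receiver can reconstruct the full value from its own reading.

```python
def bits_needed(max_deviation):
    """Bits per exchanged reading when replicas differ by at most max_deviation."""
    modulus = 2 * max_deviation + 1
    return (modulus - 1).bit_length()  # equals ceil(log2(modulus))


def encode_reading(reading, max_deviation):
    """Send only the residue of the reading instead of the full value."""
    return reading % (2 * max_deviation + 1)


def decode_reading(residue, own_reading, max_deviation):
    """Recover the remote reading from its residue and the local reading.

    The remote value is the unique integer in
    [own_reading - max_deviation, own_reading + max_deviation]
    that is congruent to the residue.
    """
    modulus = 2 * max_deviation + 1
    low = own_reading - max_deviation
    return low + (residue - low) % modulus
```

For example, with a maximum deviation of 3 counts, `bits_needed(3)` is 3, regardless of how wide the full sensor reading is. If the deviation bound is violated, the decoded value is silently wrong, which is why the bound must be known a priori.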
Combining software-implemented and simulation-based fault injection into a single fault injection method
Jens Güthoff, V. Sieh
Fault/error injection has emerged as a valuable means for evaluating the dependability of a system. In particular, software-based techniques (which can be divided into software-implemented and simulation-based techniques) have become very popular because of the relative simplicity of injecting faults. After discussing the advantages and drawbacks of these techniques, two approaches are introduced which try to overcome crucial problems when using software-based fault injection techniques. The first one improves the accuracy of software-implemented fault injection experiments. The second one offers detailed insights into the system dynamics in the presence of faults. With this knowledge, the number of fault injections (a major concern in simulation-based fault injection) can be significantly reduced. These approaches can be joined together, offering accuracy of fault injection results as well as transparency of the system dynamics in the presence of faults. A case study is shown in which the de facto dependability properties of a standard component, a Motorola MC88100 RISC processor, are evaluated.
DOI: https://doi.org/10.1109/FTCS.1995.466978 | Published: 1995-06-27
Citations: 75
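As a rough illustration of what a software-level fault injection experiment looks like, the sketch below flips one bit in one input word and compares the result with a fault-free ("golden") run. This is a generic textbook-style experiment, not the combined method from the paper; the function names, the 32-bit word width, and the idea of treating a program as a Python callable are all assumptions made for the example.

```python
def flip_bit(word, bit, width=32):
    """Return word with one bit inverted, modelling a single transient bit flip."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


def inject_and_compare(program, inputs, target, bit):
    """Run `program` fault-free and again with one bit of one input flipped.

    Returns True when the injected fault changes the output, i.e. the
    error propagated, and False when the program masked it.
    """
    golden = program(list(inputs))
    faulty_inputs = list(inputs)
    faulty_inputs[target] = flip_bit(faulty_inputs[target], bit)
    return program(faulty_inputs) != golden
```

Repeating the experiment over many targets and bit positions gives an estimate of how often faults are masked. Keeping that number of runs manageable is the concern the abstract raises for simulation-based injection.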
Interactive consistency algorithms based on voting and error-correcting codes
T. Krol
This paper presents a new class of synchronous, deterministic, non-authenticated algorithms for reaching interactive consistency (Byzantine agreement). The algorithms are based on voting and error-correcting codes and require considerably less data communication than the original algorithm, while the number of rounds and the number of modules meet the minimum bounds. These algorithms based on voting and coding are defined and proved on the basis of a class of algorithms called the dispersed joined communication algorithms.
DOI: https://doi.org/10.1109/FTCS.1995.466994 | Published: 1995-06-27
Citations: 8
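The "minimum bounds" mentioned above are, for deterministic agreement without authentication in a synchronous system, the classical results of at least 3t + 1 modules and t + 1 rounds to tolerate t arbitrarily faulty modules. The sketch below encodes those bounds together with a strict-majority voter of the kind such algorithms build on. It is not the coding-based algorithm from the paper, and the function names are hypothetical.

```python
from collections import Counter


def minimum_modules(t):
    """Fewest modules that allow unauthenticated agreement with t faulty modules."""
    return 3 * t + 1


def minimum_rounds(t):
    """Fewest communication rounds for deterministic agreement with t faulty modules."""
    return t + 1


def majority(values, default=None):
    """Return the value held by a strict majority of `values`, else `default`."""
    if not values:
        return default
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else default
```

With one faulty module, for instance, four modules and two rounds are required, and each fault-free module votes over the values relayed to it.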
Error detection and handling in a superscalar, speculative out-of-order execution processor system
N. Saxena, Chien Chen, R. Swami, H. Osone, Shalesh Thusoo, D. Lyon, D. Chang, Anand Dharmaraj, N. Patkar, Yizhi Lu, Ben-Hau Chia
The HaL SPARC64 Processor, the first 64-bit SPARC-V9 architecture implementation, uses several techniques to ensure a high degree of system reliability, error detection, and error recovery. The CPU of the multi-chip module processor has a superscalar, speculative issue unit and an out-of-order execution datapath. These two processor components complicate the maintenance of precise state in the event of errors. By exploiting the SPARC-V9 architectural features and the micro-architecture for speculative execution, SPARC64 maintains precise state in the event of exceptions and errors, logs and reports errors, and facilitates error detection during full system bring-up. The paper presents details of error detection and handling in the CPU, the cache system, and the Memory Management Unit (MMU). The HaL R1 system also implements a fault-secure memory system design. The memory system corrects all single-bit errors, detects double-bit errors, detects single address line failures, and detects all single dynamic RAM (DRAM) chip failures. Certain debug features that are useful during system bring-up have been added to the system.
DOI: https://doi.org/10.1109/FTCS.1995.466952 | Published: 1995-06-27
Citations: 11
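The property "corrects all single-bit errors, detects double-bit errors" is what a SEC-DED code provides. The sketch below shows the smallest common instance, an extended Hamming (8,4) code. It is only an illustration of the principle: the code actually used in the HaL R1 memory system is not described in the abstract, and the address-line and DRAM-chip failure detection it mentions is not modelled here. The function names are hypothetical.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                    # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code


def secded_decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'double-error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    overall = sum(code) % 2
    if overall == 1:
        # Odd parity: treated as a single error. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return "double-error", None   # even parity but nonzero syndrome
    else:
        status = "ok"
    d = [code[3], code[5], code[6], code[7]]
    return status, sum(bit << i for i, bit in enumerate(d))
```

Three or more bit errors are outside what this code guarantees and may be miscorrected, which is one reason real memory systems add further checks such as the chip-failure detection mentioned in the abstract.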
Towards totally self-checking delay-insensitive systems
S. Piestrak, T. Nanya
This paper considers the design of quasi-delay-insensitive (QDI) combinational circuits (CCs), a class of self-timed (asynchronous) circuits. The necessity of coding both inputs and outputs of any QDI CC using unordered codes naturally leads to an inverter-free realization. Analysis of the behavior of a QDI CC with input errors leads to the observation that the so-called late detection problem cannot be avoided. A new set of correct definitions of the code-disjoint QDI CC and of the totally self-checking (TSC) QDI CC is introduced. Detailed analysis of the behavior of a faulty QDI system with internal permanent faults shows that (1) late detection, (2) the possibility of invalid transitions, and (3) premature completion seem to be inherent properties of any QDI CC, which preclude its fault-secure (hence TSC) implementation for some single stuck-at faults. The first ever self-testing code-disjoint completion checker is proposed. Finally, an extensive study of designing self-testing code-disjoint QDI CCs is presented.
DOI: https://doi.org/10.1109/FTCS.1995.466975 | Published: 1995-06-27
Citations: 43
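A code is unordered when no codeword has a 1 in every position where some other codeword has a 1. The following sketch checks that property and builds the dual-rail code, a standard unordered code in self-timed design. The helper names are hypothetical and the example makes no claim about the specific codes or checkers studied in the paper. Codewords are assumed to be distinct tuples of 0s and 1s of equal length.

```python
from itertools import permutations


def covers(a, b):
    """True if codeword a has a 1 wherever codeword b has a 1."""
    return all(x >= y for x, y in zip(a, b))


def is_unordered(code):
    """True if no codeword covers a different codeword (codewords assumed distinct)."""
    return not any(covers(a, b) for a, b in permutations(code, 2))


def dual_rail(bits):
    """Encode each data bit b as the pair (b, 1 - b)."""
    out = []
    for b in bits:
        out += [b, 1 - b]
    return tuple(out)
```

Plain binary words are ordered (for example, 11 covers 01), so a receiver watching bits arrive one by one cannot tell a complete 01 from an incomplete 11. With an unordered code such as dual-rail, a partially arrived codeword never looks like another valid codeword, which is what makes completion detection possible.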
Optimal recovery point insertion for high-level synthesis of recoverable microarchitectures
D. Blough, F. Kurdahi, S. Ohm
The paper considers the problem of automatic insertion of recovery points in recoverable microarchitectures. Previous work on this problem provided heuristic algorithms that attempted either to minimize computation time with a bounded hardware overhead or to minimize hardware overhead with a bounded computation time. We present efficient algorithms that provide provably optimal solutions for both of these formulations of the problem. These algorithms take as their input a scheduled control-data flow graph describing the behavior of the system, and they output either a minimum-time or a minimum-cost set of recovery point locations. We demonstrate the performance of our algorithms using some well-known benchmark control-data flow graphs. Over all parameter values for each of these benchmarks, our optimal algorithms are shown to perform as well as, and in many cases better than, the previously proposed heuristics.
DOI: https://doi.org/10.1109/FTCS.1995.466979 | Published: 1995-06-27
Citations: 17
Fault-tolerance for off-the-shelf applications and hardware
M. Russinovich, Z. Segall
The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault-tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating-system-based middleware support that provides the power and flexibility needed by diverse fault-tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies have been implemented using sentries, including checkpointing and journaling. The implementation shows that complex fault-tolerant policies can be efficiently and transparently implemented as middleware. The performance overhead of input journaling is less than 5%, and application suspension during a checkpoint is typically under 10 seconds. A standard hard disk is used to store journal and checkpoint information, with dedicated storage requirements of less than 20 MB.
DOI: https://doi.org/10.1109/FTCS.1995.466997 | Published: 1995-06-27
Citations: 28
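Checkpointing combined with input journaling can be sketched in a few lines. The idea is to save the state occasionally, log every input received since, and recover by replaying the log on top of the last checkpoint. The class below is a toy in-memory model under the assumption that the service is deterministic. It is not the sentry mechanism from the paper, which works inside the operating system and stores its data on disk, and all names are hypothetical.

```python
import copy


class JournaledService:
    """Toy service wrapper that recovers by checkpoint restore plus journal replay."""

    def __init__(self, state, apply_input):
        self.state = state
        # Hypothetical deterministic update function: (state, item) -> new state
        self._apply = apply_input
        self._checkpoint = copy.deepcopy(state)
        self._journal = []

    def handle(self, item):
        """Log the input first, then apply it to the live state."""
        self._journal.append(item)
        self.state = self._apply(self.state, item)

    def checkpoint(self):
        """Save the current state and discard the journal entries it makes redundant."""
        self._checkpoint = copy.deepcopy(self.state)
        self._journal.clear()

    def recover(self):
        """Rebuild the state from the last checkpoint by replaying journaled inputs."""
        state = copy.deepcopy(self._checkpoint)
        for item in self._journal:
            state = self._apply(state, item)
        self.state = state
```

Frequent checkpoints shorten replay after a failure but suspend the application more often, which is the trade-off behind the overhead figures quoted in the abstract.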
A flexible ServerNet-based fault-tolerant architecture
W. Baker, R. Horst, D. Sonnier, W. Watson
The paper introduces a new fault-tolerant architecture that combines the best attributes of the software fault-tolerant Tandem NonStop systems with the hardware fault-tolerant Integrity systems. This architecture is based on the ServerNet System Area Network (SAN). ServerNet, formerly called TNet, is a packetized byte-serial multistage network that supports both I/O and interprocessor traffic in fault-tolerant systems. Dual-ported CPUs and I/O controllers connect to independent subnetworks in a variety of different network topologies. Systems can expand through either shared or distributed memory multiprocessing. A separate maintenance system controls system initialization, online configuration changes, and error reporting. The architecture's flexibility makes it suitable for a wide range of environments with varying requirements for performance, fault tolerance, and software compatibility.
DOI: https://doi.org/10.1109/FTCS.1995.466982 | Published: 1995-06-27
Citations: 57