Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466962
Title: Design fault tolerance in operating systems based on a standardization project
Akio Watanabe, K. Sakamura
We are exploring an MLDD (Multi-Layered Design Diversity) architecture that applies the natural design diversity arising from the TRON standardization project to an application program layer, an operating system layer, and a hardware layer. We have devised a backward error recovery mechanism for the operating system layer and, to implement it, have developed a mechanism that automatically exchanges diverse operating system implementations. The paper presents an error-check generation method for the operating system layer. In this method, called SBACCG (Specification-Based Adaptive Consistency Checks Generation), one set of consistency checks is derived from a formal specification and then adapted to each implementation. We experimentally evaluated the effectiveness of our backward error recovery mechanism using the error checks generated through SBACCG.
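The idea of deriving one set of checks from a formal specification and adapting it to each diverse implementation can be sketched as follows; this is a minimal illustration with hypothetical names and a toy semaphore spec, not the authors' SBACCG tooling.

```python
from collections import namedtuple

# Spec-level view of semaphore state; all names here are illustrative
# assumptions, not SBACCG's actual notation.
SpecState = namedtuple("SpecState", "count caller_blocked")

def check_wait_postcondition(pre, post):
    """One consistency check derived from a formal postcondition of
    wait(): decrement the counter when it is positive, else block."""
    if pre.count > 0:
        return post.count == pre.count - 1 and not post.caller_blocked
    return post.count == pre.count and post.caller_blocked

# Per-implementation adapters map each diverse internal representation
# onto the specification's terms, so one spec-derived check serves both.
def view_impl_a(s):   # implementation A keeps a plain counter
    return SpecState(s["count"], s["caller_state"] == "BLOCKED")

def view_impl_b(s):   # implementation B keeps free slots, offset by one
    return SpecState(s["free_slots"] - 1, s["caller_id"] in s["wait_queue"])

pre = view_impl_a({"count": 1, "caller_state": "READY"})
post = view_impl_a({"count": 0, "caller_state": "READY"})
print(check_wait_postcondition(pre, post))   # True: implementation A is consistent
```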
{"title":"Design fault tolerance in operating systems based on a standardization project","authors":"Akio Watanabe, K. Sakamura","doi":"10.1109/FTCS.1995.466962","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466962","url":null,"abstract":"We are exploring an MLDD (Multi-Layered Design Diversity) architecture that applies natural design diversity to an application program layer, an operating system layer, and a hardware layer based on the TRON standardization project. We have devised a backward error recovery mechanism for the operating system layer, and to implement it, we have developed a mechanism that automatically exchanges diverse operating system implementations. The paper presents an error-check generation method for the operating system layer. In this method, which is called SBACCG (Specification-Based Adaptive Consistency Checks Generation), one set of consistency checks is derived from a formal specification, and the checks are adapted to each implementation. We experimentally evaluated the effectiveness of our backward error recovery mechanism that uses the error checks generated through SBACCG.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115706006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466985
Title: Process allocation for load distribution in fault-tolerant multicomputers
Jong Kim, Heejo Lee, Sunggu Lee
In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load both before and after faults begin to degrade the performance of the system. To tolerate a single fault, each process (the primary process) is duplicated, i.e. given a backup process. The backup process executes on a different processor from the primary, checkpointing the primary process and recovering it if the primary fails due to the occurrence of a fault. We first formalize the load-balancing process allocation problem and show that it is NP-hard. Next, we propose a new heuristic process allocation method and analyze its performance. Simulations compare the proposed method with a process allocation method that does not take into account the different load characteristics of primary and backup processes. While both methods perform well before a fault occurs in a primary process, only the proposed method maintains a balanced load after such a fault.
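As an illustration of this problem setting, a greedy heuristic in the same spirit might weight primary and backup loads differently and keep the two copies on distinct processors. The sketch below uses assumed load weights and is not the paper's algorithm.

```python
# Assumed relative costs: a primary runs at full load, a backup only
# absorbs checkpointing traffic until its primary fails.
PRIMARY_LOAD = 1.0
BACKUP_LOAD = 0.2

def allocate(num_procs, processes):
    load = [0.0] * num_procs
    placement = {}
    for p in processes:
        # place the primary on the least-loaded processor
        prim = min(range(num_procs), key=lambda i: load[i])
        load[prim] += PRIMARY_LOAD
        # place the backup on the least-loaded *other* processor, so a
        # single processor fault cannot take out both copies
        back = min((i for i in range(num_procs) if i != prim),
                   key=lambda i: load[i])
        load[back] += BACKUP_LOAD
        placement[p] = (prim, back)
    return placement, load

print(allocate(4, ["p1", "p2", "p3", "p4", "p5"]))
```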
{"title":"Process allocation for load distribution in fault-tolerant multicomputers","authors":"Jong Kim, Heejo Lee, Sunggu Lee","doi":"10.1109/FTCS.1995.466985","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466985","url":null,"abstract":"In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load before as well as after faults start to degrade the performance of the system. In order to be able to tolerate a single fault, each process (primary process) is duplicated (i.e. has a backup process). The backup process executes on a different processor from the primary, checkpointing the primary process and recovering the process if the primary process fails due to the occurrence of a fault. In this paper, we first formalize the problem of load-balancing process allocation and show that it is an NP-hard problem. Next, we propose a new heuristic process allocation method and analyze the performance of the proposed allocation method. Simulations are used to compare the proposed method with a process allocation method that does not take into account the different load characteristics of the primary and backup processes. While both methods perform well before the occurrence of a fault in a primary process, only the proposed method maintains a balanced load after the occurrence of such a fault.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121389380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466951
Title: Design verification of a super-scalar RISC processor
Babu Turumella, Aiman Kabakibo, Manjunath Bogadi, Karakunakara Menon, Shaleah Thusoo, Long Nguyen, N. Saxena, Michael Chow
The paper provides an overview of the design verification methodology used in the development of HaL's Sparc64 processor, an activity spanning approximately two and a half years of design development. Objectives and challenges are discussed and the verification methodology is described, including monitoring mechanisms that give high observability of internal design states, novel features that increase simulation speed, and tools for automatic result checking. Also presented, for the first time, is an analysis of the design defects discovered during the verification process; such an analysis is useful in augmenting verification programs to target common design defects.
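To make "automatic result checking" concrete, one common pattern replays each retired instruction on a golden reference model and flags the first mismatch. The sketch below illustrates that general pattern with a toy two-instruction ISA and hypothetical names; it is not HaL's tooling.

```python
class GoldenModel:
    """Trivial architectural reference model with eight registers."""
    def __init__(self):
        self.regs = [0] * 8
    def step(self, op, dst, a, b):
        if op == "add":
            self.regs[dst] = self.regs[a] + self.regs[b]
        elif op == "sub":
            self.regs[dst] = self.regs[a] - self.regs[b]
        return self.regs[dst]

def check_trace(dut_trace):
    """Compare each simulated result against the golden model."""
    ref = GoldenModel()
    for n, (op, dst, a, b, dut_result) in enumerate(dut_trace):
        if ref.step(op, dst, a, b) != dut_result:
            return f"mismatch at instruction {n}"
    return "trace OK"

# Registers start at 0, so the second result below (7) is a planted defect.
print(check_trace([("add", 1, 0, 0, 0), ("sub", 2, 1, 0, 7)]))
```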
{"title":"Design verification of a super-scalar RISC processor","authors":"Babu Turumella, Aiman Kabakibo, Manjunath Bogadi, Karakunakara Menon, Shaleah Thusoo, Long Nguyen, N. Saxena, Michael Chow","doi":"10.1109/FTCS.1995.466951","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466951","url":null,"abstract":"The paper provides an overview of the design verification methodology for HaL's Sparc64 processor development. This activity covered approximately two and a half years of design development time. Objectives and challenges are discussed and the verification methodology is described. Monitoring mechanisms that give high observability to internal design states, novel features that increase the simulation speed, and tools for automatic result checking are described. Also presented for the first time, is an analysis of the design defects discovered during the verification process. Such an analysis is useful in augmenting verification programs to target common design defects.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"24 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113964849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466963
Title: Completely asynchronous optimistic recovery with minimal rollbacks
Sean W. Smith, David B. Johnson, J. D. Tygar
Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols have either required synchronization during recovery or permitted a failure at one process to trigger an exponential number of process rollbacks. We present an optimistic rollback recovery protocol that provides completely asynchronous recovery while guaranteeing that each process rolls back at most once in response to any failure. The protocol is based on comparing timestamp vectors across multiple levels of partial order time.
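The core dependency test can be sketched with ordinary vector timestamps; this is a simplified illustration, since the paper's protocol maintains such vectors at multiple levels of partial order time.

```python
def dominates(v, w):
    """True if vector timestamp v happened before (or equals) w,
    i.e. v is componentwise <= w."""
    return all(vi <= wi for vi, wi in zip(v, w))

def must_roll_back(state_vector, rolled_back_vector):
    """A surviving state depends on a rolled-back state exactly when
    the rolled-back state's timestamp dominates into its own."""
    return dominates(rolled_back_vector, state_vector)

# Example: a state timestamped (2, 5, 1) causally depends on the lost
# state timestamped (1, 4, 0), so it must roll back too.
print(must_roll_back((2, 5, 1), (1, 4, 0)))   # True
```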
{"title":"Completely asynchronous optimistic recovery with minimal rollbacks","authors":"Sean W. Smith, David B. Johnson, J. D. Tygar","doi":"10.1109/FTCS.1995.466963","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466963","url":null,"abstract":"Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. We present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130212682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466956
Title: Evaluation of software dependability based on stability test data
D. Tang, M. Hecht
The paper discusses a measurement-based approach to the dependability evaluation of fault-tolerant, real-time software systems, based on failure data collected from stability tests of an air traffic control system under development. Several dependability analysis techniques are illustrated with the data: parameter estimation; availability modeling of the software at the task level; application of the estimated parameters and model evaluation to assessing availability, identifying key problem areas, and predicting the test duration required to achieve desired availability levels; and quantification of the relationships between software size, number of faults, and failure rate for a software unit. Although most of the discussion focuses on a typical subsystem, the Sector Suite, the methodology is applicable to the other subsystems and to the system as a whole. The study demonstrates a promising approach to measuring and assessing software availability during the development phase, a capability increasingly demanded by the management of projects developing large, critical systems.
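For instance, the standard steady-state availability formula ties the estimated parameters together; the numbers below are invented for illustration, not the paper's data.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Failure rate estimated from stability-test data (failures per test
# hour), combined with a measured mean recovery time for the task.
failures, test_hours = 12, 3000.0
mtbf = test_hours / failures          # ~250 h between failures
mttr = 0.05                           # 3 min to recover the failed task
print(f"estimated availability: {availability(mtbf, mttr):.6f}")
```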
{"title":"Evaluation of software dependability based on stability test data","authors":"D. Tang, M. Hecht","doi":"10.1109/FTCS.1995.466956","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466956","url":null,"abstract":"The paper discusses a measurement-based approach to dependability evaluation of fault-tolerant, real-time software systems based on failure data collected from stability tests of an air traffic control system under development. Several dependability analysis techniques are illustrated with the data: parameter estimation, availability modeling of software from the task level, applications of the parameter estimation and model evaluation in assessing availability, identifying key problem areas, and predicting required test duration for achieving desired levels of availability and quantification of relationships between software size, the number of faults, and failure rate for a software unit. Although most discussion is focused on a typical subsystem, Sector Suite, the discussed methodology is applicable to other subsystems and the system. The study demonstrates a promising approach to measuring and assessing software availability during the development phase, which has been increasingly demanded by the project management of developing large, critical systems.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466976
Title: Implicit signature checking
J. Ohlsson, M. Rimén
Proposes a control flow checking method that assigns a unique initial signature to each basic block in a program by using the block's start address. With this strategy, implicit signature checking points are obtained at the beginning of each basic block, resulting in a short error detection latency (2-5 instructions). Justifying signatures are embedded at each branch instruction, and a watchdog timer is used to detect the absence of a signature checking point. The method does not require building a program flow graph, and it handles jumps to destinations that are not fixed at compile/link time, e.g. subroutine calls through function pointers in the C language. The paper includes a generalized description of the control flow checking method, as well as a description and evaluation of an implementation of the method.
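The mechanism can be sketched as follows, under an assumed signature encoding (low address bits) rather than the paper's exact scheme: the justifying word at each branch adjusts the running signature so that it matches the destination block's address-derived signature.

```python
def block_signature(start_addr):
    # assumed encoding: the implicit signature is the low address bits
    return start_addr & 0xFFFF

def take_branch(running_sig, justifier, dest_addr):
    """Apply the justifying signature embedded at the branch; any
    corrupted control transfer leaves a mismatch at the destination."""
    running_sig ^= justifier
    if running_sig != block_signature(dest_addr):
        raise RuntimeError("control-flow error detected")
    return running_sig

# The compiler would emit justifier = sig_at_branch ^ sig(destination):
src_sig, dest = 0xBEEF, 0x4000
justifier = src_sig ^ block_signature(dest)
take_branch(src_sig, justifier, dest)   # passes; a corrupted jump would not
```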
{"title":"Implicit signature checking","authors":"J. Ohlsson, M. Rimén","doi":"10.1109/FTCS.1995.466976","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466976","url":null,"abstract":"Proposes a control flow checking method that assigns unique initial signatures to each basic block in a program by using the block's start address. Using this strategy, implicit signature checking points are obtained at the beginning of each basic block, which results in a short error detection latency (2-5 instructions). Justifying signatures are embedded at each branch instruction, and a watchdog timer is used to detect the absence of a signature checking point. The method does not require the building of a program flow graph and it handles jumps to destinations that are not fixed at compile/link-time, e.g. subroutine calls using function pointers in the C language. This paper includes a generalized description of the control flow checking method, as well as a description and evaluation of an implementation of the method.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116004082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466953
Title: Why optimistic message logging has not been used in telecommunications systems
Yennun Huang, Yi-Min Wang
Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach that places more emphasis on failure-free overhead than on recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach instead, because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.
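The pessimism amounts to forcing each message to stable storage before it is delivered, which is what keeps recovery fast and local to the failed process. A minimal sketch with hypothetical names:

```python
import json, os

def deliver(log_path, msg, handler):
    """Pessimistic (synchronous) logging: pay the I/O cost on every
    message so no other process ever needs to roll back."""
    with open(log_path, "a") as log:
        log.write(json.dumps(msg) + "\n")
        log.flush()
        os.fsync(log.fileno())   # message is on stable storage first
    handler(msg)                  # only then is it processed

def recover(log_path, handler):
    # Replay the log in order to rebuild the pre-crash state, without
    # coordinating with (or rolling back) any other process.
    with open(log_path) as log:
        for line in log:
            handler(json.loads(line))

received = []
deliver("proc7.log", {"seq": 1, "body": "alloc trunk"}, received.append)
```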
{"title":"Why optimistic message logging has not been used in telecommunications systems","authors":"Yennun Huang, Yi-Min Wang","doi":"10.1109/FTCS.1995.466953","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466953","url":null,"abstract":"Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116812608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466989
Title: ARMOR: analyzer for reducing module operational risk
Michael R. Lyu, Jinsong S. Yu, E. Keramidas, S. Dalal
ARMOR (Analyzer for Reducing Module Operational Risk) is a software risk analysis tool that automatically identifies the operational risks of software program modules. ARMOR takes data directly from the project, failure, and program development databases, establishes risk models according to several risk analysis schemes, determines the risks of software programs, and displays various statistical quantities for project management and engineering decisions. Its enhanced user interface greatly simplifies risk modeling procedures and shortens the learning time. The tool can perform the following tasks during project development, testing, and operation: establish promising risk models for the project under evaluation; measure the risks of software programs within the project; identify the sources of risk and indicate how to improve software programs to reduce their risk levels; and determine the validity of risk models from field data.
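One simple form such a module risk model could take is shown below, purely for illustration; ARMOR's actual schemes are not detailed here, and the metrics and weights are assumptions.

```python
def risk_score(kloc, changes, past_faults, w=(0.4, 0.3, 0.3)):
    """Weighted, normalized risk in [0, 1] from size, churn, and
    fault history (illustrative caps and weights)."""
    feats = (min(kloc / 10.0, 1.0),
             min(changes / 50.0, 1.0),
             min(past_faults / 20.0, 1.0))
    return sum(wi * fi for wi, fi in zip(w, feats))

# Hypothetical module metrics: (KLOC, change requests, past faults).
modules = {"parser": (8.2, 41, 12), "ui": (3.1, 9, 1), "sched": (12.5, 60, 18)}
ranked = sorted(modules, key=lambda m: risk_score(*modules[m]), reverse=True)
print(ranked)   # modules in descending order of operational risk
```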
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466954
Title: The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity
H. Kantz, C. Koza
Since the beginning of the century, Alcatel Austria has been the main supplier of railway signalling products in Austria. In 1985, Alcatel Austria began developing the electronic interlocking system ELEKTRA. To meet the stringent safety requirements of railway interlocking applications, a two-channel system based on design diversity was developed; high availability and reliability are achieved by using actively triplicated redundancy with on-line recovery. The first system was put into operation in 1989; about 15 railway interlocking systems are now in operation, and further installations are ongoing. The paper presents the fault tolerance mechanisms used for design faults as well as physical faults, and discusses the experience gained with these concepts.
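The generic mechanism behind active triplication is 2-out-of-3 majority voting with on-line recovery of a disagreeing replica; the sketch below illustrates that generic pattern only, not ELEKTRA's voter.

```python
from collections import Counter

def vote(replies):
    """Return the majority value of three replica outputs and the
    indices of any disagreeing replicas (to be resynchronized on-line
    rather than shut down)."""
    (value, count), = Counter(replies).most_common(1)
    if count < 2:
        raise RuntimeError("no majority: system must fail safe")
    dissent = [i for i, r in enumerate(replies) if r != value]
    return value, dissent

print(vote(["clear", "clear", "stop"]))   # ('clear', [2])
```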
{"title":"The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity","authors":"H. Kantz, C. Koza","doi":"10.1109/FTCS.1995.466954","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466954","url":null,"abstract":"Since the beginning of the century, Alcatel Austria has been the main supplier of railway signalling products in Austria. In 1985, Alcatel Austria began developing the electronic interlocking system ELEKTRA. In order to meet the stringent safety requirements for railway interlocking applications, a two channel system based on design diversity has been developed. High availability and reliability are achieved by using actively triplicated redundancy with on-line recovery. In 1989, the first system was put into operation. About 15 railway interlocking systems are in operation and further installations are ongoing. The paper presents the fault tolerance mechanisms used for design faults as well as physical faults. The experience gained with these concepts is also discussed.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127552659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1995-06-27 | DOI: 10.1109/FTCS.1995.466969
Title: Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory
Andreas G. Savva, T. Nanya
The bulk-synchronous parallel model (BSPM) was proposed as a bridging model for parallel computation by Valiant (1990). Using randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. Using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through memory duplication, which ensures global memory integrity and speeds up reconfiguration, and a global reconfiguration method that restores the logical properties of the system after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.
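The duplication idea can be sketched as hashed placement of each logical address on two distinct processors, so that reads remain serviceable after a single fail-stop fault. This is an illustration under assumed hash choices, not the paper's construction.

```python
import hashlib

def owners(addr, num_procs):
    """Randomised placement: hash each logical address to a primary
    processor and an independent, always-distinct backup processor."""
    h = hashlib.sha256(str(addr).encode()).digest()
    primary = int.from_bytes(h[:4], "big") % num_procs
    backup = int.from_bytes(h[4:8], "big") % (num_procs - 1)
    if backup >= primary:
        backup += 1          # keep the two copies on distinct processors
    return primary, backup

def read_owner(addr, num_procs, failed=None):
    p, b = owners(addr, num_procs)
    return b if p == failed else p   # the backup serves reads after a fault

p, b = owners(42, 16)
print(p, b, read_owner(42, 16, failed=p))
```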
{"title":"Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory","authors":"Andreas G. Savva, T. Nanya","doi":"10.1109/FTCS.1995.466969","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466969","url":null,"abstract":"The bulk-synchronous parallel model (BSPM) was proposed as a bridging model for parallel computation by Valiant (1990). By using randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. By using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through: memory duplication to ensure global memory integrity, and to speed up the reconfiguration; a global reconfiguration method that restores the logical properties of the system, after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132711519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}