Using physical and simulated fault injection to evaluate error detection mechanisms
C. Constantinescu
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816228
Effective error detection is paramount for building highly dependable computing systems. A new methodology, based on physical and simulated fault injection, is developed for evaluating error detection mechanisms. Our approach consists of two steps. First, transient faults are physically injected at the IC pin level of a prototype server. Experiments are carried out in a three-dimensional space of events, with the location, time of occurrence, and duration of each fault selected at random. Improved detection circuitry is devised to decrease signal sensitivity to transients. Second, simulated fault injection is performed to assess the effectiveness of the new detection mechanisms without resorting to expensive silicon implementations. Physical fault injection experiments carried out on the server and simulated fault injection performed on a protocol checker are presented. Detection effectiveness is measured by the error detection coverage, defined as the conditional probability that an error is detected given that an error occurs. Fault injection reveals that the coverage probability is a function of fault duration. The protocol checker significantly improves error detection; however, further research is required to increase the detection coverage of errors induced by short transient faults.
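To make the coverage metric concrete, here is a minimal sketch of the arithmetic behind such a campaign (the probabilities, durations, and the helper run_campaign are invented for illustration and are not the authors' tooling): faults are injected at random, injections that cause no error are discarded, and coverage is estimated separately for each fault duration.

```python
import random
from collections import defaultdict

def run_campaign(n_injections=10_000, durations=(50, 100, 500, 1000)):
    """Estimate detection coverage per fault duration from simulated injections.

    Coverage is the conditional probability P(detected | error occurred),
    so injections that cause no error are excluded from the denominator.
    """
    stats = defaultdict(lambda: {"errors": 0, "detected": 0})
    for _ in range(n_injections):
        duration_ns = random.choice(durations)   # fault duration in ns (assumed values)
        # Invented behaviour: longer transients are more likely both to cause
        # an error and to be caught by the detection circuitry.
        caused_error = random.random() < min(1.0, duration_ns / 800)
        if not caused_error:
            continue
        detected = random.random() < min(0.99, 0.4 + duration_ns / 2000)
        stats[duration_ns]["errors"] += 1
        stats[duration_ns]["detected"] += int(detected)
    return {d: s["detected"] / s["errors"]
            for d, s in sorted(stats.items()) if s["errors"]}

if __name__ == "__main__":
    for duration, coverage in run_campaign().items():
        print(f"{duration:>5} ns transient: estimated coverage = {coverage:.3f}")
```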
{"title":"Using physical and simulated fault injection to evaluate error detection mechanisms","authors":"C. Constantinescu","doi":"10.1109/PRDC.1999.816228","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816228","url":null,"abstract":"Effective error detection is paramount for building highly dependable computing systems. A new methodology, based on physical and simulated fault injection, is developed for evaluating error detection mechanisms. Our approach consists of two steps. First, transient faults are physically injected at the IC pin level of a prototype server. Experiments are carried our in a three dimensional space of events, the location, time of occurrence and duration of the fault being randomly selected. Improved detection circuitry is devised for decreasing signal sensitivity to transients. Second, simulated fault injection is performed to asses the effectiveness of the new detection mechanisms, without using expensive silicon implementations. Physical fault injection experiments, carried out on the server, and simulated fault injection, performed on protocol checker, are presented. Detection effectiveness is measured by the error detection coverage, defined as the conditional probability that an error is detected given that an error occurs. Fault injection reveals that coverage probability is a function of fault duration. The protocol checker significantly improves error detection. Although, further research is required to increase detection coverage of the errors induced by short transient faults.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127729930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An architecture-based software reliability model
Wen-li Wang, Ye Wu, Mei-Hwa Chen
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816223
We present an analytical model for estimating architecture-based software reliability from the reliability of each component, the operational profile, and the architecture of the software. Our approach uses Markov chain properties and architecture-view-to-state-view transformations to perform reliability analysis on heterogeneous software architectures. We demonstrate how this analytical model can be used to estimate the reliability of a heterogeneous architecture consisting of batch-sequential/pipeline, call-and-return, parallel/pipe-filter, and fault-tolerance styles. In addition, we conduct an experiment on a system built from three architectural styles to validate this heterogeneous software reliability model.
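A minimal sketch of a Markov-chain reliability computation in this spirit is given below (a Cheung-style model; the component reliabilities and transition probabilities are assumed for illustration, not taken from the paper). Heterogeneous styles would enter through the structure of the transition matrix.

```python
import numpy as np

# Operational profile: transition probabilities between components (assumed).
P = np.array([
    [0.0, 0.6, 0.4],   # component 1 hands control to 2 or 3
    [0.0, 0.0, 1.0],   # component 2 hands control to 3
    [0.0, 0.0, 0.0],   # component 3 is terminal
])
R = np.array([0.99, 0.98, 0.995])   # per-component reliabilities (assumed)

def system_reliability(P, R, start=0, final=2):
    """Probability of a correct execution from `start` to `final`."""
    Q = np.diag(R) @ P                     # each transition counts only if the
                                           # source component executed correctly
    S = np.linalg.inv(np.eye(len(R)) - Q)  # expected correct visits: (I - Q)^-1
    return S[start, final] * R[final]      # reach the final component, then run it

print(f"estimated system reliability: {system_reliability(P, R):.4f}")
```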
{"title":"An architecture-based software reliability model","authors":"Wen-li Wang, Ye Wu, Mei-Hwa Chen","doi":"10.1109/PRDC.1999.816223","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816223","url":null,"abstract":"We present an analytical model for estimating architecture-based software reliability, according to the reliability of each component, the operational profile, and the architecture of software. Our approach is based on Markov chain properties and architecture view to state view transformations to perform reliability analysis on heterogeneous software architectures. We demonstrate how this analytical model can be utilized to estimate the reliability of a heterogeneous architecture consisting of batch-sequential/pipeline, call-and-return, parallel/pipe-filters, and fault tolerance styles. In addition, we conduct an experiment on a system embedded with three architectural styles to validate this heterogeneous software reliability model.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115626530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of message logging protocols for NOWs with MPI
Shahnaz Afroz, H. Youn, Dongman Lee
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816236
Among the various systems developed for parallel and distributed computing, networks of workstations (NOWs) based on the Message Passing Interface (MPI) have been recognized as an efficient platform. In this paper, we implement and compare two important message logging protocols, pessimistic and optimistic, for a NOW employing MPI. An experiment reveals that the total execution time is not significantly affected by the number of failures, while the performance of the optimistic protocol is more influenced by the number of failures than the pessimistic protocol is. Also, the former is more efficient than the latter for a reasonable number of failure points.
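The trade-off can be illustrated with a back-of-the-envelope cost model (every constant below is an assumption chosen for illustration, not a measurement from the paper): pessimistic logging pays a small synchronous cost on every message, while optimistic logging is cheaper per message but loses more work whenever a failure forces a rollback.

```python
def total_time(base_time, n_messages, n_failures, protocol):
    """Crude execution-time model for the two logging protocols (assumed costs)."""
    if protocol == "pessimistic":
        log_overhead_per_msg = 0.002     # synchronous log before delivery (s)
        rollback_per_failure = 0.5       # restart close to the failure point (s)
    elif protocol == "optimistic":
        log_overhead_per_msg = 0.0005    # asynchronous logging is cheaper (s)
        rollback_per_failure = 3.0       # may roll back further, losing work (s)
    else:
        raise ValueError(protocol)
    return (base_time
            + n_messages * log_overhead_per_msg
            + n_failures * rollback_per_failure)

for failures in (0, 2, 5, 10):
    p = total_time(100.0, 20_000, failures, "pessimistic")
    o = total_time(100.0, 20_000, failures, "optimistic")
    print(f"{failures:>2} failures: pessimistic {p:6.1f} s, optimistic {o:6.1f} s")
```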
{"title":"Performance of message logging protocols for NOWs with MPI","authors":"Shahnaz Afroz, H. Youn, Dongman Lee","doi":"10.1109/PRDC.1999.816236","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816236","url":null,"abstract":"Among the various systems developed for parallel and distributed computing, networks of workstations (NOWs) based on the Message Passing Interface (MPI) have been recognized as an efficient platform. In this paper, we implement and compare two important message logging protocols, pessimistic and optimistic, for a NOW employing MPI. An experiment reveals that the total execution time is not significantly affected by the number of failures, while the performance of the optimistic protocol is more influenced by the number of failures than the pessimistic protocol is. Also, the former is more efficient than the latter for a reasonable number of failure points.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130750349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing-resource allocation for redundant software systems
Bo Yang, M. Xie
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816215
For many safety-critical systems, redundancy is the only acceptable method for achieving high operational reliability, as individual modules can hardly be certified to have reached that level. When limited resources are available for testing a redundant software system, it is important to allocate the testing time efficiently so that the maximum reliability of the complete system is achieved. In this paper, this problem is investigated in detail. A general formulation is presented, and a specific case is used to illustrate the procedure. The case where individual module reliability requirements are given is also considered.
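A hedged sketch of the kind of formulation involved follows; the exponential reliability-growth model, the parameter values, and the greedy heuristic are assumptions made for illustration, not the paper's method. Each module i is assumed to fail with probability a_i * exp(-b_i * t_i) after t_i units of testing, and the parallel (redundant) system fails only if every module fails.

```python
import math

def system_reliability(alloc, a, b):
    """Redundant system: it fails only if every module fails."""
    unreliability = 1.0
    for t, ai, bi in zip(alloc, a, b):
        unreliability *= ai * math.exp(-bi * t)   # module failure probability
    return 1.0 - unreliability

def greedy_allocate(total_time, a, b, step=1.0):
    """Give each slice of testing time to the module whose extra testing
    raises system reliability the most (a simple heuristic, not optimal)."""
    alloc = [0.0] * len(a)
    spent = 0.0
    while spent < total_time:
        base = system_reliability(alloc, a, b)
        gains = []
        for i in range(len(a)):
            trial = alloc[:]
            trial[i] += step
            gains.append(system_reliability(trial, a, b) - base)
        alloc[gains.index(max(gains))] += step
        spent += step
    return alloc

# Assumed initial failure probabilities a_i and growth rates b_i for 3 modules.
a, b = [0.4, 0.3, 0.5], [0.05, 0.08, 0.03]
alloc = greedy_allocate(100.0, a, b)
print("testing-time allocation:", [round(t, 1) for t in alloc])
print("system reliability:", round(system_reliability(alloc, a, b), 6))
```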
{"title":"Testing-resource allocation for redundant software systems","authors":"Bo Yang, M. Xie","doi":"10.1109/PRDC.1999.816215","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816215","url":null,"abstract":"For many safety critical systems, redundancy is the only acceptable method to achieve high operational reliability as individual modules can hardly be certified to have reached that level. When limited resources are available in the testing of a redundant software system, it is important to allocate the testing-time efficiently so that the maximum reliability of the complete system is achieved. In this paper, this problem is investigated in detail. A general formulation is presented and a specific case is used to illustrate the procedure. The case where individual module reliability requirements are given is also considered.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132076369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A fault-tolerant data communication setup to improve reliability and performance for Internet based distributed applications
Allan K. Y. Wong, T. Dillon
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816238
The proposed fault-tolerant data communication setup has two main features: a consecutive transmission scheme that improves the reliability of message transmission, and an adaptive buffer management scheme that prevents message losses due to buffer overflow. These two features together reduce message retransmissions and produce better channel reliability and system performance. Simulation data confirm that the adaptive buffer management scheme is indeed an effective reliability measure to prevent data overflow.
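The buffer-management idea can be sketched as a toy class (the thresholds, growth factor, and class name are invented for illustration, not the authors' implementation): the receive buffer grows when occupancy approaches capacity, so a burst of arrivals is absorbed rather than dropped and retransmitted.

```python
class AdaptiveBuffer:
    """Receive buffer that enlarges itself before it overflows (toy model)."""

    def __init__(self, capacity=64, high_watermark=0.8, growth_factor=2.0):
        self.capacity = capacity
        self.high_watermark = high_watermark
        self.growth_factor = growth_factor
        self.queue = []

    def enqueue(self, msg):
        # Grow when occupancy crosses the high watermark, so the burst is
        # absorbed instead of causing overflow.
        if len(self.queue) >= self.high_watermark * self.capacity:
            self.capacity = int(self.capacity * self.growth_factor)
        if len(self.queue) >= self.capacity:
            return False          # overflow: the message would be lost
        self.queue.append(msg)
        return True

    def dequeue(self):
        return self.queue.pop(0) if self.queue else None

buf = AdaptiveBuffer()
dropped = sum(not buf.enqueue(i) for i in range(500))   # a 500-message burst
print("messages dropped during the burst:", dropped)     # 0 with these settings
```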
{"title":"A fault-tolerant data communication setup to improve reliability and performance for Internet based distributed applications","authors":"Allan K. Y. Wong, T. Dillon","doi":"10.1109/PRDC.1999.816238","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816238","url":null,"abstract":"The proposed fault-tolerant data communication setup has two main features: a consecutive transmission scheme that improves the reliability of message transmission, and an adaptive buffer management scheme that prevents message losses due to buffer overflow. These two features together reduce message retransmissions and produce better channel reliability and system performance. Simulation data confirm that the adaptive buffer management scheme is indeed an effective reliability measure to prevent data overflow.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125089740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Networked Windows NT system field failure data analysis
Jun Xu, Z. Kalbarczyk, R. Iyer
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816227
This paper presents a measurement-based dependability study of a networked Windows NT system, based on field data collected from the NT system logs of 503 servers running in a production environment over a four-month period. The event logs at hand contain only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to total system downtime (22% and 10%); (2) recovery from application software failures is usually quick; (3) in many cases, more than one reboot is required to recover from a failure; (4) the average availability of an individual server is over 99%; (5) there is a strong indication of error dependency or error propagation across the network; (6) most reboots (58%) are unclassified, indicating the need for better logging techniques; and (7) maintenance and configuration contribute 24% of system downtime.
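Availability and downtime-breakdown figures of this kind follow directly from the reboot records. The fragment below shows the arithmetic on a few made-up log entries (the events, causes, and measurement window are illustrative, not the paper's data).

```python
from datetime import datetime, timedelta

# (outage start, recovery end, cause) -- illustrative entries only
events = [
    (datetime(1999, 1, 10, 3, 0),  datetime(1999, 1, 10, 3, 45), "system software"),
    (datetime(1999, 2, 2, 14, 0),  datetime(1999, 2, 2, 16, 0),  "hardware"),
    (datetime(1999, 3, 15, 1, 0),  datetime(1999, 3, 15, 1, 20), "maintenance"),
]
period = timedelta(days=120)                      # four-month measurement window

downtime = sum((end - start for start, end, _ in events), timedelta())
availability = 1 - downtime / period
print(f"availability: {availability:.5f}")        # well above 0.99 here

by_cause = {}
for start, end, cause in events:
    by_cause[cause] = by_cause.get(cause, timedelta()) + (end - start)
for cause, d in by_cause.items():
    print(f"{cause:>16}: {d / downtime:.0%} of downtime")
```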
{"title":"Networked Windows NT system field failure data analysis","authors":"Jun Xu, Z. Kalbarczyk, R. Iyer","doi":"10.1109/PRDC.1999.816227","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816227","url":null,"abstract":"This paper presents a measurement-based dependability study of a Networked Windows NT system based on field data collected from NT System Logs from 503 servers running in a production environment over a four-month period. The event logs at hand contains only system reboot information. We study individual server failures and domain behavior in order to characterize failure behavior and explore error propagation between servers. The key observations from this study are: (1) system software and hardware failures are the two major contributors to the total system downtime (22% and 10%), (2) recovery from application software failures are usually quick, (3) in many cases, more than one reboots are required to recover from a failure, (4) the average availability of an individual server is over 99%, (5) there is a strong indication of error dependency or error propagation across the network, (6) most (58%) reboots are unclassified indicating the need for better logging techniques, (7) maintenance and configuration contribute to 24% of system downtime.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128039400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new placement algorithm dedicated to parallel computers: bases and application
F. Clermidy, T. Collette, M. Nicolaidis
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816235
One way to improve reliability in parallel computers is to add spare processors and interconnections to the functional structure so that faulty processors can be replaced while preserving the network structure. This approach is called structural fault tolerance (SFT). Highly integrated parallel computers are one way to implement a parallel structure: the hardware is then composed of many elementary blocks, such as ASICs or multi-chip modules (MCMs), each containing many processors. We show that former SFT methods fail to combine the different features, constraints, and requirements of such structures. This paper therefore introduces a new reconfiguration approach dedicated to highly integrated parallel computers.
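For readers unfamiliar with structural fault tolerance, the toy function below shows the basic effect a reconfiguration provides: logical processors are remapped onto physical ones around a faulty element using one spare per row. This is a deliberately simplified row-shifting scheme, not the placement algorithm proposed in the paper.

```python
def reconfigure_row(row_width, faulty_index):
    """Map logical processors 0..row_width-1 onto a physical row that has one
    spare at the end, skipping the faulty physical element."""
    physical = [p for p in range(row_width + 1) if p != faulty_index]
    return {logical: physical[logical] for logical in range(row_width)}

# With 4 logical processors, a spare at position 4, and a fault at position 2:
print(reconfigure_row(4, faulty_index=2))   # {0: 0, 1: 1, 2: 3, 3: 4}
```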
{"title":"A new placement algorithm dedicated to parallel computers: bases and application","authors":"F. Clermidy, T. Collette, M. Nicolaidis","doi":"10.1109/PRDC.1999.816235","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816235","url":null,"abstract":"One way to improve reliability in parallel computers consists of adding supplementary processors and interconnections to the functional structure in order to replace faulty processors with respect to the network structure. This approach is named structural fault tolerance (SFT). Very integrated parallel computers are one way to implement a parallel structure. The material structure is then composed of many elementary blocks, such as ASICs or multi-chip modules (MCMs), each containing many processors. We show that former SFT methods fail in combining the different features, constraints and requirements of such structures. Thus, this paper introduces a new reconfiguration approach that is dedicated to very integrated parallel computers.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132547669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measurement and modeling of burst packet losses in Internet end-to-end communications
M. Arai, Atsushi Chiba, K. Iwasaki
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816237
We have measured the packet loss ratio, its time dependency, and the frequency of burst packet losses in Internet end-to-end communications. To do this, we developed a tool that sends and receives UDP (User Datagram Protocol) packets. Our measurements showed that long burst losses are more likely when the packet loss ratio is high. We then examined two models for calculating the burst packet loss, an independent loss model and a Markov-chain model, to see whether they explain the packet loss characteristics we measured. They did not, so we developed a sine model, in which the packet loss probability depends on the time of day. Theoretical analysis and simulations showed that this model explains the characteristics of the burst packet losses that we measured.
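The two classical models the authors examined, and the time-of-day dependence behind their sine model, can be sketched as follows (all parameters are assumptions chosen for illustration): independent losses draw each packet's fate separately, a two-state Markov (Gilbert-style) chain produces longer bursts at the same average loss ratio, and the sine variant modulates the loss probability over the day.

```python
import math
import random

def independent_losses(n, p):
    """Each packet is lost independently with probability p."""
    return [random.random() < p for _ in range(n)]

def gilbert_losses(n, p_good_to_bad=0.01, p_bad_to_good=0.3):
    """Two-state Markov model: packets are lost while in the bad state."""
    bad, trace = False, []
    for _ in range(n):
        if bad:
            bad = random.random() >= p_bad_to_good   # stay bad with prob 1 - p_bad_to_good
        else:
            bad = random.random() < p_good_to_bad
        trace.append(bad)
    return trace

def sine_losses(n, base=0.02, amplitude=0.015, period_s=86_400, packets_per_s=100):
    """Loss probability varies sinusoidally with (simulated) time of day."""
    trace = []
    for i in range(n):
        t = i / packets_per_s
        p = base + amplitude * math.sin(2 * math.pi * t / period_s)
        trace.append(random.random() < p)
    return trace

def mean_burst_length(trace):
    """Average length of consecutive-loss runs in a loss trace."""
    bursts, run = [], 0
    for lost in trace:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0

for name, trace in [("independent", independent_losses(100_000, 0.032)),
                    ("Markov", gilbert_losses(100_000)),
                    ("sine", sine_losses(100_000))]:
    print(f"{name:>11}: loss ratio {sum(trace) / len(trace):.3f}, "
          f"mean burst length {mean_burst_length(trace):.2f}")
```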
{"title":"Measurement and modeling of burst packet losses in Internet end-to-end communications","authors":"M. Arai, Atsushi Chiba, K. Iwasaki","doi":"10.1109/PRDC.1999.816237","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816237","url":null,"abstract":"We have measured the packet loss ratio, its time dependency, and the frequency of burst packet losses in Internet end-to-end communications. To do this, we developed a tool that sends and receives UDP (User Datagram Protocol) packets. Our measurements showed that long burst losses are more likely when the packet loss ratio is high. We then examined two models for calculating the burst packet loss, an independent loss model and a Markov-chain model, to see whether they explain the packet loss characteristics we measured. They did not, so we developed a sine model, in which the packet loss probability depends on the time of day. Theoretical analysis and simulations showed that this model explains the characteristics of the burst packet losses that we measured.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122037809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining methods for the analysis of a fault-tolerant system
Hui Shi, J. Peleska, M. Kouvaras
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816222
This paper presents experiences gained from the verification of a large-scale, real-world embedded system by means of formal methods. This industrial verification project was performed for a fault-tolerant system designed and implemented by DaimlerChrysler Aerospace for the International Space Station (ISS). The verification addressed various aspects of system correctness, such as deadlock and livelock analysis and correct protocol implementation. The approach is based on CSP specifications and uses the model-checking tool FDR, combining methods for development as well as for analysis. It is illustrated by examples and results obtained during the verification of the Byzantine agreement protocol implementation, where a combination of different abstraction methods is required.
{"title":"Combining methods for the analysis of a fault-tolerant system","authors":"Hui Shi, J. Peleska, M. Kouvaras","doi":"10.1109/PRDC.1999.816222","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816222","url":null,"abstract":"This paper presents experiences gained from the verification of a large-scale real-world embedded system by means of formal methods. This industrial verification project was performed for a fault-tolerant system designed and implemented by DaimlerChrysler Aerospace for the International Space Station ISS. The verification involved various aspects of system correctness, like deadlock and livelock analysis, correct protocol implementation, etc. The approach is based on CSP specifications and uses the model-checking tool FDR. It is realized by combining methods for the development as well as for the analysis. It is illustrated by examples and results obtained during the verification of the Byzantine agreement protocol implementation, where the combination of different abstraction methods is required.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125583670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing dependability via parameterized refinement
E. Troubitsyna
Pub Date: 1999-12-16 | DOI: 10.1109/PRDC.1999.816221
A probabilistic extension of the refinement calculus has been successfully applied to the design of safety-critical systems. The approach rests on a firm mathematical foundation within which reasoning about the correctness and behavior of the system under construction is carried out. The framework also allows us to obtain a quantitative assessment of the attributes of system dependability. We present an extension of our main design technique, refinement: the so-called parameterized refinement. The purpose of the extension is to create a technique that facilitates refining a system in such a way that the dependability of the implementation is maximal. We focus mostly on the reliability aspect. Parameterized refinement addresses the problem of how to build more reliable systems by incorporating statistical information about the controlled environment and the reliabilities of system components into the development process. We illustrate this with a case study: the development of a state monitoring system.
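As a purely illustrative example of the kind of quantitative assessment such a framework can support (the quantities below are assumptions, not taken from the paper): if the monitored environment enters a hazardous state with probability $p$ per cycle, the sensor reports it correctly with probability $r_s$, and the controller then reacts correctly with probability $r_a$, the per-cycle probability of correct behaviour is

$$P_{\mathrm{correct}} = (1 - p) + p \, r_s \, r_a ,$$

and a parameterized refinement would choose implementation parameters that maximize this quantity under the given environment statistics.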
{"title":"Enhancing dependability via parameterized refinement","authors":"E. Troubitsyna","doi":"10.1109/PRDC.1999.816221","DOIUrl":"https://doi.org/10.1109/PRDC.1999.816221","url":null,"abstract":"A probabilistic extension of the refinement calculus has been successfully applied in the design of safety-critical systems. The approach is based on a firm mathematical foundation within which the reasoning about correctness and behavior of the system under construction is carried out. The framework allows us also to obtain a quantitative assessment of the attributes of system dependability. We present an extension of our main design technique-refinement-the so-called parameterized refinement. The purpose of the extension is to create a technique which facilitates refinement of a system in such a way that the dependability of the implementation would be maximal. We mostly focus on the reliability aspect. The parameterized refinement resolves the problem of how to build more reliable systems by incorporating statistical information about a controlled environment and reliabilities of system components in the development process. We illustrate this by a case study-the development of a state monitoring system.","PeriodicalId":389294,"journal":{"name":"Proceedings 1999 Pacific Rim International Symposium on Dependable Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129480343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}