
[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers: latest papers

Distance agreement protocols
K. Echtle
A novel class of agreement protocols suitable for replicated nondeterministic processes is introduced. Reduction of message number and early stopping are achieved by taking distance decisions not after, but during protocol execution. Metrical comparison of results is not restricted to numerical applications. Unlike median selection, it covers multidimensional spaces and helps to solve typical problems of distributed systems, e.g., global scheduling, synchronization, sequence agreement, reconfiguration, and elimination of time skew. A so-called pendulum protocol is described in detail.
DOI: https://doi.org/10.1109/FTCS.1989.105565
Published: 1989-06-21
Citations: 15
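The contrast the distance agreement abstract draws with median selection can be illustrated with a small sketch. It assumes a Euclidean metric and a hypothetical helper `select_by_distance`; it is not the pendulum protocol and exchanges no messages. It only shows how a result can be chosen by distances in a multidimensional space, where a per-coordinate median has no natural meaning.

```python
import math
from typing import Sequence

Point = Sequence[float]

def euclidean(a: Point, b: Point) -> float:
    """Distance between two results; any other metric could be substituted."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_by_distance(results: list[Point]) -> Point:
    """Return the result with the smallest total distance to all results.

    This is a medoid (an assumption of this sketch): it is always one of
    the submitted results and needs only a metric, not an ordering.
    """
    if not results:
        raise ValueError("no results to compare")
    return min(results, key=lambda r: sum(euclidean(r, other) for other in results))

# Five replicas roughly agree; one outlier does not.
replica_results = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
chosen = select_by_distance(replica_results)
```

The outlier at (10.0, 10.0) has the largest total distance and is never selected, which is the property a distance-based decision relies on.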
Language constructs for timed atomic commitment
S. Davidson, Insup Lee, V. Wolfe
In a large class of hard-real-time control applications, components execute concurrently on distributed nodes and must coordinate, under timing constraints, to perform the control task. As such, they perform a type of atomic commitment. In traditional atomic commitment there are no timing constraints; agreement is eventual. The authors present a definition of timed atomic commitment (TAC) which requires the processes to be functionally consistent, but allows the outcome to include an exceptional state, indicating that faults have caused timing constraints to be violated. The authors also present a high-level language construct that facilitates the use of TAC in distributed real-time programming and discuss its behavior when faults occur.
DOI: https://doi.org/10.1109/FTCS.1989.105621
Published: 1989-06-21
Citations: 3
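The three-way outcome described in the timed atomic commitment abstract can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the authors' language construct: the names `Outcome` and `timed_outcome`, the vote representation, and the rule that any missing or late vote yields the exceptional state are all hypothetical.

```python
from enum import Enum
from typing import Mapping, Optional, Tuple

class Outcome(Enum):
    COMMIT = "commit"
    ABORT = "abort"
    EXCEPTION = "exception"  # a timing constraint was violated

# A vote is (wants_commit, arrival_time); None means no vote was received.
Vote = Optional[Tuple[bool, float]]

def timed_outcome(votes: Mapping[str, Vote], deadline: float) -> Outcome:
    """Derive one outcome, shared by all participants, from votes and a deadline."""
    # Assumption: a missing or late vote is treated as a timing fault.
    for vote in votes.values():
        if vote is None or vote[1] > deadline:
            return Outcome.EXCEPTION
    # All votes arrived in time: ordinary atomic commitment rule.
    if all(vote[0] for vote in votes.values()):
        return Outcome.COMMIT
    return Outcome.ABORT

on_time = timed_outcome({"sensor": (True, 1.0), "actuator": (True, 2.5)}, deadline=5.0)
late = timed_outcome({"sensor": (True, 1.0), "actuator": (True, 7.0)}, deadline=5.0)
```

Because every participant evaluates the same function over the same inputs, the outcome stays functionally consistent even when it is the exceptional state.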
Control-flow checking using watchdog assists and extended-precision checksums
N. Saxena, E. McCluskey
A control-flow checking method is proposed. Extended-precision checksum-based control-flow checking is shown to have low error detection latency compared to previously proposed methods. Analytical measures are derived to demonstrate the effectiveness of using extended-precision checksums for control-flow checking. The error detection latency in the extended-precision checksum-based control-flow checking remains relatively constant for both single and multiple sequence errors. In the case of signature-based methods, error detection latency increases linearly with the number of sequence errors. A watchdog assist architecture for control-flow checking in programs is defined. Unlike previously proposed control-flow checking methods, this watchdog assist architecture is well suited for multiprocessor, multiprogramming, and cache-based environments. The Hewlett-Packard Precision Architecture is used as an example to demonstrate the feasibility of watchdog assists.
DOI: https://doi.org/10.1109/FTCS.1989.105615
Published: 1989-06-21
Citations: 104
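The checksum idea in the control-flow checking abstract can be sketched as follows. The sketch assumes that "extended precision" means accumulating instruction words in a register wider than one word, so that carries are kept rather than discarded; the word width, accumulator width, and function names are assumptions for illustration, not details taken from the paper.

```python
WORD_BITS = 16       # assumed instruction word width
CHECKSUM_BITS = 32   # assumed wider accumulator ("extended precision")

def block_checksum(words: list[int]) -> int:
    """Sum the instruction words of a block in an accumulator wider than a word."""
    mask = (1 << CHECKSUM_BITS) - 1
    total = 0
    for word in words:
        if not 0 <= word < (1 << WORD_BITS):
            raise ValueError("instruction word out of range")
        total = (total + word) & mask
    return total

def check_block(executed: list[int], reference: int) -> bool:
    """Return True when the executed stream matches the precomputed checksum."""
    return block_checksum(executed) == reference

block = [0x1234, 0xFFFF, 0x0001]
reference = block_checksum(block)            # computed ahead of time
ok = check_block(block, reference)           # fault-free run
skipped = check_block(block[:-1], reference) # one instruction was skipped
```

A plain sum does not notice reordered instructions, so this sketch only illustrates detection of dropped or altered words.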
Fault diagnosis for sparsely interconnected multiprocessor systems
D. Blough, G. Sullivan, G. Masson
The authors present a general approach to fault diagnosis that is widely applicable and requires only a limited number of connections among units. Each unit in the system forms a private opinion on the status of each of its neighboring units based on duplication of jobs and comparison of job results over time. A diagnosis algorithm that consists of simply taking a majority vote among the neighbors of a unit to determine the status of that unit is then executed. The performance of this simple majority-vote diagnosis algorithm is analyzed using a probabilistic model for the faults in the system. It is shown that with high probability, for systems composed of n units, the algorithm will correctly identify the status of all units when each unit is connected to O(log n) other units. It is also shown that the algorithm works with high probability in a class of systems in which the average number of neighbors of a unit is constant. The results indicate that fault diagnosis can in fact be achieved quite simply in multiprocessor systems containing a low to moderate number of testing conditions.
DOI: https://doi.org/10.1109/FTCS.1989.105544
Published: 1989-06-21
Citations: 48
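The majority-vote step described in the fault diagnosis abstract can be sketched in a few lines. The data layout (`opinions[u][v]` is True when unit u believes neighbor v is faulty), the function name `diagnose`, and the strict-majority tie rule are assumptions of this sketch; how each unit forms its private opinion from job comparisons is not modeled.

```python
from typing import Dict, Hashable, List, Mapping

def diagnose(opinions: Mapping[Hashable, Mapping[Hashable, bool]]) -> Dict[Hashable, bool]:
    """Diagnose each unit by a majority vote among its neighbors.

    opinions[u][v] is True when unit u believes its neighbor v is faulty.
    A unit is diagnosed faulty when a strict majority of the neighbors
    holding an opinion about it say so (ties count as fault-free).
    """
    votes: Dict[Hashable, List[bool]] = {}
    for observer, view in opinions.items():
        votes.setdefault(observer, [])
        for target, says_faulty in view.items():
            votes.setdefault(target, []).append(says_faulty)
    return {unit: sum(v) * 2 > len(v) for unit, v in votes.items()}

# Four fully connected units; "d" is faulty and accuses everyone.
opinions = {
    "a": {"b": False, "c": False, "d": True},
    "b": {"a": False, "c": False, "d": True},
    "c": {"a": False, "b": False, "d": True},
    "d": {"a": True, "b": True, "c": True},
}
status = diagnose(opinions)
```

The faulty unit's false accusations are outvoted by the two fault-free neighbors of each healthy unit, so only "d" is diagnosed faulty.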
Performability of a token bus network under transient fault conditions
J. F. Meyer, K. Muralidhar, W. Sanders
The authors present the results of a detailed performability evaluation of a network using the IEEE 802.4 protocol. In particular, a 30-station IEEE 802.4 token bus network operating in a hostile factory environment is evaluated using stochastic activity networks. Stochastic activity networks, a generalization of stochastic Petri nets, provide a convenient representation for computer networks and are formal enough to permit solution by both analysis and simulation. The evaluation results show (1) that stochastic activity networks are an appropriate model type for evaluating the performability of local-area networks, and (2) that the protocol is extremely tolerant to transient faults such as token losses and noise bursts under moderate network loads.
DOI: https://doi.org/10.1109/FTCS.1989.105562
Published: 1989-06-21
Citations: 12
An analytical model for computing hypercube availability
C. Das, Jong Kim
An analytical model is presented for computing the availability of an n-dimensional hypercube. The model computes the probability of j connected working nodes in a hypercube by multiplying two probabilistic terms. The first term is the probability of x connected nodes (x ≥ j) working out of 2^n fully connected nodes. This is obtained from the numerical solution of the well-known machine repairman model, modified to capture imperfect coverage and imprecise repair. The second term, which is the probability of having j connected nodes in a hypercube, is computed from an approximate model of the hypercube. The approximate model, in turn, is based on a decomposition principle, where an n-cube connectivity is computed from a two-cube base model using a recursive equation. The availability model studied in this paper is known as task-based availability, where a system remains operational as long as a task can be executed on the system. Analytical results from n-dimensional cubes are given for various task requirements. The model is validated by comparing the analytical results with those from simulation.
DOI: https://doi.org/10.1109/FTCS.1989.105631
Published: 1989-06-21
Citations: 4
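The connectivity term in the hypercube availability abstract can be checked by brute force for very small cubes. The sketch below is not the paper's recursive approximation: it enumerates every way x of the 2^n nodes can be working (assumed equally likely) and counts how often some connected group has at least j nodes. The function names are hypothetical, and the enumeration is only practical for n of about 4 or less.

```python
from fractions import Fraction
from itertools import combinations

def largest_component(nodes: frozenset, n: int) -> int:
    """Size of the largest connected group of working nodes in an n-cube."""
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for bit in range(n):
                v = u ^ (1 << bit)  # hypercube neighbors differ in exactly one bit
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def prob_connected(n: int, x: int, j: int) -> Fraction:
    """P(some connected group has >= j nodes | exactly x of 2**n nodes work)."""
    subsets = list(combinations(range(2 ** n), x))
    hits = sum(largest_component(frozenset(s), n) >= j for s in subsets)
    return Fraction(hits, len(subsets))

# In a 2-cube (a 4-cycle), two working nodes are adjacent in 4 of 6 cases.
two_of_four = prob_connected(2, 2, 2)
```

An exact enumeration like this could serve as a reference when validating an approximate model on small cubes, in the same spirit as the simulation comparison the abstract mentions.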
Message routing in HARTS with faulty components
A. Olson, K. Shin
The authors develop a routing scheme in two steps for a wrapped hexagonal mesh, called HARTS (hexagonal architecture for real-time systems), which ensures the delivery of every message as long as there is a path between its source and destination. The scheme can also detect the nonexistence of a path between a pair of nodes in a finite amount of time. Moreover, the scheme requires each node in HARTS to know only the state (faulty or not) of each of its own links. The performance of the simple routing scheme is simulated for three- and five-dimensional H-meshes while the physical distribution of faulty components is varied. It is shown that a shortest path between the source and the destination of each message is taken with a high probability, and a path, if one exists, is usually found very quickly.
DOI: https://doi.org/10.1109/FTCS.1989.105588
Published: 1989-06-21
Citations: 10
Fail-softness evaluation in multiple-bus local computer networks
Vikram V. Karmarkar, J. G. Kuhl
A fail-softness evaluation methodology is presented which is suitable for quantifying the graceful degradation characteristics of local computer networks (LCN) using multiple buses. The approach quantifies degradation of performance due to failure over any given application lifetime and also yields a single figure of merit that can be used for comparison of alternative multiple-bus LCN architectures with specific reliability/cost constraints. The analysis technique models both network service failures and configuration-related delay characteristics. Existing notions of performability analysis and bandwidth availability are used in the modeling process to derive a combined performance/reliability measure. The fail-softness analysis is used to compare several alternative multiple-bus architectures, which use different demand-assignment multiple-access (DAMA) methods. A class of integrated access methodologies that use a single shared token to arbitrate access to all buses is shown to exhibit generally superior performance/reliability characteristics as compared to other alternatives, such as those which use an independent DAMA protocol for each bus.
DOI: https://doi.org/10.1109/FTCS.1989.105632
Published: 1989-06-21
Citations: 1
Understanding large system failures - a fault injection experiment
R. Chillarege, N. Bowen
Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
DOI: https://doi.org/10.1109/FTCS.1989.105592
Published: 1989-06-21
Citations: 191
A system for supporting multi-language versions for software fault tolerance
James M. Purtilo, P. Jalote
A description is given of a system that allows versions to be coded in different programming languages. The system supports both the recovery block scheme and the N-version programming method. It permits fault tolerance to be used for specified modules that could be embedded in a larger program. The system also allows the different versions to be executed on different machines. It has been implemented in C on DEC VAXes and Sun 3 workstations and operates in a network of Unix-based machines.
DOI: https://doi.org/10.1109/FTCS.1989.105578
Published: 1989-06-21
Citations: 14
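The two schemes named in the software fault tolerance abstract, recovery blocks and N-version programming, can be sketched in a single language to show how they differ. This is a minimal illustration, not the described system: the function names are hypothetical, versions run in one process rather than on different machines or in different languages, and the state checkpointing that a real recovery block performs before each alternate is omitted.

```python
from collections import Counter
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def recovery_block(alternates: Iterable[Callable[[], T]],
                   acceptance_test: Callable[[T], bool]) -> T:
    """Run alternates in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate()
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def n_version(versions: Iterable[Callable[[], T]]) -> T:
    """Run every version and return the result a strict majority agrees on.

    Results must be hashable so that they can be counted.
    """
    versions = list(versions)
    results = []
    for version in versions:
        try:
            results.append(version())
        except Exception:
            continue  # a crashed version casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count * 2 > len(versions):
            return value
    raise RuntimeError("no majority among versions")

# The primary alternate returns an invalid (negative) square root.
root = recovery_block([lambda: -3, lambda: 3], acceptance_test=lambda r: r >= 0)
# Two of three versions agree, so the odd one out is masked.
voted = n_version([lambda: 4, lambda: 4, lambda: 5])
```

The recovery block needs an acceptance test but runs as few versions as possible, while N-version programming needs no acceptance test but always runs every version.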