Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers: Latest Papers

Measurement of failure rate in widely distributed software
R. Chillarege, S. Biyani, J. Rosenthal
In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is that of measuring the customer-perceived failure rate of commercial software. Unfortunately, even order-of-magnitude measures of failure rate are not truly available for widely distributed commercial software. Given repeated reports on the criticality of software and its significance, the industry flounders for some real baselines. The paper reports the failure rate of a several-million-line commercial software product distributed to hundreds of thousands of customers. To a first order of approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, corresponding to the number of failures due to a fault, and the failure window, measuring the length of time between the first and last fault. The two metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant by severity. The fault weight and failure window are natural measures that give intuition about the failure process. The fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new forum for discussion and an opportunity to gain greater understanding of the processes involved.
DOI: https://doi.org/10.1109/FTCS.1995.466957 (published 1995-06-27)
Citations: 97
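The two metrics named in the abstract can be illustrated with a small sketch. It assumes a failure log of `(fault_id, date)` pairs and reads the failure window as the time between the first and last failure attributed to a fault; the function name `fault_metrics`, the log layout, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from datetime import date


def fault_metrics(failures):
    """Map each fault to (weight, window_days).

    failures: iterable of (fault_id, failure_date) pairs (assumed layout).
    weight      = number of failures attributed to the fault
    window_days = days between its first and last recorded failure
    """
    by_fault = defaultdict(list)
    for fault_id, when in failures:
        by_fault[fault_id].append(when)
    return {
        fault_id: (len(dates), (max(dates) - min(dates)).days)
        for fault_id, dates in by_fault.items()
    }


# Made-up sample log, for illustration only.
sample_log = [
    ("F1", date(1994, 1, 10)),
    ("F1", date(1994, 3, 1)),
    ("F1", date(1994, 6, 9)),
    ("F2", date(1994, 2, 5)),
]
metrics = fault_metrics(sample_log)
```

Dividing `window_days` by `weight` for each fault would give the window-to-weight ratio the abstract describes as invariant by severity.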
On-line error monitoring for several data structures
J. Bright, G. Sullivan
We present several examples of programs which efficiently monitor the answers from queries performed on data structures to determine if any errors are present. Our paper includes the first efficient on-line error monitor for a data structure designed to perform nearest neighbor queries. Applications of nearest neighbor queries are extensive and include learning, categorization, speech processing, and data compression. Our paper also discusses on-line error monitors for priority queues and splittable priority queues. On-line error monitors immediately detect if an error is present in the answer to a query. An error monitor which is not on-line may delay the time of detection until a later query is being processed, which may allow the error to propagate or may cause irreversible state changes. On-line monitors can allow a more rapid and accurate response to an error.
DOI: https://doi.org/10.1109/FTCS.1995.466960 (published 1995-06-27)
Citations: 15
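The idea of checking every answer at query time can be sketched for a priority queue. This is a deliberately naive reference-model monitor, not the efficient scheme the paper describes: it keeps a sorted shadow list and compares each `extract_min` answer against it. The class name `MonitoredMinQueue` and its methods are hypothetical.

```python
import bisect
import heapq


class MonitoredMinQueue:
    """Min-priority queue whose answers are checked on-line against a shadow copy."""

    def __init__(self):
        self._heap = []    # structure being monitored
        self._shadow = []  # sorted reference list (naive: O(n) per operation)

    def insert(self, item):
        heapq.heappush(self._heap, item)
        bisect.insort(self._shadow, item)

    def extract_min(self):
        answer = heapq.heappop(self._heap)
        expected = self._shadow.pop(0)
        # The check runs before the answer is returned, so an error is
        # reported on the very query that exposes it.
        if answer != expected:
            raise RuntimeError(f"monitor: got {answer!r}, expected {expected!r}")
        return answer
```

Because the check happens inside `extract_min`, a corrupted answer never reaches the caller, which is the property that separates an on-line monitor from one that detects the error only on a later query.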
Reduced overhead logging for rollback recovery in distributed shared memory
G. Suri, B. Janssens, W. Fuchs
Rollback techniques that use message logging and deterministic replay can be used in parallel systems to recover a failed node without involving other nodes. Distributed shared memory (DSM) systems cannot directly apply message-passing logging techniques because they use inherently nondeterministic asynchronous communication. This paper presents new logging schemes that reduce the typically high overhead for logging in DSM. Our algorithm for sequentially consistent systems tracks rather than logs accesses to shared memory. In an extension of this method to lazy release consistency, the per-access overhead of tracking has been completely eliminated. Measurements with parallel applications show a significant reduction in failure-free overhead.
DOI: https://doi.org/10.1109/FTCS.1995.466971 (published 1995-06-27)
Citations: 77
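The message-passing baseline that the abstract starts from, message logging with deterministic replay, can be sketched as an in-memory simulation. This is only an illustration of that general idea, not the paper's DSM tracking scheme; the class `LoggedNode`, its state-transition rule, and the use of plain Python lists in place of stable storage are all assumptions.

```python
class LoggedNode:
    """Single node that logs received messages and recovers by replaying them."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0  # last saved state (stands in for stable storage)
        self.log = []        # messages received since the last checkpoint

    def _apply(self, msg):
        # Any deterministic transition works; replay relies on determinism.
        self.state = self.state * 31 + msg

    def deliver(self, msg):
        self.log.append(msg)  # log first so the message survives a crash
        self._apply(msg)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self._apply(msg)
```

Recovery touches only the failed node's own checkpoint and log, which is why no other node has to roll back in this model.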
Dependability at the user interface
R. Maxion, Aimee L. deChambeau
Even if a system's hardware and software underpinnings are completely reliable, errors at the user interface can cripple or destroy a mission, often with catastrophic consequences. Little attention has been paid to handling faults and errors at the user interface; their causes and remediations are little understood, and methods of predeployment fault detection in user interfaces are almost nonexistent. The paper presents a working definition of a user interface defect, and a robust method for detecting defects automatically. An experimental methodology for empirical testing and validation is given. Results show that while manifestations of defects may be many, only a few root causes are responsible for them.
DOI: https://doi.org/10.1109/FTCS.1995.466944 (published 1995-06-27)
Citations: 11
Fault tolerance in concurrent object-oriented software through coordinated error recovery
Jie Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. Stroud, Zhixue Wu
This paper presents a scheme for coordinated error recovery between multiple interacting objects in a concurrent object-oriented system. A conceptual framework for fault tolerance is established based on a general object concurrency model that is supported by most concurrent object-oriented languages and systems. This framework integrates two complementary concepts: conversations and transactions. Conversations (associated with cooperative exception handling) are used to provide coordinated error recovery between concurrent interacting activities, whilst transactions are used to maintain the consistency of shared resources in the presence of concurrent access and possible failures. The serialisability property of transactions is exploited in order to help prevent unexpected information smuggling. The proposed framework is illustrated by means of a case study, and various linguistic and implementation issues are discussed.
DOI: https://doi.org/10.1109/FTCS.1995.466948 (published 1995-06-27)
Citations: 225
Stopping rules for the operational testing of safety-critical software
B. Littlewood, David Wright
It has been proposed to conduct a test of a software safety system for a nuclear reactor by subjecting it to demands that are statistically representative of those it meets in operational use. The intention behind the test is to acquire a high confidence (99%) that the probability of failure on demand is smaller than 10^-3. To this end the test takes the form of executing about 5000 demands and requiring that all of these are successful. In practice it is necessary to consider what happens if the software fails the test and is repaired. We argue that the earlier failure information needs to be taken into account in devising the form of the test that the modified software needs to pass; essentially, after such a failure the testing requirement might need to be more stringent (i.e. the number of tests that must be executed failure-free should increase). We examine a Bayesian approach to the problem, for this stopping rule based upon a required bound for the probability of failure on demand, as above, and also for a requirement based upon a prediction of future failure behaviour. We show that the first approach seems to be less conservative than the second, and argue that the second should be preferred for practical application.
DOI: https://doi.org/10.1109/FTCS.1995.466955 (published 1995-06-27)
Citations: 31
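The size of such a test can be checked with a short calculation. The sketch below uses the standard no-prior bound, the smallest n with (1 - p)^n <= 1 - confidence, and not the Bayesian stopping rules developed in the paper; the function name `failure_free_demands` is hypothetical.

```python
import math


def failure_free_demands(pfd_bound, confidence):
    """Smallest n such that (1 - pfd_bound)**n <= 1 - confidence.

    If the true probability of failure on demand were at least pfd_bound,
    n consecutive successes would occur with probability at most
    1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-pfd_bound))


# The target quoted in the abstract: pfd < 10^-3 at 99% confidence.
n = failure_free_demands(1e-3, 0.99)
```

For these inputs the bound comes out at roughly 4,600 failure-free demands, the same order as the "about 5000" quoted in the abstract.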
A model for the analysis of the fault injection process
A. Steininger, H. Schweinzer
Results of fault injection experiments performed under different conditions can only be related to each other if their interpretation is based on a thorough understanding of activation and propagation of faults and errors. We analyze these processes by applying a special layer model of a computing system. Our aim is to model the transformation of a fault on a signal line into a system failure as the propagation of erroneous information through multiple layers. Two specific layers that describe the fault activation process have been sufficiently completed and are presented here. A quantification for these is derived and different applications are summarized. Excellent correspondence between analytical results based on modeling and experimental data is found. A prediction of fault activation with high accuracy is possible, as well as a quantitative evaluation of the effect of synchronizing fault injection.
DOI: https://doi.org/10.1109/FTCS.1995.466984 (published 1995-06-27)
Citations: 10
Node covering, error correcting codes and multiprocessors with very high average fault tolerance
S. Dutt, N. Mahapatra
Most previous work on fault-tolerant (FT) multiprocessor design has concentrated on deterministic k-fault-tolerant (k-FT) designs in which exactly k spare processors and some spare switches and links are added to construct multiprocessors that can tolerate any k processor faults. However, after k faults are reconfigured around, many of the extra links and switches can remain unutilized. We show how to use the node-covering principle of Dutt and Hayes (1992) and error correcting codes in order to construct probabilistic designs with very high average fault tolerance but low wiring and switch overhead. This design methodology is applicable to any multiprocessor interconnection topology. We also obtain the deterministic fault tolerance for these designs and develop efficient layout strategies for them.
DOI: https://doi.org/10.1109/FTCS.1995.466967 (published 1995-06-27)
Citations: 27
A recoverable distributed shared memory integrating coherence and recoverability
Anne-Marie Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, I. Puaut
Large-scale distributed systems are very attractive for the execution of parallel applications requiring huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
DOI: https://doi.org/10.1109/FTCS.1995.466970 (published 1995-06-27)
Citations: 80
Dependability modelling in a prototype development framework
J. Bass, Sylvain Metge, A. Browne, P. Croll, P. Fleming
The Development Framework provides a highly automatic translation from a specification to an implementation. The specification is in a popular, graphical control engineering notation typically representing a system with stringent reliability requirements and hard real-time constraints. An interface has been constructed between the Development Framework and the commercially available dependability modelling tool, SURF-2. This tool is designed to support an evaluation-based design approach. Multiple design solutions can be compared to assess the implications of design decisions on the dependability of the system under development. The software demonstration will show how the interface between the Development Framework and SURF-2 is used to model the inclusion of selected fault-tolerant mechanisms in the system under development.
DOI: https://doi.org/10.1109/FTCS.1995.466990 (published 1995-06-27)
Citations: 2