Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Latest articles


The φ accrual failure detector

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353004

Naohiro Hayashibara, X. Défago, Rami Yared, T. Katayama

The detection of failures is a fundamental issue for fault tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far, one reason being that classical failure detectors were not designed to satisfy several application requirements simultaneously. We present a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems. Instead of providing information of a binary nature (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale. The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment. In this paper, we describe an implementation of such an accrual failure detector, which we call the φ failure detector. The particularity of the φ failure detector is that it dynamically adjusts the scale on which the suspicion level is expressed to current network conditions. We analyzed the behavior of our φ failure detector over an intercontinental communication link for a week. Our experimental results show that it performs as well as other known adaptive failure detection mechanisms, with improved flexibility.
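The idea of a continuous suspicion level can be illustrated with a small sketch. The abstract does not give the formula, so what follows is an assumption: heartbeat inter-arrival times are modelled as normally distributed over a sliding window, and the suspicion level is taken as φ = -log10(P_later), where P_later is the estimated probability that the next heartbeat arrives later than the time already elapsed. The class name, the window size, and the `min_std` floor are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Illustrative accrual detector; names and defaults are hypothetical."""

    def __init__(self, window_size=1000, min_std=1e-3):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_arrival = None
        self.min_std = min_std  # floor that avoids division by zero

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; it grows as the silence lengthens."""
        if len(self.intervals) < 2:
            return 0.0  # too few samples to estimate a distribution
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under a normal model
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # keep log10 finite
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for arrival in (0.0, 1.0, 2.1, 3.0, 4.05, 5.0):
    detector.heartbeat(arrival)
for t in (5.5, 6.0, 6.5):
    print(t, round(detector.phi(t), 3))
```

Under these assumptions, an application would compare `phi(now)` against its own threshold: a low threshold for quick, tentative reactions and a higher one for costly actions. This is one way to read the decoupling between monitoring and application requirements that the abstract describes.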


Citations: 121

An hoarding approach for supporting disconnected write operations in mobile environments

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353028

A. Vora, Z. Tari, P. Bertók

Caching is one technique that reduces costs and improves performance in mobile environments. It also increases availability during temporary, involuntary disconnections. However, our focus is on voluntary, client-initiated disconnections, where hoarding can be used to predict data requirements. Existing hoarding approaches ignore conflicts arising out of write sharing and are thus unable to deal with them. However, since conflicts are detrimental to bandwidth utilisation, for scenarios with high write sharing, hoarding techniques need to provide support for sharing in a manner that reduces or avoids conflicts. We propose a hoarding approach for disconnected write operations that focuses on reducing the likelihood of conflicts arising from write sharing in a highly concurrent environment. Data that clients might need when disconnected are predicted based on the notion of semantic similarity. To avoid or reduce conflicts, data are first clustered based on their update probabilities. The hoard tree is then created based on the clusters and the semantic similarity between data. Simulations show an increase in the cache hit rate along with a reduction in the total number of conflicts.


Citations: 9

How to tolerate half less one Byzantine nodes in practical distributed systems

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353018

M. Correia, N. Neves, P. Veríssimo

The application of dependability concepts and techniques to the design of secure distributed systems is raising a considerable amount of interest in both communities under the designation of intrusion tolerance. However, practical intrusion-tolerant replicated systems based on the state machine approach (SMA) can handle at most f Byzantine components out of a total of n = 3f + 1, which is the maximum resilience in asynchronous systems. This paper extends the normal asynchronous system with a special distributed oracle called TTCB. Using this extended system we manage to implement an intrusion-tolerant service based on the SMA with only 2f + 1 replicas. Although a few other papers in the literature present intrusion-tolerant services with this approach, this is the first time the number of replicas is reduced from 3f + 1 to 2f + 1. Another interesting characteristic of the described service is its low time complexity.
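The replica counts quoted in the abstract can be made concrete with a little arithmetic. The sketch below only evaluates the two formulas, n = 3f + 1 and n = 2f + 1; the function names and the `with_oracle` flag are hypothetical, and nothing here models the TTCB itself.

```python
def replicas_needed(f, with_oracle):
    """Replicas required to tolerate f Byzantine replicas.

    with_oracle=False: classical asynchronous bound, n = 3f + 1.
    with_oracle=True:  bound reported for the oracle-extended system, n = 2f + 1.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if with_oracle else 3 * f + 1


def faults_tolerated(n, with_oracle):
    """Largest f that a group of n replicas can tolerate under each bound."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 2 if with_oracle else (n - 1) // 3


for f in (1, 2, 3):
    print(f, replicas_needed(f, False), replicas_needed(f, True))
```

Read the other way round, a group of seven replicas tolerates two Byzantine replicas under the 3f + 1 bound and three under the 2f + 1 bound, which is the "half less one" of the title.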


Citations: 158

State maintenance and its impact on the performability of multi-tiered Internet services

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353015

G. Gama, K. Nagaraja, R. Bianchini, R. Martin, Wagner Meira Jr, Thu D. Nguyen

In this paper, we evaluate the performance, availability, and combined performability of four soft state maintenance strategies in two multitier Internet services, an online book store and an auction service. To take soft state and service latency into account, we propose an extension of our previous quantification methodology, and novel availability and performability metrics. Our results demonstrate that storing the soft state in a database can achieve better performability than storing it in main memory, even when the state is efficiently replicated. Strategies that offload the handling of soft state from the database increase the load on other tiers and, consequently, increase the impact of faults in these tiers on service availability. Based on these results, we conclude that service designers need to provision the cluster and balance the load with availability and cost, as well as performance, in mind.


Citations: 11

The mutable consensus protocol

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353023

J. Pereira, R. Oliveira

In this paper we propose the mutable consensus protocol, a pragmatic and theoretically appealing approach to enhance the performance of distributed consensus. First, an apparently inefficient protocol is developed using the simple stubborn channel abstraction for unreliable message passing. Then, performance is improved by introducing judiciously chosen finite delays in the implementation of channels. Although this does not compromise correctness, which rests on an asynchronous system model, it makes it likely that the transmission of some messages is avoided and thus the message exchange pattern at the network level changes noticeably. By choosing different delays in the underlying stubborn channels, the mutable consensus protocol can actually be made to resemble several different protocols. Besides presenting the mutable consensus protocol and four different mutations, we evaluate in detail the particularly interesting permutation gossip mutation, which allows the protocol to scale gracefully to a large number of processes by balancing the number of messages to be handled by each process with the number of communication steps required to decide. The evaluation is performed using a realistic simulation model which accurately reproduces resource consumption in real systems.
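The effect of adding a finite delay to a stubborn channel can be sketched in a few lines. The assumption here is the usual reading of a stubborn channel: only the most recently sent message is guaranteed to get through, and it is retransmitted periodically. The class, its parameters, and the `run` helper are hypothetical and illustrate only the channel idea, not the consensus protocol or any of the mutations evaluated in the paper.

```python
class DelayedStubbornChannel:
    """Toy stubborn channel: keeps only the latest message and retransmits it."""

    def __init__(self, initial_delay, retransmit_period):
        self.initial_delay = initial_delay
        self.retransmit_period = retransmit_period
        self.pending = None
        self.next_send_at = None
        self.transmitted = []  # (time, message) pairs actually put on the wire

    def send(self, message, now):
        """Replace any pending message and restart the initial delay."""
        self.pending = message
        self.next_send_at = now + self.initial_delay

    def tick(self, now):
        """Transmit the pending message if its timer has expired."""
        if self.pending is not None and now >= self.next_send_at:
            self.transmitted.append((now, self.pending))
            self.next_send_at = now + self.retransmit_period


def run(channel, sends, horizon):
    """Drive the channel over integer time steps; `sends` maps time to message."""
    for t in range(horizon):
        if t in sends:
            channel.send(sends[t], t)
        channel.tick(t)
    return channel.transmitted


sends = {0: "round 1", 3: "round 2"}
print(run(DelayedStubbornChannel(0, 10), sends, 20))
print(run(DelayedStubbornChannel(5, 10), sends, 20))
```

With no initial delay, both messages reach the network. With a delay of five time units, "round 1" is superseded before its timer expires and is never transmitted, while the latest message is still sent eventually. This is the sense in which delays can change the message exchange pattern without dropping the guarantee on the last message.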


Citations: 15

Slow advances in fault-tolerant real-time distributed computing

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353009

K. Kim

Is fault-tolerant (FT) real-time computing a specialized branch of FT computing? The key issue in real-time (RT) computing is to economically produce systems that yield temporal behavior which is relatively easily analyzable and acceptable in given application environments. Fault-tolerant (FT) RT computing has been treated by the predominant segment of the FT computing research community as a highly specialized branch of FT computing. This author believes that the situation should be changed. It seems safe to say that FT techniques for which useful characterizations of temporal behavior have not been or cannot be developed, are at best immature, if not entirely useless. This means that FT RT computing is at the core of FT computing.


Citations: 7

Simple and efficient oracle-based consensus protocols for asynchronous Byzantine systems

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/TDSC.2005.13

R. Friedman, A. Mostéfaoui, M. Raynal

This paper is on the consensus problem in asynchronous distributed systems where (up to f) processes (among n) can exhibit a Byzantine behavior, i.e., can deviate arbitrarily from their specification. A way to solve the consensus problem in such a context consists of enriching the system with additional oracles that are powerful enough to cope with the uncertainty and unpredictability created by the combined effect of Byzantine behavior and asynchrony. Considering two types of such oracles, namely, an oracle that provides processes with random values, and a failure detector oracle, the paper presents two families of Byzantine asynchronous consensus protocols. Two of these protocols are particularly noteworthy: they allow the processes to decide in one communication step in favorable circumstances. The first is a randomized protocol that assumes n > 5f. The second one is a failure detector-based protocol that assumes n > 6f. These protocols are designed to be particularly simple and efficient in terms of communication steps, the number of messages they generate in each step, and the size of messages. So, although they are not optimal in the number of Byzantine processes that can be tolerated, they are particularly efficient when we consider the number of communication steps they require to decide, and the number and size of the messages they use. In that sense, they are practically appealing.
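The resilience conditions n > 5f and n > 6f can be compared with the optimal n > 3f bound for a given system size. The helper below is a hypothetical illustration that computes the largest f satisfying a strict inequality n > k·f; it says nothing about how the protocols themselves work.

```python
def max_byzantine(n, k):
    """Largest f such that n > k * f holds (strict inequality)."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return (n - 1) // k


# Columns: n, optimal bound (n > 3f), randomized protocol (n > 5f),
# failure-detector-based protocol (n > 6f).
for n in (7, 11, 13, 31):
    print(n, max_byzantine(n, 3), max_byzantine(n, 5), max_byzantine(n, 6))
```

For example, with 13 processes the optimal bound allows four Byzantine processes, while both one-step protocols allow two. This is the resilience given up in exchange for deciding in a single communication step in favorable circumstances.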


Citations: 89

Crash-resilient time-free eventual leadership

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353022

A. Mostéfaoui, M. Raynal, Corentin Travers

Leader-based protocols rest on a primitive able to provide the processes with the same unique leader. Such protocols are very common in distributed computing to solve synchronization or coordination problems. Unfortunately, providing such a primitive is far from trivial in asynchronous distributed systems prone to process crashes. (It is even impossible in fault-prone purely asynchronous systems.) To circumvent this difficulty, several protocols have been proposed that build a leader facility on top of an asynchronous distributed system enriched with synchrony assumptions. This paper considers another approach to building a leader facility: it considers a behavioral property of the flow of messages that are exchanged. This property has the noteworthy feature of not involving timing assumptions. Two protocols based on this time-free property that implement a leader primitive are described. The first one uses potentially unbounded counters, while the second one (which is a little more involved) requires only finite memory. These protocols rely on simple design principles that make them attractive, easy to understand and provably correct.


Citations: 43

Self checking network protocols: a monitor based approach

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353000

G. Khanna, Padma Varadharajan, S. Bagchi

The wide deployment of high-speed computer networks has made distributed systems ubiquitous in today's connected world. The machines on which the distributed applications are hosted are heterogeneous in nature, the applications often run legacy code without the availability of their source code, the systems are of very large scales, and often have soft real-time guarantees. In this paper, we target the problem of online detection of disruptions through a generic external entity called Monitor that is able to observe the exchanged messages between the protocol participants and deduce any ongoing disruption by matching against a rule base composed of combinatorial and temporal rules. The Monitor architecture is application neutral, with the rule base making it specific to a protocol. To make the detection infrastructure scalable and dependable, we extend it to a hierarchical Monitor structure. The infrastructure is applied to a streaming video application running on a reliable multicast protocol called TRAM installed on the campus wide network. The evaluation brings out the scalability of the monitor infrastructure and detection coverage under different kinds of faults for the single level and the hierarchical arrangements.
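To give a flavour of what a temporal rule evaluated by an external observer might look like, here is a minimal sketch. The abstract does not describe the rule language or the TRAM messages, so everything in it is an assumption: the `DATA` and `ACK` message kinds, the `Observed` record, and the rule "every DATA must be acknowledged within a deadline" are hypothetical.

```python
from typing import NamedTuple


class Observed(NamedTuple):
    """A message seen on the wire by the observer (hypothetical format)."""
    time: float
    kind: str  # "DATA" or "ACK"
    seq: int


def ack_deadline_violations(events, deadline, now):
    """Temporal rule: each DATA(seq) must see ACK(seq) within `deadline`.

    Returns the sorted sequence numbers that break the rule as of time `now`.
    """
    sent = {}  # seq -> time the DATA message was first observed
    violations = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "DATA":
            sent.setdefault(ev.seq, ev.time)
        elif ev.kind == "ACK" and ev.seq in sent:
            if ev.time - sent.pop(ev.seq) > deadline:
                violations.append(ev.seq)  # acknowledged, but too late
    # DATA messages still unacknowledged after the deadline has passed
    violations.extend(seq for seq, t in sent.items() if now - t > deadline)
    return sorted(violations)


trace = [
    Observed(0.0, "DATA", 1), Observed(0.5, "ACK", 1),
    Observed(1.0, "DATA", 2), Observed(4.0, "ACK", 2),
    Observed(2.0, "DATA", 3),
]
print(ack_deadline_violations(trace, deadline=2.0, now=5.0))
```

The checker needs only the observed messages, not the application's source code, which matches the setting the abstract describes. A protocol-specific rule base would then be a collection of such checks.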


Citations: 19

XNET: a reliable content-based publish/subscribe system

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.

Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353027

Raphaël Chand, P. Felber

Content-based publish/subscribe systems are usually implemented as a network of brokers that collaboratively route messages from information providers to consumers. A major challenge of such middleware infrastructures is their reliability and their ability to cope with failures in the system. In this paper, we present the architecture of the XNET XML content network and we detail the mechanisms that we implemented to gracefully handle failures and keep the system state consistent with the consumer population at all times. In particular, we propose several approaches to fault tolerance so that our system can recover from various types of router and link failures. We analyze the efficiency of our techniques in a large-scale experimental deployment on the PlanetLab testbed. We show that XNET not only offers good performance and scalability with large consumer populations under normal operation, but can also quickly recover from system failures.


Citations: 89
