A practical approach for 'zero' downtime in an operational information system

Proceedings 22nd International Conference on Distributed Computing Systems
Pub Date: 2002-07-02
DOI: 10.1109/ICDCS.2002.1022272

Ada Gavrilovska, K. Schwan, Van Oleson


Citations: 31

Abstract

An operational information system (OIS) supports a real-time view of an organization's information critical to its logistical business operations. A central component of an OIS is an engine that integrates data events captured from distributed, remote sources in order to derive meaningful real-time views of current operations. This event derivation engine (EDE) continuously updates these views and also publishes them to a potentially large number of remote subscribers. The paper first describes a sample OIS and EDE in the context of an airline's operations. It then defines the performance and availability requirements to be met by this system, specifically focusing on the EDE component. One particular requirement for the EDE is that subscribers to its output events should not experience downtime due to EDE failures, crashes or increased processing loads. Toward this end, we develop and evaluate a practical technique for masking failures and for hiding the costs of recovery from EDE subscribers. This technique utilizes redundant EDEs that coordinate view replicas with a relaxed synchronous fault tolerance protocol. A combination of pre- and post-buffering of replicas is used to attain a solution that offers low response times (i.e., 'zero' downtime) while also preventing system failures in the presence of deterministic faults like 'ill-formed' messages. Parallelism realized via a cluster machine and application-specific techniques for reducing synchronization across replicas are used to scale a 'zero' downtime EDE to support the large number of subscribers it must service.
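The abstract does not spell out the protocol, so the following is only a toy sketch of one way a lagging mirror replica could mask a deterministic fault such as an ill-formed message. Everything in it is an assumption made for illustration: the `Replica` class, the `run` function, the `lag` parameter, the event shape (`flight`, `status`), and the policy of dropping the offending event on failover. It is not the authors' implementation.

```python
from collections import deque


class ReplicaCrash(Exception):
    """Raised when a replica 'crashes' on an event it cannot parse."""


class Replica:
    """Hypothetical EDE replica holding a view: flight id -> latest status."""

    def __init__(self, name):
        self.name = name
        self.view = {}

    def apply(self, event):
        # Deterministic fault: any replica fed this event fails the same way.
        if "flight" not in event or "status" not in event:
            raise ReplicaCrash(f"{self.name} failed on {event!r}")
        self.view[event["flight"]] = event["status"]
        return (event["flight"], event["status"])


def run(events, lag=2):
    """Feed events to a primary; the mirror trails it by up to `lag` events."""
    primary, mirror = Replica("primary"), Replica("mirror")
    backlog = deque()  # events the primary survived, not yet applied by the mirror
    published = []     # (sequence number, flight, status) sent to subscribers
    active = primary
    for seq, event in enumerate(events):
        if active is primary:
            try:
                update = primary.apply(event)
            except ReplicaCrash:
                # Fail over: the mirror never sees the poisoned event.
                # It drains the backlog to catch up, then takes over.
                while backlog:
                    mirror.apply(backlog.popleft())
                active = mirror
                continue
            published.append((seq, *update))
            backlog.append(event)
            while len(backlog) > lag:
                mirror.apply(backlog.popleft())
        else:
            # Toy simplification: apply() fails before touching the view,
            # so the sole survivor can simply skip a bad event.
            try:
                published.append((seq, *mirror.apply(event)))
            except ReplicaCrash:
                continue
    return published, active


events = [
    {"flight": "DL1", "status": "boarding"},
    {"flight": "DL2", "status": "delayed"},
    {"bogus": 1},  # ill-formed message
    {"flight": "DL1", "status": "departed"},
]
published, active = run(events)
```

In this sketch, subscribers keep receiving updates for the well-formed events on either side of the bad one, and the mirror ends up as the active replica with a consistent view. Because the mirror trails the primary, the same message does not take down both replicas, which is the property the buffering in the abstract is aiming for.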


