A practical approach for 'zero' downtime in an operational information system

Ada Gavrilovska, K. Schwan, Van Oleson
Proceedings 22nd International Conference on Distributed Computing Systems, published 2002-07-02
DOI: 10.1109/ICDCS.2002.1022272 (https://doi.org/10.1109/ICDCS.2002.1022272)
Citations: 31

Abstract

An operational information system (OIS) supports a real-time view of an organization's information critical to its logistical business operations. A central component of an OIS is an engine that integrates data events captured from distributed, remote sources in order to derive meaningful real-time views of current operations. This event derivation engine (EDE) continuously updates these views and also publishes them to a potentially large number of remote subscribers. The paper first describes a sample OIS and EDE in the context of an airline's operations. It then defines the performance and availability requirements to be met by this system, specifically focusing on the EDE component. One particular requirement for the EDE is that subscribers to its output events should not experience downtime due to EDE failures, crashes or increased processing loads. Toward this end, we develop and evaluate a practical technique for masking failures and for hiding the costs of recovery from EDE subscribers. This technique utilizes redundant EDEs that coordinate view replicas with a relaxed synchronous fault tolerance protocol. A combination of pre- and post-buffering of replicas is used to attain a solution that offers low response times (i.e., 'zero' downtime) while also preventing system failures in the presence of deterministic faults like 'ill-formed' messages. Parallelism realized via a cluster machine and application-specific techniques for reducing synchronization across replicas are used to scale a 'zero' downtime EDE to support the large number of subscribers it must service.
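
The abstract does not spell out the replication protocol, so the following is only a loose sketch of the general idea it describes: a primary EDE and a mirror that deliberately lags behind it, so that a deterministic fault such as an ill-formed message takes down only one replica and subscribers keep receiving views. Every name here (`Replica`, `MirroredEDE`, `lag`, the flight identifiers) is a hypothetical illustration, not the authors' design, and a crash is simulated with an exception rather than a real process failure.

```python
from collections import deque


class IllFormedEvent(Exception):
    """Stands in for a replica crash caused by a malformed input event."""


class Replica:
    """Hypothetical EDE replica holding a view: flight id -> latest status."""

    def __init__(self, name):
        self.name = name
        self.view = {}

    def apply(self, event):
        # Validate before mutating, so a bad event never corrupts the view.
        if "flight" not in event or "status" not in event:
            raise IllFormedEvent(event)
        self.view[event["flight"]] = event["status"]


class MirroredEDE:
    """Assumed scheme: the mirror trails the primary by up to `lag` events."""

    def __init__(self, lag=2):
        self.primary = Replica("primary")
        self.mirror = Replica("mirror")
        self.pending = deque()  # events the primary survived, mirror has not applied
        self.lag = lag
        self.published = []     # view snapshots sent to subscribers
        self.failovers = 0

    def submit(self, event):
        try:
            self.primary.apply(event)
        except IllFormedEvent:
            # The faulty event never reaches the mirror, so the mirror survives.
            self._fail_over()
            return
        self.pending.append(event)
        # Keep the mirror at most `lag` events behind the primary.
        while len(self.pending) > self.lag:
            self.mirror.apply(self.pending.popleft())
        self.published.append(dict(self.primary.view))

    def _fail_over(self):
        # Bring the mirror up to date with every event known to be safe.
        while self.pending:
            self.mirror.apply(self.pending.popleft())
        # Promote the mirror and start a fresh standby from its view.
        self.primary, self.mirror = self.mirror, Replica("standby")
        self.mirror.view = dict(self.primary.view)
        self.failovers += 1


def demo():
    engine = MirroredEDE(lag=2)
    events = [
        {"flight": "DL100", "status": "boarding"},
        {"flight": "DL200", "status": "delayed"},
        {"bogus": 1},  # ill-formed message
        {"flight": "DL100", "status": "departed"},
    ]
    for event in events:
        engine.submit(event)
    return engine


engine = demo()
print(engine.primary.name, engine.primary.view, engine.failovers)
```

In this sketch the lag is what protects the mirror: because an event is forwarded only after the primary has processed it successfully, the ill-formed message is discarded rather than replayed. A real system would also need to detect crashes, restart the failed replica, and bound how stale the mirror may become, none of which is modeled here.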