Efficient and fault-tolerant distributed host monitoring using system-level diagnosis

M. Bearden, R. Bianchini
Proceedings of IFIP/IEEE International Conference on Distributed Platforms. DOI: 10.1109/ICDP.1996.864200
Citations: 8

Abstract

This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes the information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault tolerance is achieved by applying the results of previous system-level diagnosis research. An on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment. DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper are the extension of a distributed system-level diagnosis algorithm to provide reliable broadcast of current global state, and the specification of DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced; it guarantees delivery of the most recently broadcast update without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed. Possible extensions to this work include other fault and system models and application to other distributed systems that require consistent distributed global state.
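The delivery guarantee of condensed reliable broadcast can be illustrated with a minimal sketch. This is not the DSMon algorithm itself, which the abstract does not specify; it only assumes that each update carries a per-source sequence number, and the names `NodeState`, `deliver`, and `demo` are hypothetical. A node that misses intermediate updates still ends up holding the most recent one, which is the property the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class NodeState:
    """Per-node table holding only the newest known update from each source.

    Assumption: updates are tagged with a monotonically increasing
    per-source sequence number (not stated in the abstract).
    """
    latest: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def deliver(self, source: str, seq: int, value: Any) -> bool:
        """Store the update if it is newer than the stored one; return True if stored."""
        current = self.latest.get(source)
        if current is not None and current[0] >= seq:
            return False  # stale or duplicate update: ignore it
        self.latest[source] = (seq, value)
        return True


def demo() -> Dict[str, Tuple[int, Any]]:
    """Show that a node missing intermediate updates converges to the latest state."""
    a, b = NodeState(), NodeState()
    updates = [
        ("host1", 1, {"cpu_load": 0.2}),
        ("host1", 2, {"cpu_load": 0.9}),
        ("host1", 3, {"cpu_load": 0.4}),
    ]
    # Node a receives the full history.
    for update in updates:
        a.deliver(*update)
    # Node b misses updates 1 and 2, receives update 3,
    # and then sees a late copy of update 1, which it discards.
    b.deliver(*updates[2])
    b.deliver(*updates[0])
    assert a.latest == b.latest  # same current state, different histories
    return b.latest
```

Keeping one entry per source bounds the stored state by the number of monitored hosts rather than by the number of updates sent, which fits a monitor that only needs current values such as CPU load.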