Efficient and fault-tolerant distributed host monitoring using system-level diagnosis

Proceedings of IFIP/IEEE International Conference on Distributed Platforms Pub Date : 1900-01-01 DOI:10.1109/ICDP.1996.864200

M. Bearden, R. Bianchini


Citations: 8

Abstract


This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes the information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault-tolerance is achieved by applying the results of previous system-level diagnosis research. An on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment. DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper include the extension of a distributed system-level diagnosis algorithm for reliable broadcast of current global state, and the specification of the DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced for guaranteeing delivery of the most recently broadcast update, without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed. Extensions to this work include other fault and system models and applicability to other distributed applications requiring consistent distributed global state.
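The abstract describes condensed reliable broadcast as guaranteeing delivery of the most recently broadcast update without guaranteeing the full history. The sketch below illustrates that delivery semantics only, under stated assumptions: each update carries a per-source sequence number, and a node keeps just the newest update per source. The names `Update`, `StateTable`, `deliver`, and `merge` are hypothetical and are not taken from DSMon; the diagnosis algorithm and the PMC fault model are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Update:
    source: str    # host that produced the update (hypothetical field)
    seq: int       # per-source sequence number, assumed to increase over time
    payload: dict  # monitored values, e.g. {"cpu_load": 0.3}


@dataclass
class StateTable:
    """Holds only the newest update seen from each source, not a history."""
    latest: Dict[str, Update] = field(default_factory=dict)

    def deliver(self, update: Update) -> bool:
        """Apply an update; return True if it replaced older state."""
        current = self.latest.get(update.source)
        if current is not None and current.seq >= update.seq:
            return False  # stale or duplicate update: discard it
        self.latest[update.source] = update
        return True

    def merge(self, other: "StateTable") -> int:
        """Pull newer entries from another node's table; return how many changed."""
        return sum(self.deliver(u) for u in list(other.latest.values()))


# Node a sees every update from host h1; node b missed all of them.
a, b = StateTable(), StateTable()
for seq in (1, 2, 3):
    a.deliver(Update("h1", seq, {"cpu_load": seq / 10}))

# b catches up with a single entry: the latest state, not the three-step history.
changed = b.merge(a)
```

After the merge, both nodes hold the same current state for `h1`, and any older update that arrives late at `b` is discarded. A conventional reliable broadcast would instead have to deliver updates 1 and 2 to `b` as well.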
