分布式流处理系统的预测性故障管理

Xiaohui Gu, S. Papadimitriou, Philip S. Yu, Shu-Ping Chang
{"title":"分布式流处理系统的预测性故障管理","authors":"Xiaohui Gu, S. Papadimitriou, Philip S. Yu, Shu-Ping Chang","doi":"10.1109/ICDCS.2008.34","DOIUrl":null,"url":null,"abstract":"Distributed stream processing systems (DSPSs) have many important applications such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs that often require highly-available system operations. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive failure management approaches. We employ light-weight stream-based classification methods to perform online failure forecast. Based on the prediction results, the system can take differentiated failure preventions on abnormal components only. Our failure prediction model is tunable, which can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes to adaptively adjust measurement sampling rates based on the states of monitored components, and maintain a limited size of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experiment results show that our system can achieve more efficient failure management than conventional reactive and proactive approaches, while imposing low overhead to the DSPS.","PeriodicalId":240205,"journal":{"name":"2008 The 28th International Conference on Distributed Computing Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"35","resultStr":"{\"title\":\"Toward Predictive Failure Management for Distributed Stream Processing Systems\",\"authors\":\"Xiaohui Gu, S. Papadimitriou, Philip S. Yu, Shu-Ping Chang\",\"doi\":\"10.1109/ICDCS.2008.34\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Distributed stream processing systems (DSPSs) have many important applications such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs that often require highly-available system operations. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive failure management approaches. We employ light-weight stream-based classification methods to perform online failure forecast. Based on the prediction results, the system can take differentiated failure preventions on abnormal components only. Our failure prediction model is tunable, which can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes to adaptively adjust measurement sampling rates based on the states of monitored components, and maintain a limited size of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experiment results show that our system can achieve more efficient failure management than conventional reactive and proactive approaches, while imposing low overhead to the DSPS.\",\"PeriodicalId\":240205,\"journal\":{\"name\":\"2008 The 28th International Conference on Distributed Computing Systems\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-06-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"35\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 The 28th International Conference on Distributed Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCS.2008.34\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 The 28th International Conference on Distributed Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2008.34","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 35

摘要

分布式流处理系统在传感器数据分析、网络安全、商业智能等方面有着重要的应用。故障管理对于通常需要高可用性系统操作的dsp是必不可少的。在本文中,我们探索了一种新的预测性故障管理方法,该方法采用在线故障预测来实现比以前的被动或主动故障管理方法更有效的故障管理。我们采用轻量级的基于流的分类方法来进行在线故障预测。根据预测结果,系统可对异常部件采取差异化的故障预防措施。我们的故障预测模型是可调的,它可以基于用户自定义的奖励函数在故障惩罚减少和预防成本之间实现理想的权衡。为了实现低开销的在线学习,我们提出了自适应数据流采样方案,根据被监测组件的状态自适应调整测量采样率,并使用储层采样保持有限规模的历史训练数据。我们已经在IBM System S分布式流处理系统中实现了预测故障管理框架的初始原型。实验结果表明,与传统的被动和主动方法相比,该系统可以实现更有效的故障管理,同时降低了dsp的开销。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Toward Predictive Failure Management for Distributed Stream Processing Systems
Distributed stream processing systems (DSPSs) have many important applications such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs that often require highly-available system operations. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive failure management approaches. We employ light-weight stream-based classification methods to perform online failure forecast. Based on the prediction results, the system can take differentiated failure preventions on abnormal components only. Our failure prediction model is tunable, which can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes to adaptively adjust measurement sampling rates based on the states of monitored components, and maintain a limited size of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experiment results show that our system can achieve more efficient failure management than conventional reactive and proactive approaches, while imposing low overhead to the DSPS.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Relative Network Positioning via CDN Redirections Compiler-Assisted Application-Level Checkpointing for MPI Programs Exploring Anti-Spam Models in Large Scale VoIP Systems Correlation-Aware Object Placement for Multi-Object Operations Probing Queries in Wireless Sensor Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1