Proactive detection of software aging mechanisms in performance critical computers

K. Gross, V. Bhardwaj, R. Bickford
27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings.
Published: 2002-12-05
DOI: 10.1109/SEW.2002.1199445 (https://doi.org/10.1109/SEW.2002.1199445)
Citations: 50

Abstract

Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded development of an advanced statistical pattern recognition method called the multivariate state estimation technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present study was undertaken to assess the feasibility and practicability of applying MSET to real-time proactive detection of software aging mechanisms in complex, multi-CPU servers. The procedure uses MSET for model-based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A real-time software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET reliably detected the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business-critical eCommerce servers are described.
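
The pipeline summarized above (estimate each metric from a model of healthy behavior, then test the residuals statistically) can be illustrated with a small sketch. This is not the authors' MSET implementation, which the abstract does not specify. As stand-ins it assumes Nadaraya-Watson kernel regression for the estimation step and Wald's sequential probability ratio test (SPRT) for the fault detection step. All function names, the synthetic load and memory data, the leak rate, and the tuning parameters are hypothetical.

```python
import math


def kernel_estimate(query, memory_inputs, memory_outputs, bandwidth):
    """Nadaraya-Watson estimate of one metric from healthy training examples.

    Stand-in for the model-based estimation step; this is NOT MSET itself.
    """
    d2 = [sum((q - m) ** 2 for q, m in zip(query, row)) for row in memory_inputs]
    d2_min = min(d2)  # shift for numerical stability; normalized weights unchanged
    weights = [math.exp(-(d - d2_min) / (2.0 * bandwidth ** 2)) for d in d2]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, memory_outputs)) / total


def sprt_alarm_index(residuals, shift, sigma, alpha=0.01, beta=0.01):
    """Wald SPRT for a positive mean shift in Gaussian residuals.

    H0: mean 0, H1: mean `shift`, common standard deviation `sigma`.
    Returns the index of the first alarm, or None if no alarm is raised.
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing this accepts H1 (alarm)
    lower = math.log(beta / (1.0 - alpha))   # crossing this accepts H0 (restart)
    llr = 0.0
    for i, r in enumerate(residuals):
        llr += (shift / sigma ** 2) * (r - shift / 2.0)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0  # healthy decision reached; restart the test
    return None


# Hypothetical healthy training data: memory use tracks load (arbitrary units).
train_load = [(i / 10.0,) for i in range(11)]
train_mem = [2.0 * x[0] for x in train_load]


def residual_stream(leak_per_sample, n=300):
    """Residuals (observed minus estimated memory) with an injected linear leak."""
    out = []
    for t in range(n):
        load = (t % 11) / 10.0
        observed_mem = 2.0 * load + leak_per_sample * t
        estimate = kernel_estimate((load,), train_load, train_mem, bandwidth=0.05)
        out.append(observed_mem - estimate)
    return out


healthy_alarm = sprt_alarm_index(residual_stream(0.0), shift=0.2, sigma=0.05)
leak_alarm = sprt_alarm_index(residual_stream(0.002), shift=0.2, sigma=0.05)
print(healthy_alarm, leak_alarm)
```

In this toy setup the healthy stream should raise no alarm, while the stream with the injected leak should trip the test once the residual drifts past roughly half the assumed shift. A sequential test is used here because it accumulates evidence across samples, so a slow drift can be flagged without a fixed threshold on any single reading.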