Identifying silent failures of SaaS services using finite state machine based invariant analysis

Geetika Goel, A. Roy, R. Ganesan
DOI: 10.1109/ISSREW.2013.6688909
Published in: 2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)
Publication date: 2013-11-01
Link: https://doi.org/10.1109/ISSREW.2013.6688909
Citations: 3

Abstract

Field failure analysis is usually driven by a characterization of the time-related properties of failures. This characterization does not help the production support team understand the root cause. To pinpoint the root cause of a failure, one of the most effective techniques is to check for violations of the system invariants, which are the consistent, time-invariant correlations that exist in the system. Understanding when and where these violations happen helps in detecting the root cause of the failure. Silent failures, on the other hand, are characterized by no evidence of failure in either the console or the field failure logs. They are unearthed at moments of crisis, either through a customer complaint or through other cascading failures. These failures often result in data loss or data corruption, creating many latent errors. Accumulation of these errors over time results in degraded system performance. This is the problem of software aging, and restoring the system, i.e. its rejuvenation, becomes a critical need. After the restoration, a rigorous failure detection mechanism is needed to detect such failures early. This paper describes a novel method for detecting silent failures that combines invariant violation checking with finite state machine (FSM) based analysis of the system. We use the audit-trail logs of the system to extract information about the states and transitions for the FSM representation. So far, our research work has been limited to proving the method's efficiency. We applied this approach to our SaaS platform and were able to detect 36 silent failures over a period of 9 months. As next steps, we will implement this as part of automated failure detection in the operational SaaS platforms.
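
The abstract gives no implementation details, so the following is only a minimal sketch of how FSM analysis and invariant checking over audit-trail logs might be combined. Everything concrete here is an assumption made for illustration and is not taken from the paper: the order-lifecycle states, the `ALLOWED` transition table, the `(entity_id, new_state)` log record shape, the "every entity ends in a terminal state" invariant, and the function name `find_silent_failures`.

```python
# Hypothetical order lifecycle; the paper does not publish its state model.
INITIAL = "CREATED"
ALLOWED = {
    "CREATED": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED", "REFUNDED"},
    "SHIPPED": {"DELIVERED"},
}
TERMINAL = {"CANCELLED", "REFUNDED", "DELIVERED"}


def find_silent_failures(audit_log):
    """Replay audit-trail records against the FSM and report violations.

    audit_log: iterable of (entity_id, new_state) tuples in chronological
    order (an assumed record shape). Returns a list of
    (entity_id, reason, from_state, to_state) tuples.
    """
    current = {}      # entity_id -> last observed state
    violations = []

    for entity_id, state in audit_log:
        prev = current.get(entity_id)
        if prev is None:
            # First record for this entity must be the FSM's initial state.
            if state != INITIAL:
                violations.append((entity_id, "bad initial state", None, state))
        elif state not in ALLOWED.get(prev, set()):
            # The log shows a transition the FSM does not permit.
            violations.append((entity_id, "illegal transition", prev, state))
        current[entity_id] = state

    # Assumed invariant: by the end of the analysed window, every entity
    # has reached a terminal state. Entities that have not are flagged.
    for entity_id, state in current.items():
        if state not in TERMINAL:
            violations.append((entity_id, "stuck in non-terminal state", state, None))

    return violations


if __name__ == "__main__":
    log = [
        ("o1", "CREATED"), ("o1", "PAID"), ("o1", "SHIPPED"), ("o1", "DELIVERED"),
        ("o2", "CREATED"), ("o2", "SHIPPED"),   # skips PAID
        ("o3", "CREATED"), ("o3", "PAID"),      # never completes
    ]
    for violation in find_silent_failures(log):
        print(violation)
```

In this sketch, none of the flagged records would appear in an error log, since each write succeeded on its own; the inconsistency only becomes visible when the sequence is replayed against the expected state machine. A real deployment would need to choose the analysis window carefully, because an entity that is merely still in progress would otherwise be reported as stuck.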