{"title":"使用基于不变量分析的有限状态机识别SaaS服务的静默故障","authors":"Geetika Goel, A. Roy, R. Ganesan","doi":"10.1109/ISSREW.2013.6688909","DOIUrl":null,"url":null,"abstract":"Field failure analysis is usually driven by a characterization of the different time related properties of failure. This characterization does not help the production support team in understanding the root cause. In order to pinpoint the root cause of failure, one of the most effective techniques used is checking for violations of the system invariants which are the consistent, time invariant correlations that exist in the system. Understanding when and where these violations happen helps in detecting the root cause of the failure. Silent failures, on the other hand are characterized by no evidence of failures either in the console or in the field failure logs. They are unearthed at moments of crisis, either with a customer complaint or other cascading failures. These failures often result in data loss or data corruption, creating many latent errors. Accumulation of these errors over time results in degraded system performance. This represents the problem of software aging and restoration of the system, i.e. its rejuvenation becomes a critical need. Subsequent to the restoration, a rigorous failure detection mechanism is needed to detect them early. What we describe in the paper is a novel method that could be used to detect silent failures using a combination of invariant violation checking and finite state machine based analysis of the system. We use the audit-trail logs of system to extract information about the state and transitions for FSM representation. Currently our research work was limited to proving its efficiency. We applied this approach to our SaaS platform and were able to detect 36 silent failures over a period of 9 months. As next steps, we will implement this as a part of automated failure detection in the operational SaaS platforms.","PeriodicalId":332420,"journal":{"name":"2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Identifying silent failures of SaaS services using finite state machine based invariant analysis\",\"authors\":\"Geetika Goel, A. Roy, R. Ganesan\",\"doi\":\"10.1109/ISSREW.2013.6688909\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Field failure analysis is usually driven by a characterization of the different time related properties of failure. This characterization does not help the production support team in understanding the root cause. In order to pinpoint the root cause of failure, one of the most effective techniques used is checking for violations of the system invariants which are the consistent, time invariant correlations that exist in the system. Understanding when and where these violations happen helps in detecting the root cause of the failure. Silent failures, on the other hand are characterized by no evidence of failures either in the console or in the field failure logs. They are unearthed at moments of crisis, either with a customer complaint or other cascading failures. These failures often result in data loss or data corruption, creating many latent errors. Accumulation of these errors over time results in degraded system performance. This represents the problem of software aging and restoration of the system, i.e. its rejuvenation becomes a critical need. Subsequent to the restoration, a rigorous failure detection mechanism is needed to detect them early. What we describe in the paper is a novel method that could be used to detect silent failures using a combination of invariant violation checking and finite state machine based analysis of the system. We use the audit-trail logs of system to extract information about the state and transitions for FSM representation. Currently our research work was limited to proving its efficiency. We applied this approach to our SaaS platform and were able to detect 36 silent failures over a period of 9 months. As next steps, we will implement this as a part of automated failure detection in the operational SaaS platforms.\",\"PeriodicalId\":332420,\"journal\":{\"name\":\"2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISSREW.2013.6688909\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSREW.2013.6688909","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Identifying silent failures of SaaS services using finite state machine based invariant analysis
Field failure analysis is usually driven by a characterization of the different time related properties of failure. This characterization does not help the production support team in understanding the root cause. In order to pinpoint the root cause of failure, one of the most effective techniques used is checking for violations of the system invariants which are the consistent, time invariant correlations that exist in the system. Understanding when and where these violations happen helps in detecting the root cause of the failure. Silent failures, on the other hand are characterized by no evidence of failures either in the console or in the field failure logs. They are unearthed at moments of crisis, either with a customer complaint or other cascading failures. These failures often result in data loss or data corruption, creating many latent errors. Accumulation of these errors over time results in degraded system performance. This represents the problem of software aging and restoration of the system, i.e. its rejuvenation becomes a critical need. Subsequent to the restoration, a rigorous failure detection mechanism is needed to detect them early. What we describe in the paper is a novel method that could be used to detect silent failures using a combination of invariant violation checking and finite state machine based analysis of the system. We use the audit-trail logs of system to extract information about the state and transitions for FSM representation. Currently our research work was limited to proving its efficiency. We applied this approach to our SaaS platform and were able to detect 36 silent failures over a period of 9 months. As next steps, we will implement this as a part of automated failure detection in the operational SaaS platforms.