Optimal checkpointing of fault tolerant systems subject to correlated failure
Authors: Bentolhoda Jafary, L. Fiondella
Published in: 2017 Annual Reliability and Maintainability Symposium (RAMS)
DOI: 10.1109/RAM.2017.7889756 (https://doi.org/10.1109/RAM.2017.7889756)
Checkpointing backs up work at periodic intervals so that, if a computation fails, it can resume from the latest checkpoint instead of restarting from the beginning. Checkpointing operations themselves take time, so there is a tradeoff between the time spent performing checkpoints and the time saved when computation restarts from one. This paper presents a method to model the impact of correlated failures on a system that performs checkpointing. We map the checkpointing process to a state space model and superimpose a correlated life distribution. Examples illustrate that the model identifies the optimal number of checkpoints despite the negative impact of correlation on system reliability.
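
The tradeoff described in the abstract can be illustrated with a minimal sketch. It is not the paper's state space model and it does not capture correlated failures: it assumes independent failures arriving at a constant rate (exponential lifetimes), equal-length segments, and a checkpoint written at the end of every segment, including the last. The function names and all parameters (`work`, `checkpoint_cost`, `failure_rate`, `restart_cost`) are hypothetical and chosen only for illustration.

```python
import math


def expected_completion_time(work, n_segments, checkpoint_cost,
                             failure_rate, restart_cost=0.0):
    """Expected total time to finish `work` split into `n_segments` pieces.

    Assumptions (illustrative, not taken from the paper):
    - failures are independent with constant rate `failure_rate` > 0;
    - a failure discards only the current segment, which is then retried;
    - each segment ends with a checkpoint costing `checkpoint_cost`;
    - `restart_cost` is failure-free downtime paid after each failure.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    # Uninterrupted time needed to commit one segment.
    segment = work / n_segments + checkpoint_cost
    # Standard result for retry-until-success under exponential failures:
    # E[T] = (1/rate + restart) * (exp(rate * segment) - 1).
    per_segment = (1.0 / failure_rate + restart_cost) * math.expm1(
        failure_rate * segment)
    return n_segments * per_segment


def optimal_segments(work, checkpoint_cost, failure_rate,
                     restart_cost=0.0, max_segments=1000):
    """Brute-force search for the segment count with least expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_completion_time(
            work, n, checkpoint_cost, failure_rate, restart_cost),
    )


if __name__ == "__main__":
    best = optimal_segments(work=100.0, checkpoint_cost=1.0, failure_rate=0.01)
    print(best, expected_completion_time(100.0, best, 1.0, 0.01))
```

Too few segments make each failure expensive, while too many segments accumulate checkpoint overhead, so the expected time has an interior minimum whenever failures are frequent enough. Under correlated failures, which this sketch ignores, the optimum would need to be recomputed with the paper's model.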