{"title":"基于重提交影响的高分布环境下科学工作流容错启发式算法","authors":"Kassian Plankensteiner, R. Prodan, T. Fahringer","doi":"10.1109/e-Science.2009.51","DOIUrl":null,"url":null,"abstract":"Even though highly distributed environments such as Clouds and Grids are increasingly used for e-Science high performance applications, they still cannot deliver the robustness and reliability needed for widespread acceptance as ubiquitous scientific tools. To overcome this problem, existing systems resort to fault tolerance mechanisms such as task replication and task resubmission. In this paper we propose a new heuristic called Resubmission Impact to enhance the fault tolerance support for scientific workflows in highly distributed systems. In contrast to related approaches, our method can be used effectively on systems even in the absence of historic failure trace data. Simulated experiments of three real scientific workflows in the Austrian Grid environment show that our algorithm drastically reduces the resource waste compared to conservative task replication and resubmission techniques, while having a comparable execution performance and only a slight decrease in the success probability.","PeriodicalId":325840,"journal":{"name":"2009 Fifth IEEE International Conference on e-Science","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":"{\"title\":\"A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact\",\"authors\":\"Kassian Plankensteiner, R. Prodan, T. Fahringer\",\"doi\":\"10.1109/e-Science.2009.51\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Even though highly distributed environments such as Clouds and Grids are increasingly used for e-Science high performance applications, they still cannot deliver the robustness and reliability needed for widespread acceptance as ubiquitous scientific tools. To overcome this problem, existing systems resort to fault tolerance mechanisms such as task replication and task resubmission. In this paper we propose a new heuristic called Resubmission Impact to enhance the fault tolerance support for scientific workflows in highly distributed systems. In contrast to related approaches, our method can be used effectively on systems even in the absence of historic failure trace data. Simulated experiments of three real scientific workflows in the Austrian Grid environment show that our algorithm drastically reduces the resource waste compared to conservative task replication and resubmission techniques, while having a comparable execution performance and only a slight decrease in the success probability.\",\"PeriodicalId\":325840,\"journal\":{\"name\":\"2009 Fifth IEEE International Conference on e-Science\",\"volume\":\"51 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"48\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 Fifth IEEE International Conference on e-Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/e-Science.2009.51\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 Fifth IEEE International Conference on e-Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/e-Science.2009.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact
Even though highly distributed environments such as Clouds and Grids are increasingly used for e-Science high performance applications, they still cannot deliver the robustness and reliability needed for widespread acceptance as ubiquitous scientific tools. To overcome this problem, existing systems resort to fault tolerance mechanisms such as task replication and task resubmission. In this paper we propose a new heuristic called Resubmission Impact to enhance the fault tolerance support for scientific workflows in highly distributed systems. In contrast to related approaches, our method can be used effectively on systems even in the absence of historic failure trace data. Simulated experiments of three real scientific workflows in the Austrian Grid environment show that our algorithm drastically reduces the resource waste compared to conservative task replication and resubmission techniques, while having a comparable execution performance and only a slight decrease in the success probability.