Fault-Tolerant Orchestration of Bags-of-Tasks with Application-Directed Checkpointing in a Distributed Environment
Authors: Georgios L. Stavrinides, H. Karatza
Published in: 2021 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI)
Publication date: 2021-10-15
DOI: 10.1109/CCCI52664.2021.9583187 (https://doi.org/10.1109/CCCI52664.2021.9583187)
Citations: 1
Abstract
A wide spectrum of applications, ranging from big data analytics to financial risk modeling and genomics, feature a high degree of parallelism, forming bags-of-tasks. Such applications are typically processed on distributed resources and are often prone to transient software failures. Consequently, load balancing and fault tolerance are two crucial aspects of such environments. In this paper, we consider bag-of-tasks jobs that utilize application-directed checkpointing. According to this technique, each component task is responsible for checkpointing its own progress at regular intervals during its execution. When a failure occurs, the affected task is rolled back to its most recent checkpoint and resumes execution. In order to investigate the impact of transient software failures on the performance of the system, we employ four resource allocation strategies, two well-known and two novel ones. The routing policies are compared through simulation experiments, under different task failure probabilities and load cases.
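The checkpoint-and-rollback behavior described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' simulation model: the function name `run_with_checkpoints`, its parameters, the fixed `checkpoint_cost`, and the choice to specify failures as explicit progress points (rather than drawing them from a failure probability) are all assumptions made here for illustration only.

```python
def run_with_checkpoints(work, interval, failures, checkpoint_cost=0.0):
    """Return the total time a single task needs to complete `work` units.

    Hypothetical model (an assumption, not taken from the paper):
    - the task saves its own progress every `interval` units of work;
    - `failures` lists progress points at which a transient failure
      strikes once each;
    - on failure the task rolls back to its most recent checkpoint and
      repeats the lost work;
    - each checkpoint costs `checkpoint_cost` time units.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    pending = sorted(failures)
    saved = 0.0    # progress stored in the most recent checkpoint
    elapsed = 0.0  # total time spent, including repeated work
    while saved < work:
        target = min(saved + interval, work)
        if pending and pending[0] < target:
            # Failure before the next checkpoint: the work done since the
            # last checkpoint is lost and the task restarts from `saved`.
            elapsed += pending.pop(0) - saved
        else:
            elapsed += target - saved
            if target < work:
                elapsed += checkpoint_cost  # no checkpoint needed at the end
            saved = target
    return elapsed


# One failure at progress point 5 in a task of 10 work units.
print(run_with_checkpoints(10, 2, [5]))   # 11.0: only 1 unit is repeated
print(run_with_checkpoints(10, 10, [5]))  # 15.0: no intermediate checkpoint, 5 units repeated
```

Under these assumptions, a shorter checkpoint interval reduces the work lost per failure but adds more checkpoint overhead whenever `checkpoint_cost` is nonzero.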