Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems
R. Camargo, Renato Cerqueira, Fabio Kon
Middleware for Grid Computing, 2005-11-28
DOI: 10.1145/1101499.1101500 (https://doi.org/10.1145/1101499.1101500)
Citation count: 20
Abstract
Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints may be taken to guarantee application progress, producing even more data. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high-speed networks. In the case of Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion.

In this work, we evaluate several strategies for storing checkpoints on distributed non-dedicated repositories. We consider the tradeoff among computational overhead, storage overhead, and degree of fault tolerance of these strategies. We compare the use of replication, parity information, and the information dispersal algorithm (IDA). We used InteGrade, an object-oriented Grid middleware, to implement the storage strategies and perform evaluation experiments.
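To make the tradeoff named in the abstract concrete, the following is a minimal Python sketch of the parity strategy: a checkpoint is split into `m` fragments plus one XOR parity fragment, so any single lost fragment can be rebuilt. It is an illustration only, not InteGrade's implementation. All function names, the zero-padding scheme, and the choice of a single parity fragment are assumptions made for this example. The overhead helpers use the standard ratios: `r` full replicas store `r` times the data, single parity over `m` fragments stores `(m + 1) / m`, and an IDA that splits data into `n` fragments of which any `m` suffice stores `n / m`.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_parity(data: bytes, m: int) -> List[bytes]:
    """Split data into m equal-size fragments (zero-padded) plus one XOR parity fragment.

    Hypothetical helper for illustration; not taken from the paper.
    """
    frag_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(frag_len * m, b"\x00")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data when at most one fragment (data or parity) is None."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment tolerates only one lost fragment")
    restored = list(fragments)
    if missing:
        present = [f for f in fragments if f is not None]
        # XOR of all surviving fragments equals the missing one.
        restored[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity fragment and the zero padding.
    return b"".join(restored[:-1])[:original_len]


def overhead_replication(copies: int) -> float:
    """Stored bytes per checkpoint byte when keeping `copies` full replicas."""
    return float(copies)


def overhead_parity(m: int) -> float:
    """Stored bytes per checkpoint byte for m data fragments plus one parity fragment."""
    return (m + 1) / m


def overhead_ida(m: int, n: int) -> float:
    """Stored bytes per checkpoint byte for an IDA where any m of n fragments suffice."""
    return n / m


if __name__ == "__main__":
    checkpoint = b"checkpoint state " * 10
    pieces = encode_with_parity(checkpoint, m=4)
    pieces[2] = None  # simulate one repository node leaving the Grid
    assert recover(pieces, len(checkpoint)) == checkpoint
    print(overhead_replication(2), overhead_parity(4), overhead_ida(4, 6))
```

In this sketch, two full replicas and 4+1 parity both survive the loss of one repository node, but parity stores 1.25 times the checkpoint size instead of 2 times, at the cost of computing the XOR. An IDA generalizes this by tolerating `n - m` losses for a storage factor of `n / m`, with a higher encoding cost.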