RAS by the Yard
A. Wood, S. Nathan
37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), 2007-06-25
DOI: https://doi.org/10.1109/DSN.2007.80
Citation count: 1
Abstract
Different applications require different levels of fault tolerance. Therefore, it is important to create a flexible architecture that allows a customer to choose the appropriate amount of fault tolerance, a concept we call "RAS by the yard." In this paper, we describe a next-generation supercomputer and the design flexibility that allows us to offer a range of alternatives for RAS (reliability, availability, serviceability). In particular, we explain how checkpointing can provide an availability continuum. Design alternatives that improve RAS may be expensive, so it is important to do cost/benefit studies of the alternatives. For a fixed budget and specified system balance ratios, such as Bytes/FLOPS, we analyze the system performance impact of alternative RAS strategies. We show how to optimize the amount of RAS purchased by using a performability measure.