广义笛卡儿分布的二部匹配容忍相关失效

ACM International Conference on Computing Frontiers Pub Date : 2011-05-03 DOI:10.1145/2016604.2016649

N. Ali, S. Krishnamoorthy, M. Halappanavar, J. Daily

{"title":"广义笛卡儿分布的二部匹配容忍相关失效","authors":"N. Ali, S. Krishnamoorthy, M. Halappanavar, J. Daily","doi":"10.1145/2016604.2016649","DOIUrl":null,"url":null,"abstract":"Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"147 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Tolerating correlated failures for generalized Cartesian distributions via bipartite matching\",\"authors\":\"N. Ali, S. Krishnamoorthy, M. Halappanavar, J. Daily\",\"doi\":\"10.1145/2016604.2016649\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.\",\"PeriodicalId\":430420,\"journal\":{\"name\":\"ACM International Conference on Computing Frontiers\",\"volume\":\"147 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-05-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM International Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2016604.2016649\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2016604.2016649","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

摘要

预计故障将在如何设计算法和应用程序以在未来的极端规模系统上运行方面发挥越来越重要的作用。基于算法的容错(ABFT)是一种很有前途的方法，它涉及修改算法，以比复制存储更低的开销从故障中恢复，并且与检查点重新启动技术相比，显著减少了丢失的工作。容错线性代数(FTLA)算法采用额外的处理器，这些处理器沿着矩阵的维度存储奇偶，以容忍多个同时发生的错误。现有的方法假设有规则的数据分布(阻塞或块循环)，每个数据块的故障是独立的。为了匹配并行计算机上的故障特征，我们将这些方法扩展到以几种重要方式映射奇偶校验块。首先，我们处理广义笛卡尔数据分布的奇偶性计算，每个处理器持有笛卡尔分布数组中的任意块子集。其次，介绍了处理相关故障的技术，即可能同时发生故障的多个处理器。第三，我们处理奇偶校验块与数据块的并置，并且不要求它们在额外的处理器上。提出了几种基于图匹配的替代方法，这些方法试图平衡处理器上的内存开销，同时保证与现有方法相同的容错特性，这些方法假设常规阻塞数据分布上的独立故障。对这些算法的评估表明，所提出的方法以最小的开销提供了额外的理想性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Tolerating correlated failures for generalized Cartesian distributions via bipartite matching

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM International Conference on Computing Frontiers

自引率

0.00%

发文量