On-demand fault-tolerant loop processing on massively parallel processor arrays
Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 194-201, published 2015-07-27. DOI: 10.1109/ASAP.2015.7245734
We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.
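As an illustration of the source-level replication scheme the abstract describes, the following is a minimal sketch in C, assuming a simple element-wise parallel loop. The loop body, the majority() voter, and the per-iteration voting placement are hypothetical choices for illustration only; on an actual processor array, each replica would be mapped to its own array region by the compiler rather than executed inline.

```c
/* Minimal sketch of source-level TMR for a parallel loop (assumed example,
 * not taken from the paper): the loop body is replicated three times and a
 * majority vote is taken once per iteration. */
#include <stdio.h>

#define N 8

/* Majority voter: returns the value agreed on by at least two replicas. */
static int majority(int a, int b, int c) {
    return (a == b || a == c) ? a : b;
}

int main(void) {
    int x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[N];

    for (int i = 0; i < N; i++) {
        /* Three replicas of the loop body (TMR); for DMR only two replicas
         * and a comparison for error detection would be generated. */
        int r0 = 2 * x[i] + 1;
        int r1 = 2 * x[i] + 1;
        int r2 = 2 * x[i] + 1;

        /* One of several possible voting placements: vote on every
         * iteration's result, trading overhead for low detection latency. */
        y[i] = majority(r0, r1, r2);
    }

    for (int i = 0; i < N; i++)
        printf("%d ", y[i]);
    printf("\n");
    return 0;
}
```

Voting after every iteration minimizes error detection latency at the cost of more voting overhead; the paper's other variants would vote less frequently (e.g., only on final results), which is the tradeoff explored in its evaluation.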