2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS): Latest Publications


[Copyright notice]


Pub Date : 2019-11-01 DOI: 10.1109/ftxs49593.2019.00002

Citations: 0

Enforcing Crash Consistency of Scientific Applications in Non-Volatile Main Memory Systems


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00007

Tyler Coy, Xuechen Zhang

To fully leverage emerging non-volatile main memory (NVMM) for high-performance computing, programmers need efficient data structures that are aware of NVMM memory models and provide crash consistency. Creating NVMM-aware persistent data structures by hand requires a deep understanding of how to create persistent snapshots of the memory objects that back those data structures, along with substantial code modification, which makes the manual approach very difficult even for experienced programmers. To simplify the process, we design a compiler-assisted technique, NVPath. With the aid of the compiler, it automatically generates NVMM-aware persistent data structures that provide the same level of crash-consistency guarantee as the baseline code. Compiler-assisted code annotation and transformation are general and can be applied to applications using various data structures. Furthermore, it is a gray-box technique that requires minimal user input. Finally, it preserves the baseline code structure for good readability and maintainability. Our experimental results with real-world scientific applications (e.g., matrix multiplication, LU decomposition, adaptive mesh refinement, and page ranking) show that, on the Titan supercomputer, the performance of annotated programs is commensurate with that of versions produced by manual code transformation.


Citations: 0

Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00009

C. Pachajoa, Christina Pacher, W. Gansterer

As HPC systems grow in scale to meet increased computational demands, the incidence of faults in a given window of time is expected to grow. The scientific community addresses this issue with research on solutions in every computational layer. In this paper, we explore strategies for fault tolerance at the algorithmic level. We propose a node-failure-tolerant preconditioned conjugate gradient method, which is able to efficiently recover from node failures without the use of extra spare nodes, i.e., without any overhead in terms of available hardware. For purposes of load balancing, we redistribute the surviving and reconstructed solver data. The objective is to reconstruct the system either as it was before the node failure, or as an equivalent, permuted version, and then continue the execution of the solver only on the surviving nodes. In our experimental evaluations, the recovery stage of the solver typically takes around 10% or less of the solver runtime, including the time to retrieve the problem-defining static data from the hard disk; when using a suitable preconditioner, the average solver runtime overhead is 3.5% over that of a resilient solver that uses a replacement node. We investigate the influence of the preconditioner on a trade-off between load balancing and communication cost in the recovery phase. The obtained solutions are correct, and our method is thus a feasible way to recover from a node failure and continue the execution of the solver only on the surviving nodes.
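For orientation, the solver that the abstract builds on is the standard preconditioned conjugate gradient (PCG) iteration. The following is a minimal serial sketch in plain Python, assuming a small dense symmetric positive definite matrix and a Jacobi (diagonal) preconditioner. The function names `pcg`, `matvec`, and `dot` are illustrative choices, and the sketch does not implement the node-failure recovery or data redistribution described in the paper.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A with Jacobi-preconditioned CG."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1 (assumption)
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for the initial guess x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # keeps the new direction A-conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    # The exact solution of this 2x2 system is [1/11, 7/11].
    print(pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

In a distributed setting, the vectors `x`, `r`, `z`, and `p` would be partitioned across nodes, and these are the solver data that must be reconstructed and redistributed after a node failure.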


Citations: 7

Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00006

Nuria Losada, Aurélien Bouteiller, G. Bosilca

With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility of overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
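The idea of receiver-driven replay from sender-side message logs can be illustrated with a toy, in-memory model. This is only a sketch under strong assumptions: there is no MPI, no collective communication, and no real remote-memory access. The `Process` class and its methods are hypothetical names, and reading a peer's `sent_log` directly stands in for the one-sided log access mentioned in the abstract.

```python
from collections import defaultdict


class Process:
    """Toy process that logs what it sends and counts what it has consumed."""

    def __init__(self, rank):
        self.rank = rank
        self.sent_log = defaultdict(list)  # destination rank -> payloads in send order
        self.received = defaultdict(int)   # source rank -> number of messages consumed
        self.state = 0                     # toy application state: a running sum

    def send(self, dest, payload):
        self.sent_log[dest.rank].append(payload)  # sender-side message log
        dest.deliver(self.rank, payload)

    def deliver(self, src, payload):
        self.received[src] += 1
        self.state += payload

    def checkpoint(self):
        return {"state": self.state, "received": dict(self.received)}

    def restore(self, ckpt):
        self.state = ckpt["state"]
        self.received = defaultdict(int, ckpt["received"])

    def replay_from(self, survivors):
        """Receiver-driven replay: the restarted process pulls the messages it
        is missing from each survivor's log; survivors take no action."""
        for peer in survivors:
            log = peer.sent_log.get(self.rank, [])
            for payload in log[self.received[peer.rank]:]:
                self.deliver(peer.rank, payload)


if __name__ == "__main__":
    p0, p1, p2 = Process(0), Process(1), Process(2)
    p0.send(p2, 1)
    ckpt = p2.checkpoint()     # p2 checkpoints after consuming one message
    p1.send(p2, 10)
    p0.send(p2, 100)
    before_failure = p2.state
    p2.restore(ckpt)           # p2 fails and rolls back locally
    p2.replay_from([p0, p1])   # only p2 does work during recovery
    print(before_failure, p2.state)
```

The sketch preserves only per-sender message order, which is sufficient here because the toy state is a commutative sum. A real protocol must also handle ordering across senders and the replay of collective operations.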


Citations: 2

Self-stabilizing Connected Components


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00011

Piyush Sao, C. Engelmann, Srinivas Eswar, Oded Green, R. Vuduc

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore, on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs $O(V \log V)$ additional computation and requires $O(V)$ additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional label propagation). When run against a battery of simulated fault injection tests, the self-stabilizing label-propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR, as it requires fewer iterations in total. Beyond their fault-tolerance properties, we believe self-stabilizing label-propagation algorithms are useful from a theoretical perspective and may have other use cases.
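To make the notions of valid and invalid states concrete, here is a minimal sketch of conventional min-label propagation with one simple validity check added. It is not the paper's algorithm. It assumes labels are initialized to vertex ids, so a label at vertex `v` can only be valid if it lies in the range `0..v`. The function names `sanitize` and `connected_components` are illustrative.

```python
def sanitize(labels):
    """Reset labels that cannot occur in a fault-free run (label outside 0..v)."""
    for v in range(len(labels)):
        if not (0 <= labels[v] <= v):
            labels[v] = v


def connected_components(n, edges, labels=None):
    """Min-label propagation; `labels` may be a possibly corrupted starting state."""
    labels = list(range(n)) if labels is None else list(labels)
    changed = True
    while changed:
        sanitize(labels)                     # verify and correct before each sweep
        changed = False
        for u, v in edges:
            m = min(labels[u], labels[v])    # both endpoints adopt the smaller label
            if labels[u] != m or labels[v] != m:
                labels[u] = labels[v] = m
                changed = True
    return labels


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (3, 4)]
    print(connected_components(5, edges))                            # fault-free start
    print(connected_components(5, edges, labels=[0, 7, 2, 3, 99]))   # detectable corruption
```

This range check catches only part of the invalid states. A corrupted label that is still within range but belongs to a different component (for example, starting from `[0, 1, 2, 3, 0]` on the graph above) passes the check and merges the two components incorrectly, which is why a more thorough analysis of valid and invalid states is needed.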


Citations: 0

FaultSight: A Fault Analysis Tool for HPC Researchers


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00008

Einar Horn, Dakota Fulp, Jon C. Calhoun, Luke N. Olson

System reliability is expected to be a significant challenge for future extreme-scale systems. Poor reliability results in a higher frequency of interruptions in high-performance computing (HPC) applications due to system/application crashes or data corruption caused by soft errors. In response, application-level error detection and recovery schemes are devised to mitigate the impact of these interruptions. Evaluating these schemes and the reliability of an application requires the analysis of thousands of fault injection trials, resulting in a tedious and time-consuming process. Furthermore, no single data analysis tool works with all of the fault injection frameworks currently in use. In this paper, we present FaultSight, a fault injection analysis tool capable of efficiently assisting in the analysis of HPC application reliability as well as the effectiveness of resiliency schemes. FaultSight is designed to be flexible and to work with data coming from a variety of fault injection frameworks. The effectiveness of FaultSight is demonstrated by exploring the reliability of different versions of the Matrix-Matrix Multiplication kernel using two different fault injection tools. In addition, the detection and recovery schemes are highlighted for the HPCCG mini-app.
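The kind of data such an analysis tool consumes can be illustrated with a small fault injection campaign. The sketch below does not use FaultSight or any real fault injection framework. It assumes a single-bit-flip fault model applied to one input element of a matrix-matrix multiplication, and the names `flip_bit`, `classify`, and `campaign` are hypothetical.

```python
import math
import random
import struct
from collections import Counter


def flip_bit(value, bit):
    """Flip one bit (0..63) of a float's IEEE 754 binary64 representation."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def matmul(A, B):
    """Naive dense matrix-matrix multiplication on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def classify(result, golden, tol=1e-9):
    """Label one trial by comparing its output with the fault-free output."""
    flat = [x for row in result for x in row]
    if not all(math.isfinite(x) for x in flat):
        return "non-finite"                   # NaN or infinity: easy to detect
    ref = [y for row in golden for y in row]
    worst = max(abs(x - y) for x, y in zip(flat, ref))
    return "masked" if worst <= tol else "sdc"


def campaign(A, B, trials=200, seed=0):
    """Run independent trials, each with one random bit flip in one element of A."""
    rng = random.Random(seed)
    golden = matmul(A, B)
    tally = Counter()
    for _ in range(trials):
        i, j, bit = rng.randrange(len(A)), rng.randrange(len(A[0])), rng.randrange(64)
        faulty = [row[:] for row in A]
        faulty[i][j] = flip_bit(faulty[i][j], bit)
        tally[classify(matmul(faulty, B), golden)] += 1
    return tally


if __name__ == "__main__":
    print(campaign([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

Each trial yields a record (injection site, flipped bit, outcome). Aggregating thousands of such records by site, bit position, or outcome is the tedious step that the abstract says motivates a dedicated analysis tool.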


Citations: 0

Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors


Pub Date : 2019-11-01 DOI: 10.1109/FTXS49593.2019.00010

Chun-Kai Chang, Guanpeng Li, M. Erez

Hardware faults (i.e., soft errors) are projected to increase in modern HPC systems. The faults often lead to error propagation in programs and result in silent data corruptions (SDCs), seriously compromising system reliability. Selective instruction duplication, a widely used software-based error detector, has been shown to be effective in detecting SDCs with low performance overhead. In the past, researchers have relied on compiler intermediate representation (IR) for program reliability analysis and code transformation in selective instruction duplication. However, they assumed that IR-based analysis and protection are representative under realistic fault models (i.e., faults originating at lower hardware layers). Unfortunately, this assumption has not been fully validated, leading to questions about the accuracy and efficiency of the protection, since IR is a higher level of abstraction that is far removed from the hardware layers. In this paper, we verify the assumption by injecting realistic hardware faults into programs that are guided and protected by IR-based selective instruction duplication. We find that the protection yields high SDC coverage with low performance overhead even under realistic fault models, albeit with a small number of such faults escaping the detector. Our observations confirm that IR-based selective instruction duplication is a cost-effective method to protect programs from soft errors.
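The principle behind instruction duplication can be shown at source level, although the technique in the abstract operates on compiler IR rather than on Python code. In this sketch only the multiplication inside a dot product is selected for protection: it is computed twice and the two results are compared before use. The emulated fault (one flipped bit in the first result), the `DetectedError` exception, and the `dot` signature are all illustrative assumptions.

```python
class DetectedError(Exception):
    """Raised when the original and duplicated results disagree."""


def dot(xs, ys, protect=True, fault_at=None):
    """Integer dot product with optional duplication of the multiply.

    fault_at: index of the iteration whose product gets one bit flipped,
    emulating a soft error in the original (non-duplicated) instruction.
    """
    acc = 0
    for k, (x, y) in enumerate(zip(xs, ys)):
        prod = x * y
        if fault_at == k:
            prod ^= 1 << 4            # emulated bit flip in the result
        if protect:
            shadow = x * y            # duplicated instruction
            if prod != shadow:        # comparison inserted before the value is used
                raise DetectedError(f"mismatch at iteration {k}")
        acc += prod                   # the accumulation itself is left unprotected
    return acc


if __name__ == "__main__":
    print(dot([1, 2, 3], [4, 5, 6]))                             # fault-free: 32
    print(dot([1, 2, 3], [4, 5, 6], protect=False, fault_at=1))  # silent corruption: 48
    try:
        dot([1, 2, 3], [4, 5, 6], protect=True, fault_at=1)
    except DetectedError as err:
        print("detected:", err)
```

Protecting only some instructions keeps the overhead low, but a fault in an unprotected instruction, such as the accumulation here, would still escape the detector.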


Citations: 6
