A. Haidar, Yulu Jia, P. Luszczek, S. Tomov, A. YarKhan, J. Dongarra
A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one or more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to use these varied resources efficiently and productively. For example, to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore CPU. The domain scientist would have to design and schedule an application with multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model that starts from serial code and achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators. A. Haidar, Yulu Jia, P. Luszczek, S. Tomov, A. YarKhan, J. Dongarra. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832085
An enhanced version of a stochastic SParse Approximate Inverse (SPAI) preconditioner for general matrices is presented in this paper. This is a Monte Carlo preconditioner based on Markov Chain Monte Carlo (MCMC) methods that first computes a rough approximate matrix inverse, which can then be improved by an iterative filter process and a parallel refinement to enhance the accuracy of the inverse and of the preconditioner, respectively. The Monte Carlo preconditioner is then used to solve systems of linear algebraic equations (SLAEs), thus delivering hybrid stochastic/deterministic algorithms. The advantage of the proposed approach is that the sparse Monte Carlo matrix inversion has a computational complexity that is linear in the size of the matrix and is inherently parallel. It can therefore be computed very efficiently for large matrices and can also serve as an efficient preconditioner when solving SLAEs. Computational experiments on the Monte Carlo preconditioners and on the hybrid algorithms using BiCGSTAB and GMRES as SLAE solvers are presented, and the results are compared to those of MSPAI (a parallel and optimized version of the deterministic SPAI) and of MSPAI combined with BiCGSTAB and GMRES. The experiments are carried out on classes of matrices from the Matrix Market and show the efficiency of the proposed approach.
On efficient Monte Carlo preconditioners and hybrid Monte Carlo methods for linear algebra. V. Alexandrov, Oscar A. Esquivel-Flores. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832086
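As a rough illustration of how a random-walk estimate of a matrix inverse can work, the sketch below uses the textbook Ulam-von Neumann scheme for a matrix written as A = I - C, whose inverse is the Neumann series I + C + C^2 + ... when the absolute row sums of C are below 1. It is not taken from the paper: the function name `mc_inverse_row`, its parameters, and the choice of transition probabilities proportional to |c_ij| are all assumptions made for this example, and the filtering and refinement stages mentioned in the abstract are not shown.

```python
import random

def mc_inverse_row(C, i, walks=20000, eps=1e-6, max_steps=1000, rng=random):
    """Estimate row i of (I - C)^{-1} with weighted random walks.

    Assumes every absolute row sum of C is below 1, so walk weights shrink.
    """
    n = len(C)
    est = [0.0] * n
    for _ in range(walks):
        state, w = i, 1.0
        for _ in range(max_steps):
            if abs(w) <= eps:          # truncate the walk once its weight is negligible
                break
            est[state] += w            # contribution of the current power of C
            row = C[state]
            total = sum(abs(c) for c in row)
            if total == 0.0:           # absorbing row: no further terms
                break
            # Move to column j with probability |c_j| / total
            nxt = rng.choices(range(n), weights=[abs(c) for c in row])[0]
            # Importance weight c_j / p_j = sign(c_j) * total
            w *= row[nxt] / (abs(row[nxt]) / total)
            state = nxt
    return [e / walks for e in est]
```

Each row can be estimated independently of the others, which is the source of the inherent parallelism; the walks for different rows (or different batches of walks for one row) could be handed to separate workers.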
This paper discusses an efficient parallel implementation of the ensemble Kalman filter based on the modified Cholesky decomposition. The proposed implementation starts by decomposing the domain into sub-domains. In each sub-domain, a sparse estimate of the inverse background error covariance matrix is computed via a modified Cholesky decomposition; the estimates are computed concurrently on separate processors. The sparsity of this estimator is dictated by the conditional independence of model components for some radius of influence. The assimilation step is then carried out in parallel without the need for inter-processor communication. Once the local analysis states are computed, the analysis sub-domains are mapped back onto the global domain to obtain the analysis ensemble. Computational experiments are performed using the Atmospheric General Circulation Model (SPEEDY) with the T-63 resolution on the Blueridge cluster at Virginia Tech. The number of processors used in the experiments ranges from 96 to 2,048. The proposed implementation is more accurate than the well-known local ensemble transform Kalman filter (LETKF) for all model variables. The computational time of the proposed implementation is similar to that of the parallel LETKF method (where no covariance estimation is performed). Finally, for the largest number of processors, the proposed parallel implementation is 400 times faster than the serial version of the proposed method.
A parallel ensemble Kalman filter implementation based on modified Cholesky decomposition. E. Niño, Adrian Sandu, Xinwei Deng. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832084
As the transistor's feature size decreases following Moore's Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error-resilient iterative solvers for sparse linear systems. In contrast to most previous approaches, which are based on Krylov subspace methods, we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
Tuning stationary iterative solvers for fault resilience. H. Anzt, J. Dongarra, E. S. Quintana-Ortí. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832081
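To make the component-wise idea concrete, here is a minimal sketch of a Jacobi iteration in which every freshly computed component is checked on its own before it is accepted. The acceptance rule (finite value, bounded jump), the `max_jump` threshold, and the `corrupt` fault-injection hook are assumptions chosen for illustration and are not the authors' technique. A rejected component simply keeps its previous value for that sweep, so the solver behaves like an iteration that uses stale data for that component.

```python
import math

def jacobi_guarded(A, b, sweeps=200, max_jump=1e6, corrupt=None):
    """Jacobi iteration with a per-component sanity check.

    corrupt: optional hypothetical hook (sweep, i, value) -> value that
    simulates a bit-flip in a freshly computed component.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(sweeps):
        x_new = x[:]
        for i in range(n):
            # Standard Jacobi update for component i, using only old values
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            xi = (b[i] - s) / A[i][i]
            if corrupt is not None:
                xi = corrupt(k, i, xi)
            # Accept only finite values that do not jump implausibly far;
            # otherwise the stale value x[i] is kept for this sweep.
            if math.isfinite(xi) and abs(xi - x[i]) <= max_jump:
                x_new[i] = xi
        x = x_new
    return x
```

Because a rejected update only delays one component by one sweep, the cost of a detected fault shows up as a few extra sweeps rather than a restart, which matches the "convergence delay" the abstract refers to.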
Chander Iyer, H. Avron, G. Kollias, Y. Ineichen, C. Carothers, P. Drineas
We present a fast randomized least-squares solver for distributed-memory platforms. Our solver is based on the Blendenpik algorithm, but employs a batchwise randomized unitary transformation scheme. The batchwise transformation enables our algorithm to scale the vanilla distributed-memory implementation of Blendenpik by up to 3× and provides up to a 7.5× speedup over a state-of-the-art scalable least-squares solver based on the classic QR-based algorithm. Experimental evaluations on terabyte-scale matrices demonstrate excellent speedups on up to 16,384 cores on a Blue Gene/Q supercomputer.
A scalable randomized least squares solver for dense overdetermined systems. Chander Iyer, H. Avron, G. Kollias, Y. Ineichen, C. Carothers, P. Drineas. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832083
I. Yamazaki, S. Tomov, J. Kurzak, J. Dongarra, J. Barlow
The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
Mixed-precision block Gram Schmidt orthogonalization. I. Yamazaki, S. Tomov, J. Kurzak, J. Dongarra, J. Barlow. ACM SIGPLAN Symposium on Scala, 2015-11-15. https://doi.org/10.1145/2832080.2832082
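For readers unfamiliar with CholQR, the sketch below shows the standard (single working precision) algorithm in plain Python: form the Gram matrix G = A^T A, factor G = R^T R by Cholesky, and obtain Q = A R^{-1} by a triangular solve. The function name `cholqr` is an assumption for this example. The block and mixed-precision variants studied in the paper are not reproduced here; a mixed-precision version would typically carry out the Gram-matrix and Cholesky steps in a higher precision than the rest.

```python
import math

def cholqr(A):
    """Return (Q, R) with A = Q R, via Cholesky of the Gram matrix A^T A."""
    m, n = len(A), len(A[0])
    # Gram matrix G = A^T A (the only step that touches all m rows at once)
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    # Cholesky factorization G = R^T R with R upper triangular
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                R[i][i] = math.sqrt(s)   # fails if G is not numerically SPD
            else:
                R[i][j] = s / R[i][i]
    # Q = A R^{-1}: solve q R = a for every row a of A by forward substitution
    Q = [[0.0] * n for _ in range(m)]
    for r in range(m):
        for j in range(n):
            s = A[r][j] - sum(Q[r][k] * R[k][j] for k in range(j))
            Q[r][j] = s / R[j][j]
    return Q, R
```

In a distributed setting only the small n-by-n Gram matrix needs a global reduction, which is why the method is attractive when communication dominates; the price is that forming A^T A squares the condition number in working precision.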
Many simulations of evolutionary ordinary and partial differential equations require implicit time integration methods to avoid stability restrictions on the step size. The computation and communication costs associated with solving nonlinear systems at each time step dominate the total simulation cost. Rosenbrock-Krylov (Rosenbrock-K) methods alleviate this major bottleneck by using Krylov space approximations tightly coupled with the time discretization. This work studies the performance of Rosenbrock-K methods on accelerated hardware. GPU acceleration is used to expedite computations of the semi-discrete right-hand side and the linear algebra computations in the time integration method. A novel parallelization of the Arnoldi procedure for the construction of the Krylov-based approximations of the Jacobian matrix is presented. Rosenbrock-K methods' unique ability to operate almost entirely in a reduced space makes them especially suitable for efficient utilization of accelerated hardware, where standard implicit approaches may lead to systems too large for device memory.
CUDA acceleration of a matrix-free Rosenbrock-K method applied to the shallow water equations. P. Tranquilli, Ross Glandon, Adrian Sandu. ACM SIGPLAN Symposium on Scala, 2013-11-17. https://doi.org/10.1145/2530268.2530273
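The serial Arnoldi procedure that such methods build on can be sketched as follows. This is the textbook algorithm with modified Gram-Schmidt, written against a generic `matvec` callback so that the Jacobian never has to be stored; it is not the paper's parallel GPU version. The helper `fd_jvp`, which approximates a Jacobian-vector product by a forward finite difference of the right-hand side, and its step size `eps` are assumptions added to illustrate the matrix-free setting.

```python
import math

def fd_jvp(f, y, v, eps=1e-7):
    """Approximate J(y) v by (f(y + eps v) - f(y)) / eps (hypothetical helper)."""
    fy = f(y)
    fp = f([a + eps * b for a, b in zip(y, v)])
    return [(p - q) / eps for p, q in zip(fp, fy)]

def arnoldi(matvec, v0, m):
    """Build an orthonormal Krylov basis V and the (m+1) x m Hessenberg matrix H."""
    norm = math.sqrt(sum(v * v for v in v0))
    V = [[v / norm for v in v0]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(V[j])
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i][j] = sum(a * b for a, b in zip(V[i], w))
            w = [a - H[i][j] * b for a, b in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(sum(a * a for a in w))
        if H[j + 1][j] < 1e-14:                # invariant subspace found
            break
        V.append([a / H[j + 1][j] for a in w])
    return V, H
```

With a right-hand side function `f` and a state `y`, the callback would be `lambda v: fd_jvp(f, y, v)`. The small matrix H is the reduced-space stand-in for the Jacobian, which is why most of the remaining work involves only m-dimensional quantities.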
Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
CPU-GPU hybrid bidiagonal reduction with soft error resilience. Yulu Jia, P. Luszczek, G. Bosilca, J. Dongarra. ACM SIGPLAN Symposium on Scala, 2013-11-17. https://doi.org/10.1145/2530268.2530270
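The detect, locate, and correct cycle of Algorithm Based Fault Tolerance can be illustrated with the classic row-and-column checksum scheme on a plain matrix. The sketch below is a generic example under the assumption of at most one corrupted entry; the function names and the tolerance `tol` are made up for illustration, and it shows neither the bidiagonal reduction nor the reverse computation used in the paper.

```python
def checksums(M):
    """Return (row sums, column sums) of a matrix given as a list of rows."""
    rows = [sum(r) for r in M]
    cols = [sum(c) for c in zip(*M)]
    return rows, cols

def detect_and_correct(M, rows, cols, tol=1e-9):
    """Repair a single corrupted entry of M in place.

    rows, cols are checksums taken while M was known to be good.
    Returns the (i, j) position that was corrected, or None if M is consistent.
    """
    new_rows, new_cols = checksums(M)
    bad_r = [i for i, (a, b) in enumerate(zip(new_rows, rows)) if abs(a - b) > tol]
    bad_c = [j for j, (a, b) in enumerate(zip(new_cols, cols)) if abs(a - b) > tol]
    if not bad_r and not bad_c:
        return None                      # detect: nothing wrong
    if len(bad_r) != 1 or len(bad_c) != 1:
        raise ValueError("more than one corrupted entry; cannot correct")
    i, j = bad_r[0], bad_c[0]            # locate: intersection of bad row and column
    M[i][j] -= new_rows[i] - rows[i]     # correct: subtract the checksum mismatch
    return i, j
```

In a real factorization the checksums are carried through the computation as extra rows and columns of the matrix, so that they stay valid as the matrix is transformed; that bookkeeping is what keeps the overhead small relative to the O(n^3) work.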
With the increasing number of components in HPC systems, transient faults will become commonplace. Today, transient network faults, such as lost or corrupted network packets, are addressed by middleware libraries at the cost of high memory usage and packet retransmissions. These costs, however, can be eliminated using application-level fault tolerance. In this paper, we propose recovery methods for transient network faults at the application level. These methods reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1%, which demonstrates that network faults can be handled at the application level with low perturbation in applications that have smoothness in their computed data.
A study of application-level recovery methods for transient network faults. I. Laguna, E. León, M. Schulz, M. Stephenson. ACM SIGPLAN Symposium on Scala, 2013-11-17. https://doi.org/10.1145/2530268.2530271
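A small example may help show what reconstruction by interpolation means in practice. The sketch below assumes that lost or corrupted values in a received buffer have already been flagged as `None` (how a fault is detected is outside its scope) and fills each gap by linear interpolation between the nearest valid neighbours. The function name `reconstruct` and the choice of linear interpolation are assumptions for illustration; the paper may use other interpolation schemes.

```python
def reconstruct(samples):
    """Fill None entries by linear interpolation between the nearest valid values.

    Gaps at either end are filled with the closest valid value.
    """
    known = [i for i, v in enumerate(samples) if v is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    out = list(samples)
    for i in range(len(samples)):
        if out[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = samples[right]          # gap at the start
        elif right is None:
            out[i] = samples[left]           # gap at the end
        else:
            t = (i - left) / (right - left)  # relative position inside the gap
            out[i] = samples[left] + t * (samples[right] - samples[left])
    return out
```

The reconstruction error depends on how smooth the underlying data is: for slowly varying data a linear fill is close to the lost values, while for rapidly oscillating data it is not, which is consistent with the abstract's caveat about smoothness.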
We show how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property imbues the system with a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require small amounts of fault-detection, e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults. Beyond the specific findings of this paper, we believe self-stabilization has promise to become a useful tool for constructing resilient solvers more generally.
Self-stabilizing iterative solvers. Piyush Sao, R. Vuduc. ACM SIGPLAN Symposium on Scala, 2013-11-17. https://doi.org/10.1145/2530268.2530272
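The self-stabilization property can be illustrated with a deliberately simple steepest descent loop for a symmetric positive definite system. This is a sketch under stated assumptions and not the authors' algorithm: the residual is recomputed from the current iterate in every step, so any finite iterate is a valid state from which the method converges, and the only fault detection is a check for NaNs and infinities. The `fault` hook that corrupts the iterate is hypothetical.

```python
import math

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def steepest_descent(A, b, iters=500, fault=None):
    """Steepest descent for SPD A that tolerates corruption of the iterate x.

    fault: optional hypothetical hook (k, x) that may modify x in place.
    """
    x = [0.0] * len(b)
    for k in range(iters):
        if fault is not None:
            fault(k, x)
        # Minimal fault detection: replace NaN/inf by an arbitrary finite value
        x = [v if math.isfinite(v) else 0.0 for v in x]
        # The residual is recomputed from x, so it cannot drift away from it
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        rr = sum(v * v for v in r)
        if rr == 0.0:
            continue                      # already exact; nothing to update
        Ar = matvec(A, r)
        alpha = rr / sum(u * v for u, v in zip(r, Ar))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x
```

A corrupted but finite iterate is never detected here; it is simply treated as a new starting point, and the cost of the fault is the extra iterations needed to converge again. Methods such as CG keep additional recurrence state (search directions, recursively updated residuals), so making them self-stabilizing requires more than this sketch shows.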