
ACM SIGPLAN Symposium on Scala: Latest Publications

Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832085
A. Haidar, Yulu Jia, P. Luszczek, S. Tomov, A. YarKhan, J. Dongarra
A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
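The dynamic scheduling idea can be illustrated with a minimal sketch. It is not the runtime described in the abstract: the function name `schedule`, the per-device speed weights, and the greedy earliest-finish rule are all assumptions chosen for illustration.

```python
def schedule(task_costs, device_speeds):
    """Greedy dynamic scheduling sketch (hypothetical helper).

    task_costs: work per task, in arbitrary units.
    device_speeds: assumed relative speed (weight) of each device.
    Each task is given to the device that would finish it earliest.
    """
    free_at = {name: 0.0 for name in device_speeds}  # time each device becomes idle
    assignment = []
    for cost in task_costs:
        # Finish time of this task on every device; ties are broken by device name.
        finish, name = min(
            (free_at[d] + cost / device_speeds[d], d) for d in device_speeds
        )
        free_at[name] = finish
        assignment.append(name)
    return assignment, max(free_at.values())


if __name__ == "__main__":
    plan, makespan = schedule([1.0] * 10, {"cpu": 1.0, "gpu": 4.0})
    print(plan, makespan)
```

With ten unit-cost tasks and a GPU weighted four times faster than the CPU, the sketch sends most tasks to the GPU while still keeping the CPU busy. A real task-superscalar runtime would also track data dependencies and transfer costs, which this sketch ignores.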
Citations: 2
On efficient Monte Carlo preconditioners and hybrid Monte Carlo methods for linear algebra
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832086
V. Alexandrov, Oscar A. Esquivel-Flores
An enhanced version of a stochastic SParse Approximate Inverse (SPAI) preconditioner for general matrices is presented in this paper. This is a Monte Carlo preconditioner based on Markov Chain Monte Carlo (MCMC) methods that first computes a rough approximate matrix inverse, which can then be optimized by an iterative filter process and a parallel refinement to enhance the accuracy of the inverse and the preconditioner, respectively. The Monte Carlo preconditioner is then used to solve systems of linear algebraic equations (SLAEs), thus delivering hybrid stochastic/deterministic algorithms. The advantage of the proposed approach is that the sparse Monte Carlo matrix inversion has a computational complexity linear in the size of the matrix and is inherently parallel, so it can be obtained very efficiently for large matrices and can also be used as an efficient preconditioner when solving systems of linear algebraic equations. Computational experiments on the Monte Carlo preconditioners and the hybrid algorithms using BiCGSTAB and GMRES as SLAE solvers are presented, and the results are compared to those of MSPAI (a parallel and optimized version of the deterministic SPAI) and of combined MSPAI with BiCGSTAB and GMRES approaches to solving SLAEs. The experiments are carried out on classes of matrices from the Matrix Market and show the efficiency of the proposed approach.
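The basic idea of Monte Carlo matrix inversion can be sketched with the classic random-walk estimator of the Neumann series (I - C)^{-1} = I + C + C^2 + ... This is a textbook illustration, not the enhanced preconditioner of the paper; the function name `mc_inverse`, the uniform transition probabilities, and the fixed walk length are assumptions, and the series must converge for the estimate to be meaningful.

```python
import random


def mc_inverse(C, walks=10000, length=20, seed=0):
    """Estimate (I - C)^{-1} row by row with truncated random walks (sketch)."""
    rng = random.Random(seed)
    n = len(C)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for _walk in range(walks):
            state, weight = i, 1.0
            inv[i][state] += weight            # k = 0 term of the series (identity)
            for _step in range(length):
                nxt = rng.randrange(n)         # uniform transition, probability 1/n
                weight *= C[state][nxt] * n    # importance weight c_ij / p_ij
                state = nxt
                inv[i][state] += weight        # contributes to the (C^k)_{i,state} term
        inv[i] = [v / walks for v in inv[i]]   # average over all walks from row i
    return inv


if __name__ == "__main__":
    C = [[0.1, 0.2], [0.3, 0.1]]
    for row in mc_inverse(C):
        print(row)
```

Each row is estimated independently of the others, which is the property that makes the approach easy to parallelize.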
Citations: 0
A parallel ensemble Kalman filter implementation based on modified Cholesky decomposition
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832084
E. Niño, Adrian Sandu, Xinwei Deng
This paper discusses an efficient parallel implementation of the ensemble Kalman filter based on the modified Cholesky decomposition. The proposed implementation starts with decomposing the domain into sub-domains. In each sub-domain a sparse estimation of the inverse background error covariance matrix is computed via a modified Cholesky decomposition; the estimates are computed concurrently on separate processors. The sparsity of this estimator is dictated by the conditional independence of model components for some radius of influence. Then, the assimilation step is carried out in parallel without the need of inter-processor communication. Once the local analysis states are computed, the analysis sub-domains are mapped back onto the global domain to obtain the analysis ensemble. Computational experiments are performed using the Atmospheric General Circulation Model (SPEEDY) with the T-63 resolution on the Blueridge cluster at Virginia Tech. The number of processors used in the experiments ranges from 96 to 2,048. The proposed implementation outperforms in terms of accuracy the well-known local ensemble transform Kalman filter (LETKF) for all the model variables. The computational time of the proposed implementation is similar to that of the parallel LETKF method (where no covariance estimation is performed). Finally, for the largest number of processors, the proposed parallel implementation is 400 times faster than the serial version of the proposed method.
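A minimal sketch of a sparse inverse covariance estimate via a modified Cholesky decomposition follows. It assumes a one-dimensional state in which each component is regressed only on its immediate predecessor (a radius of influence of one), so the regression is scalar; the paper's neighbourhoods and its domain decomposition are not reproduced, and the function name is hypothetical.

```python
def banded_inverse_covariance(ensemble):
    """Estimate B^{-1} as T^T D^{-1} T from ensemble members (sketch).

    ensemble: list of N members, each a list of n state components.
    T is unit lower bidiagonal (negated regression coefficients),
    D holds the residual variances of the regressions.
    """
    N, n = len(ensemble), len(ensemble[0])
    means = [sum(m[i] for m in ensemble) / N for i in range(n)]
    # X[i] holds the centred samples of component i.
    X = [[m[i] - means[i] for m in ensemble] for i in range(n)]

    def cov(u, v):
        return sum(a * b for a, b in zip(u, v)) / (N - 1)

    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    D[0] = cov(X[0], X[0])
    for i in range(1, n):
        # Regress component i on component i-1 only.
        beta = cov(X[i], X[i - 1]) / cov(X[i - 1], X[i - 1])
        resid = [a - beta * b for a, b in zip(X[i], X[i - 1])]
        T[i][i - 1] = -beta
        D[i] = cov(resid, resid)
    # Assemble T^T D^{-1} T; the result is tridiagonal by construction.
    return [
        [sum(T[k][i] * T[k][j] / D[k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
```

For two components the estimate coincides with the exact inverse of the sample covariance; for longer states, entries outside the band are exactly zero, which is where the sparsity comes from.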
Citations: 21
Tuning stationary iterative solvers for fault resilience
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832081
H. Anzt, J. Dongarra, E. S. Quintana‐Ortí
As the transistor's feature size decreases following Moore's Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error-resilient iterative solvers for sparse linear systems. In contrast to most previous approaches, which are based on Krylov subspace methods, we analyze stationary component-wise relaxation for this purpose. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
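A component-wise check of this kind can be sketched as follows. The acceptance rule (finite and below a magnitude bound), the `fault` hook used to simulate a bit-flip, and the function name are assumptions for illustration, not the paper's actual criterion. A rejected component simply keeps its previous value, so that component lags one sweep behind, which is what makes the iteration asynchronous.

```python
import math


def resilient_jacobi(A, b, iters=100, bound=1e6, fault=None):
    """Jacobi iteration that rejects suspicious component updates (sketch).

    fault: hypothetical hook fault(k, i) -> bool that simulates a bit-flip
    in the update of component i at sweep k.
    """
    n = len(b)
    x = [0.0] * n
    rejected = 0
    for k in range(iters):
        x_new = list(x)
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            xi = (b[i] - s) / A[i][i]
            if fault is not None and fault(k, i):
                xi *= 2.0 ** 200          # simulated flip of a high exponent bit
            if math.isfinite(xi) and abs(xi) <= bound:
                x_new[i] = xi             # accept the update
            else:
                rejected += 1             # keep the old value for this component
        x = x_new
    return x, rejected


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]
    print(resilient_jacobi(A, b, fault=lambda k, i: k == 3 and i == 1))
```

The bound would have to be chosen from problem knowledge; a flip in a low mantissa bit passes the check but is damped by the contraction of the iteration itself.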
Citations: 15
A scalable randomized least squares solver for dense overdetermined systems
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832083
Chander Iyer, H. Avron, G. Kollias, Y. Ineichen, C. Carothers, P. Drineas
We present a fast randomized least-squares solver for distributed-memory platforms. Our solver is based on the Blendenpik algorithm, but employs a batchwise randomized unitary transformation scheme. The batchwise transformation enables our algorithm to scale the distributed memory vanilla implementation of Blendenpik by up to ×3 and provides up to ×7.5 speedup over a state-of-the-art scalable least-squares solver based on the classic QR based algorithm. Experimental evaluations on terabyte scale matrices demonstrate excellent speedups on up to 16384 cores on a Blue Gene/Q supercomputer.
Citations: 3
Mixed-precision block Gram-Schmidt orthogonalization
Pub Date : 2015-11-15 DOI: 10.1145/2832080.2832082
I. Yamazaki, S. Tomov, J. Kurzak, J. Dongarra, J. Barlow
The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which could significantly increase the computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a dramatic impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining numerical errors of about the same order.
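The underlying CholQR step can be sketched in a few lines: form the Gram matrix G = A^T A, factor G = R^T R, and set Q = A R^{-1}. The sketch below is plain Python for a small dense matrix. Using `math.fsum` for the accumulations is only a crude stand-in (an assumption of this sketch) for the higher-precision arithmetic the abstract refers to, and none of the block variants are shown.

```python
import math


def chol_qr(A):
    """Cholesky QR of a tall matrix A given as a list of rows (sketch)."""
    m, n = len(A), len(A[0])
    # Gram matrix G = A^T A, with exactly rounded summation of the products.
    G = [
        [math.fsum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]
    # Cholesky factorization G = R^T R with R upper triangular.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - math.fsum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                R[i][i] = math.sqrt(s)   # raises ValueError if G is not numerically SPD
            else:
                R[i][j] = s / R[i][i]
    # Q = A R^{-1}: solve q R = a for every row a of A.
    Q = []
    for a in A:
        q = [0.0] * n
        for j in range(n):
            q[j] = (a[j] - math.fsum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q, R


if __name__ == "__main__":
    Q, R = chol_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(Q)
    print(R)
```

Because only the small n-by-n Gram matrix has to be reduced across processes, the method needs a single global reduction, which is the communication advantage mentioned in the abstract.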
Citations: 9
CUDA acceleration of a matrix-free Rosenbrock-K method applied to the shallow water equations
Pub Date : 2013-11-17 DOI: 10.1145/2530268.2530273
P. Tranquilli, Ross Glandon, Adrian Sandu
Many simulations of evolutionary ordinary and partial differential equations require implicit time integration methods to avoid stability restrictions on the step size. The computation and communication costs associated with solving nonlinear systems at each time step dominate the total simulation cost. Rosenbrock-Krylov (Rosenbrock-K) methods alleviate this major bottleneck by using Krylov space approximations tightly coupled with the time discretization. This work studies the performance of Rosenbrock-K methods on accelerated hardware. GPU acceleration is used to expedite computations of the semi-discrete right-hand side and the linear-algebra computations in the time integration method. A novel parallelization of the Arnoldi procedure for the construction of the Krylov-based approximations of the Jacobian matrix is presented. The unique ability of Rosenbrock-K methods to operate almost entirely in a reduced space makes them especially suitable for efficient utilization of accelerated hardware, where standard implicit approaches may lead to systems too large for device memory.
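A serial sketch of a matrix-free Arnoldi procedure is given below. The Jacobian is never formed; its action on a vector is approximated by a forward finite difference of the right-hand side `f`. The function names, the difference step `eps`, and the breakdown threshold are assumptions, and nothing here reflects the paper's CUDA parallelization.

```python
import math


def jvp(f, y, v, eps=1e-7):
    """Approximate J(y) v by (f(y + eps*v) - f(y)) / eps (matrix-free product)."""
    fy = f(y)
    fp = f([yi + eps * vi for yi, vi in zip(y, v)])
    return [(a - b) / eps for a, b in zip(fp, fy)]


def arnoldi(f, y, r, m):
    """Build an orthonormal Krylov basis V and upper Hessenberg H (sketch).

    y: current state, r: starting vector, m: number of Arnoldi steps.
    """
    beta = math.sqrt(sum(x * x for x in r))
    V = [[x / beta for x in r]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = jvp(f, y, V[j])
        for i in range(j + 1):              # modified Gram-Schmidt sweep
            H[i][j] = sum(a * b for a, b in zip(w, V[i]))
            w = [a - H[i][j] * b for a, b in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(sum(x * x for x in w))
        if H[j + 1][j] < 1e-12:             # breakdown: Krylov space is exhausted
            break
        V.append([x / H[j + 1][j] for x in w])
    return V, H
```

The small matrix H is the reduced-space approximation of the Jacobian, so the subsequent linear algebra works on an m-by-m problem instead of the full system.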
Citations: 2
CPU-GPU hybrid bidiagonal reduction with soft error resilience
Pub Date : 2013-11-17 DOI: 10.1145/2530268.2530270
Yulu Jia, P. Luszczek, G. Bosilca, J. Dongarra
Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur.
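The checksum idea behind Algorithm Based Fault Tolerance can be illustrated on a plain matrix. This sketch only shows how row and column checksums detect, locate, and correct a single corrupted entry; it does not cover how checksums are maintained through the bidiagonal reduction or the reverse computation step, and the function names are hypothetical.

```python
def encode(M):
    """Return row and column checksums of matrix M (list of rows)."""
    row_sums = [sum(r) for r in M]
    col_sums = [sum(c) for c in zip(*M)]
    return row_sums, col_sums


def detect_and_correct(M, row_sums, col_sums, tol=1e-9):
    """Locate and repair a single corrupted entry of M in place (sketch).

    Returns the (row, column) that was repaired, or None if M is consistent.
    """
    bad_rows = [i for i, r in enumerate(M) if abs(sum(r) - row_sums[i]) > tol]
    bad_cols = [j for j, c in enumerate(zip(*M)) if abs(sum(c) - col_sums[j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one corrupted entry; cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    M[i][j] -= sum(M[i]) - row_sums[i]   # subtract the checksum mismatch
    return (i, j)


if __name__ == "__main__":
    M = [[1.0, 2.0], [3.0, 4.0]]
    rs, cs = encode(M)
    M[1][0] += 64.0                      # simulated bit-flip
    print(detect_and_correct(M, rs, cs), M)
```

The intersection of the mismatching row and column gives the location of the error, and the size of the mismatch gives the correction.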
Citations: 7
A study of application-level recovery methods for transient network faults
Pub Date : 2013-11-17 DOI: 10.1145/2530268.2530271
I. Laguna, E. León, M. Schulz, M. Stephenson
With the increasing number of components in HPC systems, transient faults will become commonplace. Today, network transient faults, such as lost or corrupted network packets, are addressed by middleware libraries at the cost of high memory usage and packet retransmissions. These costs, however, can be eliminated using application-level fault tolerance. In this paper, we propose recovery methods for transient network faults at the application level. These methods reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1%, which demonstrates that network faults can be handled at the application level with low perturbation in applications whose computed data is smooth.
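Reconstruction by interpolation can be sketched for a one-dimensional buffer in which lost values are marked as `None`. Linear interpolation between the nearest surviving neighbours is an assumption of this sketch; the abstract does not state which interpolation scheme is used, and the function name is hypothetical.

```python
def reconstruct(values):
    """Fill None entries by linear interpolation between known neighbours (sketch).

    Missing values at either end are copied from the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("nothing to interpolate from")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        lo = max((k for k in known if k < i), default=None)
        hi = min((k for k in known if k > i), default=None)
        if lo is None:
            out[i] = values[hi]
        elif hi is None:
            out[i] = values[lo]
        else:
            out[i] = values[lo] + (values[hi] - values[lo]) * (i - lo) / (hi - lo)
    return out


if __name__ == "__main__":
    print(reconstruct([0.0, None, 2.0, None, None, 5.0]))
```

The reconstruction is only as good as the smoothness of the data: for rapidly varying data the interpolated values can be far from the lost ones.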
Citations: 0
Self-stabilizing iterative solvers
Pub Date : 2013-11-17 DOI: 10.1145/2530268.2530272
Piyush Sao, R. Vuduc
We show how to use the idea of self-stabilization, which originates in the context of distributed control, to build fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property gives the system a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require only a small amount of fault detection; for example, we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults. Beyond the specific findings of this paper, we believe self-stabilization promises to become a useful tool for constructing resilient solvers more generally.
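To make the idea concrete, here is a minimal sketch of a steepest-descent loop with a self-stabilizing flavor. It is an illustration written for this listing, not the authors' algorithm: it assumes a symmetric positive definite matrix, recomputes the residual from the current iterate at every step so that any finite iterate is a valid starting state, and checks only for NaNs and infinities. The function name `steepest_descent` and the `fault_at` test hook are hypothetical.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steepest_descent(A, b, iters=200, tol=1e-12, fault_at=None):
    """Steepest descent for SPD A, recomputing the residual from x each step.

    fault_at is a hypothetical test hook: an (iteration, index, value)
    triple that overwrites x[index] to simulate a transient fault.
    """
    x = [0.0] * len(b)
    for k in range(iters):
        if fault_at is not None and k == fault_at[0]:
            x[fault_at[1]] = fault_at[2]  # inject the simulated fault
        # Cheap fault detection: only NaNs and infinities are checked.
        if not all(math.isfinite(v) for v in x):
            x = [v if math.isfinite(v) else 0.0 for v in x]
        # The residual depends on x alone, so any finite x is a valid state.
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if rr <= tol * tol:
            break
        alpha = rr / dot(r, matvec(A, r))  # exact line search for SPD A
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x
```

Because no state other than `x` is carried between iterations, a corrupted but finite entry only moves the iterate to a different starting point, and the usual convergence argument for steepest descent still applies from there. Conjugate gradients carries extra state (the search direction and a recurrence for the residual), so it needs more than this to recover from a fault.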
Citations: 98