
ACM SIGPLAN Symposium on Scala: Latest Publications

Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers
Pub Date: 2013-11-17 | DOI: 10.1145/2530268.2530269
T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey
With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi [2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms and for applications handling highly inhomogeneous data further impedes our ability to efficiently write code which performs and scales well. In this paper we present the advantages of using HPX [19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp [25] to implement a three-dimensional N-body simulation with local interactions. We compare scaling and performance results for this application when using the HPX and MPI backends for LibGeoDecomp. LibGeoDecomp is a Library for Geometric Decomposition codes implementing the idea of a user-supplied simulation model, where the library handles the spatial and temporal loops and the data storage. The presented results are acquired from various homogeneous and heterogeneous runs including up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer [1]. In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous, message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
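The latency hiding and constraint-based synchronization mentioned above come from expressing the computation as asynchronous tasks linked by futures. The minimal sketch below illustrates that style with plain C++ futures rather than HPX's own API (HPX exposes an analogous hpx::async / hpx::future interface); the halo and update functions are hypothetical stand-ins, not LibGeoDecomp code.

```cpp
// A minimal sketch (plain C++ futures, not HPX itself) of the asynchronous,
// future-based style: the halo exchange and the interior update of one time
// step run concurrently, and the boundary update waits only on the data it
// actually needs instead of a global barrier.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

using Halo = std::vector<double>;

// Stand-in for receiving ghost-zone data from a neighbouring node.
Halo receive_halo() { return Halo(4, 1.0); }

// Stand-in for the local, communication-free part of the update.
void update_interior(std::vector<double>& cells) {
    for (std::size_t i = 1; i + 1 < cells.size(); ++i) cells[i] += 0.5;
}

// Boundary cells can only be updated once the halo has arrived.
void update_boundary(std::vector<double>& cells, const Halo& halo) {
    cells.front() += halo.front();
    cells.back()  += halo.back();
}

int main() {
    std::vector<double> cells(16, 0.0);

    // Launch communication and interior compute concurrently.
    std::future<Halo> halo = std::async(std::launch::async, receive_halo);
    std::future<void> interior =
        std::async(std::launch::async, update_interior, std::ref(cells));

    // Constraint-based synchronization: this waits only on the halo; the
    // interior task writes a disjoint range of cells, so no race occurs.
    update_boundary(cells, halo.get());
    interior.get();

    std::cout << std::accumulate(cells.begin(), cells.end(), 0.0) << "\n";
}
```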
Citations: 37
On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations
Pub Date: 2013-11-17 | DOI: 10.1145/2530268.2530274
J. Strassburg, V. Alexandrov
This paper presents a Monte Carlo SPAI preconditioner. In contrast to the standard deterministic SPAI preconditioners that use the Frobenius norm, a Monte Carlo alternative is given that relies on Markov Chain Monte Carlo (MCMC) methods to compute a rough matrix inverse (MI). Monte Carlo methods enable a quick rough estimate of the non-zero elements of the inverse matrix with a given precision and certain probability. The advantage of this method is that the same approach is applied to sparse and dense matrices and that the complexity of the Monte Carlo matrix inversion is linear in the size of the matrix. The behaviour of the proposed algorithm is studied, its performance is investigated, and a comparison is made with the standard deterministic SPAI as well as with the optimized and parallel MSPAI version. Furthermore, Monte Carlo SPAI and MSPAI are used for solving systems of linear algebraic equations (SLAE) with BiCGSTAB, and a comparison of the results is made.
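The rough matrix inverse can be understood through the classic von Neumann-Ulam construction: entries of (I - C)^{-1} = sum_k C^k are estimated by random walks whose accumulated importance weights sample the Neumann series. The sketch below assumes uniform transition probabilities and a fixed absorption probability purely for clarity; it is not the paper's MCMC SPAI construction, which refines the transition probabilities and handles sparsity and filtering.

```cpp
// Illustrative von Neumann-Ulam estimator for (I - C)^{-1} via random walks.
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int n = 3;
    // C must have spectral radius < 1 for the Neumann series to converge.
    const double C[n][n] = {{0.1, 0.2, 0.0},
                            {0.0, 0.1, 0.3},
                            {0.2, 0.0, 0.1}};
    const double p_stop = 0.3;    // absorption probability per step
    const int walks = 200000;     // random walks per starting row

    std::mt19937 gen(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, n - 1);

    std::vector<std::vector<double>> inv(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int w = 0; w < walks; ++w) {
            int state = i;
            double weight = 1.0;
            inv[i][state] += weight;                   // k = 0 (identity) term
            while (coin(gen) > p_stop) {
                int next = pick(gen);                  // uniform transition
                weight *= C[state][next] * n / (1.0 - p_stop);  // importance weight
                state = next;
                inv[i][state] += weight;               // contributes a C^k term
            }
        }
        for (int j = 0; j < n; ++j) inv[i][j] /= walks;
    }

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) std::cout << inv[i][j] << " ";
        std::cout << "\n";
    }
}
```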
Citations: 8
Robust distributed orthogonalization based on randomized aggregation
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133177
W. Gansterer, Gerhard Niederbrucker, H. Straková, Stefan Schulze Grotthoff
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.
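The push-flow algorithm itself is not detailed in the abstract, but it belongs to the family of gossip-based aggregation schemes. As a point of reference, the simulated sketch below shows the well-known push-sum averaging protocol, in which every node repeatedly pushes half of its (sum, weight) state to a random peer and the ratio s/w converges to the global average; the flow-conservation state that gives push-flow its fault tolerance is deliberately omitted here.

```cpp
// Simulated push-sum gossip averaging on a complete graph of n nodes.
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int n = 8;
    std::vector<double> value = {3, 7, 1, 9, 4, 6, 2, 8};   // local inputs, average 5
    std::vector<double> s(value), w(n, 1.0);                // push-sum state per node

    std::mt19937 gen(1);
    std::uniform_int_distribution<int> pick(0, n - 1);

    for (int round = 0; round < 50; ++round) {
        std::vector<double> s_in(n, 0.0), w_in(n, 0.0);
        for (int i = 0; i < n; ++i) {
            int target = pick(gen);            // random peer
            s_in[i]      += 0.5 * s[i];        // keep half of the state ...
            w_in[i]      += 0.5 * w[i];
            s_in[target] += 0.5 * s[i];        // ... and push the other half
            w_in[target] += 0.5 * w[i];
        }
        s = s_in;
        w = w_in;
    }

    // Every node's estimate s/w converges to the global average.
    for (int i = 0; i < n; ++i)
        std::cout << "node " << i << " estimate: " << s[i] / w[i] << "\n";
}
```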
Citations: 10
On non-blocking collectives in 3D FFTs
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133180
R. S. Saksena
With the inclusion of non-blocking global collective operations in the MPI 3.0 draft specification, many fundamental algorithms, such as those for performing 3-dimensional (3D) FFTs, will be modified to take advantage of non-blocking collectives. Novel modifications to such fundamental algorithms will need to be suitable for incorporation in general-purpose FFT libraries to be routinely used by HPC application users. Here we present such a general-purpose algorithmic strategy to utilize non-blocking collective communications in the calculation of a single parallel 3D FFT. In this scheme, the global collective communication is partitioned into blocking and non-blocking components such that overlap between communication and computation is obtained in the 3D FFT calculation. We present benchmarks of our scheme for overlapping computation and communication in the calculation of single-variable 3D FFTs on two different architectures: (a) HECToR, a Cray XE6 machine, and (b) a Fujitsu PRIMERGY Intel Westmere cluster with InfiniBand interconnect.
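A common way to realize the described overlap is to pipeline the transpose: post the non-blocking all-to-all for the next slab of data while performing the local 1D FFTs on the slab that has already arrived. The hedged sketch below shows this pattern with MPI_Ialltoall from MPI 3.0; the slab count, message sizes, and compute_1d_ffts() placeholder are illustrative assumptions, not the paper's exact partitioning into blocking and non-blocking components.

```cpp
// Pipelined transpose sketch: while the local FFT work for slab s runs, the
// non-blocking all-to-all for slab s+1 already progresses in the background.
#include <mpi.h>
#include <vector>

void compute_1d_ffts(std::vector<double>& slab) {
    for (double& x : slab) x *= 2.0;   // stand-in for the local FFT work
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int slabs = 4;     // pipeline depth
    const int chunk = 1024;  // doubles exchanged with each rank per slab
    std::vector<std::vector<double>> send(slabs, std::vector<double>(chunk * nprocs, 1.0));
    std::vector<std::vector<double>> recv(slabs, std::vector<double>(chunk * nprocs, 0.0));
    std::vector<MPI_Request> req(slabs);

    // Start the transpose of slab 0 before entering the pipeline.
    MPI_Ialltoall(send[0].data(), chunk, MPI_DOUBLE,
                  recv[0].data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD, &req[0]);

    for (int s = 0; s < slabs; ++s) {
        if (s + 1 < slabs)   // post the next slab's transpose early ...
            MPI_Ialltoall(send[s + 1].data(), chunk, MPI_DOUBLE,
                          recv[s + 1].data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD, &req[s + 1]);
        MPI_Wait(&req[s], MPI_STATUS_IGNORE);
        compute_1d_ffts(recv[s]);   // ... and compute on the slab that has arrived
    }

    MPI_Finalize();
}
```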
Citations: 4
The low-power architecture approach towards exascale computing
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133175
Nikola Rajovic, Nikola Puzovic, L. Vilanova, Carlos Villavieja, Alex Ramírez
Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices to data centers and supercomputers, energy consumption limits the performance that can be offered. We are exploring an alternative to current supercomputers that builds on small, energy-efficient mobile processors. We present results from a prototype system based on the ARM Cortex-A9 and make projections about the possibilities for increasing energy efficiency.
Citations: 82
Soft error resilient QR factorization for hybrid system with GPGPU
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133179
Peng Du, P. Luszczek, S. Tomov, J. Dongarra
General-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault-tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
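Contribution (2) relies on the standard Givens-rotation reduction of an upper Hessenberg matrix: each rotation zeroes one subdiagonal entry while preserving the triangular structure above it. The minimal CPU sketch below shows that reduction only; the paper's optimized GPGPU utilities and the checksum protection of the right factor R are not reproduced here.

```cpp
// Givens rotations reduce an upper Hessenberg matrix to upper triangular form.
#include <cmath>
#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

void hessenberg_to_triangular(Matrix& H) {
    const std::size_t n = H.size();
    for (std::size_t k = 0; k + 1 < n; ++k) {
        double a = H[k][k], b = H[k + 1][k];
        if (b == 0.0) continue;                 // subdiagonal entry already zero
        double r = std::hypot(a, b);
        double c = a / r, s = b / r;            // rotation that annihilates H[k+1][k]
        for (std::size_t j = k; j < n; ++j) {   // apply it to rows k and k+1
            double t1 = H[k][j], t2 = H[k + 1][j];
            H[k][j]     =  c * t1 + s * t2;
            H[k + 1][j] = -s * t1 + c * t2;
        }
    }
}

int main() {
    Matrix H = {{4, 1, 2},
                {3, 5, 1},   // upper Hessenberg: a single nonzero subdiagonal
                {0, 2, 6}};
    hessenberg_to_triangular(H);
    for (const auto& row : H) {
        for (double x : row) std::cout << x << " ";
        std::cout << "\n";
    }
}
```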
Citations: 36
Performance analysis of a cardiac simulation code using IPM
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133186
P. Strazdins, M. Hegland
This paper details our experiences in performing a detailed performance analysis of a large-scale parallel cardiac simulation with the Chaste software on a Nehalem- and InfiniBand-based cluster. Our methodology achieves good accuracy for relatively modest amounts of cluster time. The use of sections in the Chaste internal profiler, coupled with the IPM tool, enabled some detailed insights into the performance and scalability of the application. For large core counts, our analysis showed that performance was no longer dominated by the linear systems solver. The computationally intensive components scaled well up to 2048 cores, while poorly scaling and highly imbalanced components associated with program output and miscellaneous functions limited scalability.
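Section-level profiling of this kind is typically done by bracketing code regions with profiler markers. The hedged sketch below assumes the commonly described IPM convention of marking user regions through MPI_Pcontrol with a label (level 1 to enter, -1 to leave); the region names and the assembly/solver/output stand-ins are purely illustrative and not taken from Chaste.

```cpp
// Hedged sketch: named regions so a profiler such as IPM can attribute time
// and MPI traffic to individual phases of the application.
#include <mpi.h>

void assemble_system() { /* stand-in for matrix assembly */ }
void solve_linear_system() { /* stand-in for the linear solver */ }
void write_output() { /* stand-in for (poorly scaling) program output */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Pcontrol(1, "assembly");    // enter a named section (assumed IPM convention)
    assemble_system();
    MPI_Pcontrol(-1, "assembly");   // leave it

    MPI_Pcontrol(1, "solver");
    solve_linear_system();
    MPI_Pcontrol(-1, "solver");

    MPI_Pcontrol(1, "output");
    write_output();
    MPI_Pcontrol(-1, "output");

    MPI_Finalize();
}
```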
Citations: 3
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133185
Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, T. Davies, Christer Karlsson, Zizhong Chen
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR), a traditional general technique to correct soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
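The on-line detection builds on the basic ABFT checksum invariant: if A is encoded with an extra checksum row, the product inherits it, so a single corrupted entry of C breaks the corresponding column-sum equality. The toy example below verifies that invariant once after injecting an error; the paper performs such checks periodically during the computation and integrates them with ATLAS-level kernels, which this dense, single-threaded illustration does not attempt, and the injected error location is arbitrary.

```cpp
// ABFT checksum invariant: the last row of Cf = [A ; e^T A] * B equals the
// column sums of C, so a corrupted entry shows up as a column-sum mismatch.
#include <cmath>
#include <iostream>

int main() {
    const int n = 3;
    const double A[n][n] = {{1, 2, 0}, {0, 1, 3}, {2, 1, 1}};
    const double B[n][n] = {{1, 0, 2}, {2, 1, 0}, {0, 1, 1}};

    // Encode A with a checksum row: Af = [A ; e^T A].
    double Af[n + 1][n] = {};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) { Af[i][j] = A[i][j]; Af[n][j] += A[i][j]; }

    // Cf = Af * B inherits the encoding.
    double Cf[n + 1][n] = {};
    for (int i = 0; i <= n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                Cf[i][j] += Af[i][k] * B[k][j];

    Cf[1][2] += 4.0;   // inject a soft error into one entry of C

    // Consistency check (performed periodically, during the computation, in the paper).
    for (int j = 0; j < n; ++j) {
        double col_sum = 0.0;
        for (int i = 0; i < n; ++i) col_sum += Cf[i][j];
        if (std::fabs(col_sum - Cf[n][j]) > 1e-9)
            std::cout << "soft error detected in column " << j << "\n";
    }
}
```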
Citations: 35
Layout-aware scientific computing: a case study using MILC
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133183
Jun He, J. Kowalkowski, M. Paterno, D. Holmgren, J. Simone, Xian-He Sun
Nowadays, high-performance computers have more cores and nodes than ever before. Computation is spread out among them, leading to more communication. For this reason, communication can easily become the bottleneck of a system and limit its scalability. The layout of an application on a computer is the key factor in preserving communication locality and reducing its cost. In this paper, we propose a simple model to optimize the layout for scientific applications by minimizing inter-node communication cost. The model takes into account the latency and bandwidth of the network and associates them with the dominant layout variables of the application. We take MILC as an example and analyze its communication patterns. According to our experimental results, the model developed for MILC achieved satisfactory accuracy in predicting the performance, leading to up to 31% performance improvement.
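A layout model of the kind described can be sketched as a simple latency/bandwidth (alpha-beta) cost estimate evaluated over candidate node grids. The example below enumerates 4D grids for a small lattice and keeps the one minimizing the estimated halo-exchange cost; the constants, lattice size, and cost formula are illustrative assumptions, not the calibrated model the paper develops for MILC.

```cpp
// Pick the 4D node grid that minimizes an alpha-beta estimate of halo cost.
#include <array>
#include <iostream>
#include <limits>

int main() {
    const std::array<int, 4> L = {32, 32, 32, 64};   // global 4D lattice
    const int P = 64;                                // number of nodes
    const double alpha = 2e-6;                       // latency per message [s]
    const double beta = 1e-9;                        // time per byte [s]
    const double bytes_per_site = 72.0;              // payload per boundary site

    double best = std::numeric_limits<double>::max();
    std::array<int, 4> best_grid{};

    for (int p0 = 1; p0 <= P; ++p0)
        for (int p1 = 1; p0 * p1 <= P; ++p1)
            for (int p2 = 1; p0 * p1 * p2 <= P; ++p2) {
                if (P % (p0 * p1 * p2) != 0) continue;
                const std::array<int, 4> grid = {p0, p1, p2, P / (p0 * p1 * p2)};

                std::array<int, 4> local{};
                bool divides = true;
                for (int d = 0; d < 4; ++d) {
                    if (L[d] % grid[d] != 0) { divides = false; break; }
                    local[d] = L[d] / grid[d];
                }
                if (!divides) continue;

                // One face exchanged in both directions per partitioned dimension.
                double cost = 0.0;
                for (int d = 0; d < 4; ++d) {
                    if (grid[d] == 1) continue;
                    double face = 1.0;
                    for (int e = 0; e < 4; ++e)
                        if (e != d) face *= local[e];
                    cost += 2.0 * (alpha + face * bytes_per_site * beta);
                }
                if (cost < best) { best = cost; best_grid = grid; }
            }

    std::cout << "best node grid: " << best_grid[0] << " x " << best_grid[1]
              << " x " << best_grid[2] << " x " << best_grid[3]
              << ", estimated halo cost " << best << " s per step\n";
}
```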
Citations: 7
Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract
Pub Date: 2011-11-14 | DOI: 10.1145/2133173.2133182
Rosa M. Badia
Current supercomputers are evolving into clusters with a very large number of nodes, and, what is more, the nodes themselves are becoming more complex, composed of several multicore chips and GPUs. With such architectures, application developers face an increasingly complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger numbers of cores and to be combined with CUDA or OpenCL to run efficiently on GPUs. To evolve a given application so that it is suitable to run on new heterogeneous supercomputers, application developers can take different alternatives: optimizations to improve the MPI bottlenecks, for example by using asynchronous communications; optimizations of the sequential code to improve its locality; or optimizations at the node level to avoid resource contention, to list a few. This paper proposes a methodology that enables current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that makes it possible to parallelize sequential applications by annotating the code with compiler directives. What is more important, it supports their execution on heterogeneous platforms, including clusters of GPUs. It also hybridizes nicely with MPI [1] and enables the overlap of communication and computation. The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform. Another relevant aspect is that the programming model offers application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data, and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring the data between the different memory spaces and for keeping them coherent. While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one might predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed to conform with the framework: Ssgrind, which helps identify tasks and the directionality of the tasks' parameters; Ayudame and Temanejo, which help debug StarSs applications; and Paraver, Cube and Scalasca, which enable a detailed performance analysis of the applications. The extended version of the paper will detail the programming methodology outlined here, illustrating it with examples.
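The DAG construction follows from annotating tasks with the direction (input/output) of their parameters. StarSs has its own directive syntax and a runtime that also covers distributed memory and GPUs; the sketch below uses standard OpenMP 4.0 task dependences only as an analogous, shared-memory illustration of how in/out annotations on blocks of data induce a task DAG.

```cpp
// Analogous illustration (OpenMP 4.0 task dependences, not StarSs syntax):
// in/out annotations on blocks of data let the runtime build a task DAG.
#include <cstdio>

void compute_block(double* blk, int len) {
    for (int i = 0; i < len; ++i) blk[i] += 1.0;
}

int main() {
    const int nblocks = 4, bs = 256;
    double block[nblocks][bs] = {};

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; ++b) {
            // Producer task: declared to write block b.
            #pragma omp task depend(out: block[b][0]) firstprivate(b)
            compute_block(block[b], bs);

            // Consumer task: runs only after the producer of block b has finished;
            // tasks on different blocks form independent branches of the DAG.
            #pragma omp task depend(in: block[b][0]) firstprivate(b)
            std::printf("block %d ready, first entry %.1f\n", b, block[b][0]);
        }
        #pragma omp taskwait
    }
}
```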
Citations: 4