
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems: Latest Publications

Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
V. Alexandrov, A. Geist, J. Dongarra
Novel scalable scientific algorithms are needed to enable key science applications to exploit the computational power of large-scale systems. This is especially true for the current tier of leading petascale machines and the road to exascale computing, as HPC systems continue to scale up in compute node and processor core count. These extreme-scale systems require novel scientific algorithms to hide network and memory latency, have very high computation/communication overlap, have minimal communication, and have no synchronization points. With the advent of Big Data in the past few years, the need for scalable mathematical methods and algorithms that can handle data- and compute-intensive applications at scale has become even more important. Scientific algorithms for multi-petaflop and exaflop systems also need to be fault tolerant and fault resilient, since the probability of faults increases with scale. Resilience at the system software and at the algorithmic level is needed as a crosscutting effort. Finally, with the advent of heterogeneous compute nodes that employ standard processors as well as GPGPUs, scientific algorithms need to match these architectures to extract the most performance. This includes different system-specific levels of parallelism as well as co-scheduling of computation. Key science applications require novel mathematics, mathematical models, and system software that address the scalability and resilience challenges of current- and future-generation extreme-scale HPC systems. The goal of this workshop is to bring together experts in the area of scalable algorithms to present the latest achievements and to discuss the challenges ahead.
Citations: 1
Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms
Y. Idomura, Takuya Ina, Akie Mayumi, S. Yamada, Kazuya Matsumoto, Y. Asahi, Toshiyuki Imamura
A communication-avoiding generalized minimal residual (CA-GMRES) method is applied to the gyrokinetic toroidal five-dimensional Eulerian code GT5D, and its performance is compared against the original code, which uses a generalized conjugate residual (GCR) method, on the JAEA ICEX (Haswell), the Plasma Simulator (FX100), and the Oakforest-PACS (KNL). Although the CA-GMRES method dramatically reduces the number of data reduction communications, the amount of computation increases substantially compared with the GCR method. To resolve this issue, we propose a modified CA-GMRES method, which reduces both computation and memory access by ~30% while keeping the same CA property as the original CA-GMRES method. The modified CA-GMRES method has ~3.8X higher arithmetic intensity than the GCR method and is thus suitable for future exascale architectures with limited memory and network bandwidths. The CA-GMRES solver is implemented using a hybrid CA approach, in which we apply CA to data reduction communications and use communication overlap for halo data communications, and is highly optimized for distributed caches on KNL. It is shown that, compared with the GCR solver, its computing kernels are accelerated by 1.47X~2.39X, and the cost of data reduction communication is reduced from 5%~13% to ~1% of the total cost at 1,280 nodes.
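For readers unfamiliar with the communication-avoiding idea, the following is a generic sketch of s-step GMRES, not the GT5D-specific modification proposed in the paper. Standard GMRES orthogonalizes one new Krylov vector per iteration, which costs one global reduction (dot products) per iteration; CA-GMRES instead generates a block of s basis vectors at once,

    \[ V_s = [\, v,\; Av,\; A^2 v,\; \dots,\; A^s v \,], \]

(in practice a better-conditioned basis than the monomial one is used) and orthogonalizes the whole block with a tall-skinny QR factorization, so only one global reduction is needed per s iterations. The arithmetic-intensity figure quoted in the abstract is the usual flops-per-byte ratio,

    \[ \mathrm{AI} = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}, \]

so a ~3.8X higher AI means the modified solver performs roughly 3.8X more work per byte transferred, which is what makes it attractive when memory and network bandwidth, rather than compute, are the bottleneck.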
Citations: 7
Flexible batched sparse matrix-vector product on GPUs
H. Anzt, Gary Collins, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí
We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices that differ in size, nonzero count, and nonzero distribution. Furthermore, they support the three most commonly used sparse storage formats: CSR, COO, and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.
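As a rough illustration of the flexibility requirement (matrices of different sizes and nonzero counts sharing a single kernel launch), the following minimal CUDA sketch assigns one thread block per matrix of a concatenated CSR batch. The names and data layout are hypothetical, and the kernel is far simpler than the optimized routines described in the paper.

    #include <cuda_runtime.h>

    // Minimal sketch of a flexible batched CSR SpMV: one thread block per matrix.
    // The batch is stored as concatenated CSR arrays; each matrix's row pointers
    // use local indexing (starting at 0 within its own value/column slice).
    // Square matrices are assumed for brevity so one offset serves x and y.
    __global__ void batched_csr_spmv(const int*    __restrict__ row_ptr,        // concatenated row pointers
                                     const int*    __restrict__ col_idx,        // concatenated column indices
                                     const double* __restrict__ vals,           // concatenated nonzero values
                                     const double* __restrict__ x,              // concatenated input vectors
                                     double*       __restrict__ y,              // concatenated output vectors
                                     const int*    __restrict__ rowptr_offset,  // start of each matrix's row_ptr
                                     const int*    __restrict__ nnz_offset,     // start of each matrix's nonzeros
                                     const int*    __restrict__ vec_offset,     // start of each matrix's vectors
                                     const int*    __restrict__ num_rows)       // rows per matrix
    {
        const int b = blockIdx.x;                     // which matrix of the batch
        const int*    rp = row_ptr + rowptr_offset[b];
        const int*    ci = col_idx + nnz_offset[b];
        const double* va = vals    + nnz_offset[b];
        const double* xb = x       + vec_offset[b];
        double*       yb = y       + vec_offset[b];

        // Each thread sweeps over rows of its matrix; matrices may have
        // different sizes and nonzero counts without changing the launch.
        for (int row = threadIdx.x; row < num_rows[b]; row += blockDim.x) {
            double sum = 0.0;
            for (int k = rp[row]; k < rp[row + 1]; ++k)
                sum += va[k] * xb[ci[k]];
            yb[row] = sum;
        }
    }

A single launch such as batched_csr_spmv<<<batch_size, 128>>>(...) then processes the whole batch; the paper's kernels additionally cover COO and ELL and balance work across matrices with very different nonzero distributions.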
Citations: 4
Investigating half precision arithmetic to accelerate dense linear system solvers
A. Haidar, Panruo Wu, S. Tomov, J. Dongarra
The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is the high performance that can be achieved with it on today's powerful manycore GPU accelerators, such as the NVIDIA V100, which can provide 120 TeraFLOPS in FP16 alone. We present an investigation showing that other HPC applications can harness this power too, in particular for the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for the first time how the use of FP16 arithmetic can significantly accelerate FP32- or FP64-precision Ax = b solvers, as well as make them more energy efficient. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach.
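The underlying loop is classical mixed-precision iterative refinement; a generic sketch (not the MAGMA-specific variant developed in the paper) reads:

    \[
    \begin{aligned}
      & LU \approx A && \text{(factor once in low precision, e.g. FP16/FP32; the } O(n^3) \text{ cost)} \\
      & x_0 = U^{-1} L^{-1} b && \\
      & \text{for } k = 0, 1, 2, \dots && \\
      & \quad r_k = b - A x_k && \text{(residual in FP64; } O(n^2) \text{ work)} \\
      & \quad d_k = U^{-1} L^{-1} r_k && \text{(correction using the low-precision factors)} \\
      & \quad x_{k+1} = x_k + d_k && \text{(update in FP64)} \\
      & \text{until } \|r_k\| \text{ is sufficiently small.} &&
    \end{aligned}
    \]

The expensive factorization runs at FP16 speed while the cheap residual and update steps restore FP32/FP64 accuracy, provided the refinement converges, which depends on the conditioning of A.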
Citations: 53
Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs
D. Appelhans, B. Walkup
In this paper we demonstrate the utility of fast GPU-to-CPU interconnects for weak scaling on hierarchical nodes without being limited to problem sizes that fit within GPU memory. We show the speedup possible for a new regime of algorithms that traditionally have not benefited from being ported to GPUs because of an insufficient amount of computational work relative to the bytes of data that must be transferred (offload intensity). This new capability is demonstrated with an example of our hierarchical GPU port of UMT, the 51K-line CORAL benchmark application for Lawrence Livermore National Lab's radiation transport code. By overlapping data transfers and using the NVLINK connection between IBM POWER 8 CPUs and NVIDIA P100 GPUs, we demonstrate a speedup that continues even when scaling the problem size well beyond the memory capacity of the GPUs. Scaling to large local domains per MPI process is a necessary step to solving very large problems, and in the case of UMT, large local domains improve the convergence as the number of MPI ranks is weak-scaled.
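The overlap the abstract relies on is the standard CUDA pattern of pipelining asynchronous transfers with kernel execution on multiple streams. The sketch below is hypothetical (it is not UMT's port, and the chunking, kernel, and buffer sizes are made up); it only shows how a domain larger than GPU memory can be streamed through the device.

    #include <cuda_runtime.h>

    // Toy kernel standing in for the real per-chunk work.
    __global__ void process_chunk(double* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0;
    }

    // Stream a host array larger than GPU memory through the device in chunks,
    // overlapping H2D copy, kernel execution, and D2H copy with two streams.
    // host_data must be pinned (cudaMallocHost/cudaHostRegister) for the
    // asynchronous copies to actually overlap with computation.
    void stream_oversized_domain(double* host_data, size_t total, size_t chunk) {
        const int NBUF = 2;                           // double buffering
        cudaStream_t stream[NBUF];
        double*      dev[NBUF];
        for (int s = 0; s < NBUF; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc(&dev[s], chunk * sizeof(double));
        }
        int s = 0;
        for (size_t off = 0; off < total; off += chunk, s = (s + 1) % NBUF) {
            size_t n = (off + chunk <= total) ? chunk : (total - off);
            // In-stream ordering guarantees dev[s] is free before it is reused.
            cudaMemcpyAsync(dev[s], host_data + off, n * sizeof(double),
                            cudaMemcpyHostToDevice, stream[s]);
            process_chunk<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(dev[s], n);
            cudaMemcpyAsync(host_data + off, dev[s], n * sizeof(double),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();
        for (int t = 0; t < NBUF; ++t) { cudaFree(dev[t]); cudaStreamDestroy(stream[t]); }
    }

On NVLINK-connected systems such as POWER8 with P100, the much higher host-device bandwidth makes this kind of streaming far less of a penalty than over PCIe, which is the effect the paper exploits.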
Citations: 1
Analyzing the criticality of transient faults-induced SDCs on GPU applications
F. Santos, P. Rech
In this paper we compare the soft-error sensitivity of parallel applications on modern Graphics Processing Units (GPUs) obtained through architectural-level fault injections and high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and use different transient-fault sensitivity metrics, which are hard to combine. We then show how correlating beam and fault-injection data can provide a deeper understanding of the behavior of GPUs in the presence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register-file and instruction-level injections can be used to evaluate ECC efficiency in reducing the radiation-induced error rate.
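As a point of reference, the architecture-level fault model such studies typically assume is a single bit flip in the destination register of an instruction. The CUDA fragment below only illustrates that model (the function and kernel names are made up; real injectors instrument the instruction stream rather than modifying source code).

    #include <cuda_runtime.h>

    // Single-bit-flip fault model on a 32-bit float, as commonly assumed in
    // architecture-level fault injection (illustration only).
    __device__ float flip_bit(float value, int bit /* 0..31 */) {
        unsigned int u = __float_as_uint(value);
        u ^= (1u << bit);
        return __uint_as_float(u);
    }

    // Hypothetical kernel: corrupt one thread's result to emulate a transient
    // fault and observe whether the application output becomes an SDC.
    __global__ void faulty_saxpy(float a, const float* x, float* y, int n,
                                 int fault_thread, int fault_bit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r = a * x[i] + y[i];
            if (i == fault_thread) r = flip_bit(r, fault_bit);   // injected fault
            y[i] = r;
        }
    }

Whether such a flip produces a silent data corruption (SDC), a crash, or no visible effect depends on where in the computation it lands, which is what the kernel-criticality analysis in the paper quantifies.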
Citations: 7
A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations
M. Obersteiner, A. Parra-Hinojosa, M. Heene, H. Bungartz, D. Pflüger
With future exascale computers expected to have millions of compute units distributed among thousands of nodes, system faults are predicted to become more frequent. Fault tolerance will thus play a key role in HPC at this scale. In this paper we focus on solving the 5-dimensional gyrokinetic Vlasov-Maxwell equations using the application code GENE, as it represents a high-dimensional and resource-intensive problem that is a natural candidate for exascale computing. We discuss the Fault-Tolerant Combination Technique, a resilient version of the Combination Technique (a method to increase the discretization resolution of existing PDE solvers). For the first time, we present an efficient, scalable and fault-tolerant implementation of this algorithm for plasma physics simulations based on a manager-worker model and test it under very realistic and pessimistic environments with simulated faults. We show that the Fault-Tolerant Combination Technique - an algorithm-based forward recovery method - can tolerate a large number of faults with a low overhead and at an acceptable loss in accuracy. Our parallel experiments with up to 32k cores show good scalability at a relative parallel efficiency of 93.61%. We conclude that algorithm-based solutions to fault tolerance are attractive for this type of problems.
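For context, the (non-fault-tolerant) sparse grid combination technique approximates a fine full-grid solution by combining many coarse, anisotropic component-grid solutions. Up to indexing conventions, the classical d-dimensional combination formula is

    \[ f_n^{(c)} \;=\; \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|\vec{\ell}\,|_1 = n - q} f_{\vec{\ell}}, \]

where each f_\ell is computed independently on a coarse grid with level vector \ell. Because the component solutions are independent, losing a few of them to faults can be compensated by recomputing the combination coefficients over the surviving grids instead of recomputing the lost solutions, which is the forward-recovery idea behind the fault-tolerant variant discussed here.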
Citations: 9
Dynamic load balancing of massively parallel unstructured meshes
Gerrett Diamond, Cameron W. Smith, M. Shephard
Simulating systems with evolving relational structures on massively parallel computers requires the computational work to be evenly distributed across the processing resources throughout the simulation. Adaptive, unstructured, mesh-based finite element and finite volume tools best exemplify this need. We present EnGPar and its diffusive partition improvement method, which accounts for multiple application-specified criteria. EnGPar's performance is compared against its predecessor, ParMA. Specifically, partition improvement results are provided on up to 512Ki processes of the Argonne Leadership Computing Facility's Mira BlueGene/Q system.
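The quantity diffusive partition improvement tries to drive down is the imbalance of each application-specified entity type (mesh elements, vertices, etc.). A common definition, given here only as the generic notion rather than EnGPar's exact criterion, is

    \[ I \;=\; \frac{\max_{p} w_p}{\tfrac{1}{P} \sum_{p=1}^{P} w_p}, \]

where w_p is the weight assigned to part p of P parts. A perfectly balanced partition has I = 1, and a diffusive step migrates work from parts exceeding a tolerance to less-loaded neighboring parts.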
Citations: 4
Dynamic task discovery in PaRSEC: a data-flow task-based runtime
Reazul Hoque, T. Hérault, G. Bosilca, J. Dongarra
Successfully exploiting distributed collections of heterogeneous many-core architectures with complex memory hierarchies through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intra-node features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute for both local and distributed data-dependency management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, the Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD), in terms of capabilities, overhead, and potential benefits.
Citations: 53
Parallel jaccard and related graph clustering techniques
Alexandre Fender, N. Emad, S. Petiton, Joe Eaton, M. Naumov
In this paper we propose to generalize Jaccard and related measures, often used as similarity coefficients between two sets. We define Jaccard, Dice-Sorensen, and Tversky edge weights on a graph and generalize them to account for vertex weights. We develop an efficient parallel algorithm for computing Jaccard edge and PageRank vertex weights. We highlight that the weight computation can achieve more than a 10X speedup on the GPU versus the CPU on large realistic data sets. Also, we show that finding a minimum balanced cut for the modified weights can be related to minimizing the sum of ratios of the intersection and union of nodes on the boundary of clusters. Finally, we show that the novel weights can improve the quality of graph clustering by about 15% and 80% for multi-level and spectral graph partitioning and clustering schemes, respectively.
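The set-level measures being generalized are standard; applied to a graph edge (u, v) with vertex neighborhoods N(u) and N(v), they read

    \[
    \begin{aligned}
      J(u,v) &= \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} && \text{(Jaccard)} \\[4pt]
      D(u,v) &= \frac{2\,|N(u) \cap N(v)|}{|N(u)| + |N(v)|} && \text{(Dice-Sorensen)} \\[4pt]
      T_{\alpha,\beta}(u,v) &= \frac{|N(u) \cap N(v)|}{|N(u) \cap N(v)| + \alpha\,|N(u) \setminus N(v)| + \beta\,|N(v) \setminus N(u)|} && \text{(Tversky)}
    \end{aligned}
    \]

Tversky recovers Jaccard for \alpha = \beta = 1 and Dice-Sorensen for \alpha = \beta = 1/2. The vertex-weighted generalization in the paper replaces these set cardinalities with sums of vertex weights (e.g., PageRank values) over the corresponding sets; the exact weighted form is defined in the paper itself.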
Citations: 7