Parallel Computing: Latest Articles

NekRS, a GPU-accelerated spectral element Navier–Stokes solver
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-12-01; DOI: 10.1016/j.parco.2022.102982
Paul Fischer, Stefan Kerkemeier, Misun Min, Yu-Hsiang Lan, Malachi Phillips, Thilina Rathnayake, Elia Merzari, Ananias Tomboulides, Ali Karakus, Noel Chalmers, Tim Warburton

The development of NekRS, a GPU-oriented thermal-fluids simulation code based on the spectral element method (SEM), is described. For performance portability, the code is based on the Open Concurrent Compute Abstraction (OCCA) and leverages scalable developments in the SEM code Nek5000 and in libParanumal, a library of high-performance kernels for high-order discretizations and PDE-based miniapps. Critical performance sections of the Navier–Stokes time advancement are addressed. Performance results on several platforms are presented, including scaling to 27,648 V100s on OLCF Summit, for calculations of up to 60B grid points (240B degrees of freedom).

Citations: 53
SGPM: A coroutine framework for transaction processing
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-12-01; DOI: 10.1016/j.parco.2022.102980
Xinyuan Wang, Hejiao Huang

Coroutines can increase program concurrency and processor core utilization. However, when adapted to the coroutine-to-transaction model, existing coroutine packages have the following disadvantages: (1) Additional scheduler threads incur synchronization overhead when the load between scheduler threads and worker threads is unbalanced. (2) Coroutines are swapped out periodically to prevent deadlocks, which increases the conflict rate by adding suspended transactions. (3) Supporting only swap-out functions (yield, await, etc.) does not allow flexible control over when a transaction is swapped in. In this paper, we present SGPM, a coroutine framework for transaction processing. To adapt to the coroutine-to-transaction model, SGPM has the following properties: First, it eliminates scheduler threads and the periodic coroutine switch. Second, it provides a variety of coroutine scheduling strategies so that all types of concurrency control protocols can run reasonably on SGPM. We implement eight well-known concurrency control protocols on SGPM and, in particular, use SGPM to optimize the performance of four wound-wait concurrency control protocols among them: 2PL, SS2PL, Calvin, and EWV. The experimental results demonstrate that, after SGPM optimization, 2PL and SS2PL outperform OCC and MVCC, and the throughput of Calvin and EWV improves by 1.2x and 1.3x, respectively.
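
The coroutine-to-transaction idea can be illustrated with a minimal Python sketch. This is not SGPM's API or scheduling strategy: the names `transaction` and `run_worker`, the single-key lock table, and the round-robin order are all assumptions made for illustration. It only shows a transaction coroutine swapping out on a lock conflict while the worker loop itself, rather than a separate scheduler thread, decides when to swap it back in.

```python
from collections import deque


def transaction(name, key, locks, log):
    """Hypothetical transaction coroutine: yields while its lock is held elsewhere."""
    while locks.get(key) not in (None, name):
        yield key              # swap out; the worker decides when to swap in
    locks[key] = name          # acquire the lock
    log.append(f"{name} locked {key}")
    yield None                 # a switch point in the middle of the transaction
    del locks[key]             # release the lock on commit
    log.append(f"{name} committed")


def run_worker(transactions):
    """Round-robin worker loop (an assumed policy); no separate scheduler thread."""
    ready = deque(transactions)
    while ready:
        txn = ready.popleft()
        try:
            next(txn)          # swap the coroutine in until its next yield
            ready.append(txn)
        except StopIteration:
            pass               # transaction finished


locks, log = {}, []
run_worker([
    transaction("T1", "x", locks, log),
    transaction("T2", "x", locks, log),
])
```

In this toy run, `T2` is suspended while `T1` holds the lock on `x` and resumes only after `T1` commits. A real framework would choose the swap-in time per concurrency control protocol instead of polling in round-robin order.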

Citations: 0
Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-12-01; DOI: 10.1016/j.parco.2022.102973
Lukas Spies, Amanda Bienz, David Moulton, Luke Olson, Andrew Reisner

Exchanging halo data is a common task in modern scientific computing applications, and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports not only simple CPU-only systems but also more complex heterogeneous systems with both CPUs and GPUs. It currently supports both OpenCL and CUDA for communicating with GPGPU devices, and allows for communication between GPGPUs and CPUs. The API allows for drop-in replacement in existing codes and can be used for the communication layer in new codes. This paper provides an overview of the approach taken in Tausch and a performance analysis that demonstrates expected and achieved performance. We highlight the ease of use and performance with three applications: first, Tausch is compared to the halo exchange framework from two Mantevo applications, HPCCG and miniFE; then it is used to replace a legacy halo exchange library in the flexible multigrid solver framework Cedar.
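
For readers unfamiliar with the operation, a halo exchange can be sketched in a few lines of plain Python. This is not Tausch's API: the function `exchange_halos`, the `[ghost | interior | ghost]` layout, and the `width` parameter are assumptions for illustration, and the two subdomains live in one process instead of being sent over MPI or copied between CPU and GPU.

```python
def exchange_halos(left, right, width=1):
    """Copy boundary cells between two neighbouring 1D subdomains.

    Each list is laid out as [ghost | interior | ghost]; `width` is the
    number of ghost cells per side (a hypothetical parameter).
    """
    # The right ghost cells of `left` receive the first interior cells of `right`.
    left[-width:] = right[width:2 * width]
    # The left ghost cells of `right` receive the last interior cells of `left`.
    right[:width] = left[-2 * width:-width]


left = [0.0, 1.0, 2.0, 3.0, 0.0]    # interior cells: 1, 2, 3
right = [0.0, 4.0, 5.0, 6.0, 0.0]   # interior cells: 4, 5, 6
exchange_halos(left, right)
```

After the call, each subdomain holds a copy of its neighbour's boundary value in the adjacent ghost cell, so a stencil update can be applied to every interior cell without further communication.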

Citations: 1
Graph optimization algorithm using symmetry and host bias for low-latency indirect network
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-12-01; DOI: 10.1016/j.parco.2022.102983
Masahiro Nakao, Masaki Tsukamoto, Yoshiko Hanada, Keiji Yamamoto

It is known that an indirect network with a small host-to-host Average Shortest Path Length (h-ASPL) improves overall system performance in a parallel computer system. As a means to discuss such indirect networks in graph theory, the Order/Radix Problem (ORP) has been proposed. ORP involves finding a graph with a minimum h-ASPL that satisfies a given number of hosts and radix. A graph in ORP represents an indirect network and has two types of vertices: host and switch. We propose an optimization algorithm to generate graphs with a sufficiently small h-ASPL. The primary features of the proposed algorithm are the symmetry of the graph and the bias of the hosts adjacent to each switch. These features reduce the computational time to calculate the h-ASPL and improve the search performance of the algorithm. The performance of the proposed algorithm is evaluated using problems presented by Graph Golf, an international ORP competition. Our results show that the proposed algorithm can generate graphs with a smaller h-ASPL than the existing algorithm. To evaluate the performance of the graphs generated by the proposed algorithm, we use the parallel simulation framework SimGrid and the parallel benchmark collection NAS Parallel Benchmarks. Our results also show that the graphs generated by the proposed algorithm have higher performance than those generated by the existing algorithm.
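
The h-ASPL metric itself is simple to compute for a small graph. The sketch below is not the paper's optimization algorithm; it only evaluates the objective by breadth-first search, assuming a connected, unweighted graph given as an adjacency dictionary. The function name `h_aspl` and the toy topology are made up for illustration.

```python
from collections import deque
from itertools import combinations


def h_aspl(adjacency, hosts):
    """Average shortest path length over all unordered host pairs."""
    def bfs(source):
        # Hop counts from `source` to every reachable vertex.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        return dist

    distances = {h: bfs(h) for h in hosts}
    pairs = list(combinations(hosts, 2))
    return sum(distances[a][b] for a, b in pairs) / len(pairs)


# Toy indirect network: two connected switches with two hosts each.
adjacency = {
    "s0": ["s1", "h0", "h1"],
    "s1": ["s0", "h2", "h3"],
    "h0": ["s0"], "h1": ["s0"],
    "h2": ["s1"], "h3": ["s1"],
}
hosts = ["h0", "h1", "h2", "h3"]
value = h_aspl(adjacency, hosts)
```

Here the two same-switch host pairs are 2 hops apart and the four cross-switch pairs are 3 hops apart, so the h-ASPL is 16/6. Running one BFS per host is the cost that graph symmetry can help reduce, since symmetric hosts share the same distance profile.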

Citations: 0
Operational Data Analytics in practice: Experiences from design to deployment in production HPC environments
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-10-01; DOI: 10.1016/j.parco.2022.102950
Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz

As HPC systems continue to grow in scale and complexity, efficient and manageable operation is increasingly critical. For this reason, many centers are starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from the massive amounts of data produced by monitoring systems and use it for enacting control over system knobs, or for aiding administrators through visualization. As ODA is a multi-faceted problem, much research effort has gone into finding solutions to its separate aspects; however, comprehensive solutions to enable production use of ODA are still rare, and accounts of ODA experiences and the associated challenges are even harder to come across.

In this work we aim to bridge the gap between ODA research and production use by presenting our own experiences, associated with proactive control of warm-water inlet temperatures and visualization of job data on two different HPC systems. We cover the entire development process, starting from a description of requirements and challenges, and down to design, deployment and evaluation. Moreover, we discuss a series of critical points related to the maintainability of ODA, and propose action items in an effort to drive the community forward. We rely on a series of open-source tools and techniques, which make for a generic ODA framework that is suitable for most use cases.

Citations: 3
Accelerating communication for parallel programming models on GPU systems
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-10-01; DOI: 10.1016/j.parco.2022.102969
Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, Laxmikant V. Kale

As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe increases in bandwidth of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.

Citations: 2
Optimizing small channel 3D convolution on GPU with tensor core
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-10-01; DOI: 10.1016/j.parco.2022.102954
Jiazhi Jiang, Dan Huang, Jiangsu Du, Yutong Lu, Xiangke Liao

In many scenarios, particularly scientific AI applications, algorithm engineers widely adopt more complex convolutions, e.g. 3D CNNs, to improve accuracy. Scientific AI applications with 3D CNNs, which tend to train on volumetric datasets, substantially increase the size of the input, which in turn potentially restricts the channel sizes (e.g. to less than 64) under the constraint of limited device memory capacity. Since existing convolution implementations tend to split and parallelize the computation of small channel convolution along the channel dimension, they usually cannot fully exploit the performance of GPU accelerators, in particular those equipped with the emerging tensor cores.

In this work, we aim to enhance the performance of small channel 3D convolution on GPU platforms equipped with tensor cores. Our analysis shows that the channel size of the convolution has a great effect on the performance of existing convolution implementations, which are memory-bound on tensor cores. By leveraging the memory hierarchy characteristics and the WMMA API of tensor cores, we propose and implement holistic optimizations that both improve data access efficiency and increase the utilization of computing units. Experiments show that our implementation obtains a 1.1x–5.4x speedup compared to cuDNN's implementations of 3D convolutions on different GPU platforms. We also evaluate our implementation on two practical scientific AI applications and observe overall speedups of up to 1.7x and 2.0x compared with using cuDNN on a V100 GPU.

Citations: 4
Graph optimization algorithm using symmetry and host bias for low-latency indirect network
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-10-01; DOI: 10.2139/ssrn.4048955
M. Nakao, M. Tsukamoto, Y. Hanada, Keiji Yamamoto
Citations: 1
A method for efficient radio astronomical data gridding on multi-core vector processor
IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2022-10-01; DOI: 10.1016/j.parco.2022.102972
Hao Wang, Ce Yu, Jian Xiao, Shanjiang Tang, Yu Lu, Hao Fu, Bo Kang, Gang Zheng, Chenzhou Cui

Gridding is the performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create the correct sky images for further analysis. Like the 2D stencil computation, gridding iteratively updates the output cells by convolution, where the value at each output cell in the space is computed as a weighted sum of neighboring point values. Existing state-of-the-art works have achieved performance improvement of gridding by using multi-core CPUs and GPUs in real-world applications, and their study proved that gridding is a type of scientific computation with high-density computing characteristics. However, low computational performance or high power consumption becomes the main limitation for their processing of large-scale astronomical data. The high-density computing feature of gridding provides opportunities to accelerate it on the multi-core vector processor with vector-SIMD architectures. However, existing works’ (such as those implemented on CPUs or GPUs) task parallelization and data transfer strategies are inefficient to perform gridding directly on the vector processor without any dedicated mapping algorithm.

M-DSP is a multi-core vector processor with a vector-SIMD architecture, designed for the next-generation exascale supercomputer and delivering high performance with ultra-low power consumption. In this paper, we present, for the first time, a novel method to achieve efficient gridding on the M-DSP. Specifically, we propose a gridding workflow designed for vector-SIMD architectures and present a vectorized version of the gridding convolution algorithm to fully exploit the computational power of the M-DSP. In addition, tailored to the processor architecture, we propose task-based parallelization strategies for block and line computing, as well as different data loading strategies, to achieve high parallel performance and high data transfer efficiency. Experimental results show that our work on M-DSP exhibits very competitive performance compared to other methods running on CPUs or GPUs. This demonstrates the efficiency of our method and shows that the vector-SIMD architecture benefits scientific computing with "high-density" characteristics, which can exploit its wide vector cores and achieve higher performance than its competitors.
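
The abstract's description of gridding as a weighted sum of neighboring point values can be illustrated with a minimal scalar sketch. This is not the paper's vectorized M-DSP implementation: the Gaussian kernel, the normalization by the accumulated weight, and the names `grid_cell`, `grid_samples`, `radius`, and `sigma` are all assumptions chosen for illustration.

```python
import math

def grid_cell(gx, gy, samples, radius=1.5, sigma=0.7):
    """Value of one output cell: weighted combination of nearby samples.

    Assumed for illustration: a Gaussian weight and normalization by the
    total weight. Real gridding kernels and normalization may differ.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        d2 = (x - gx) ** 2 + (y - gy) ** 2
        if d2 <= radius ** 2:  # only neighboring points contribute
            w = math.exp(-d2 / (2.0 * sigma ** 2))
            weighted_sum += w * value
            weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0

def grid_samples(samples, width, height, radius=1.5, sigma=0.7):
    """Evaluate every cell of a width x height grid (row-major lists)."""
    return [[grid_cell(gx, gy, samples, radius, sigma)
             for gx in range(width)]
            for gy in range(height)]

# Two samples at the same position average at the nearest cell.
image = grid_samples([(1.0, 1.0, 2.0), (1.0, 1.0, 4.0)], width=4, height=4)
```

Each output cell is independent of the others in this gather form, which is what makes the computation a candidate for SIMD lanes or per-block tasks; how the paper actually maps cells to vector cores is not stated in the abstract.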

Citations: 0
QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU
IF 1.4 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-10-01 DOI: 10.1016/j.parco.2022.102958
Qingxiao Sun, Liu Yi, Hailong Yang, Mingzhen Li, Zhongzhi Luan, Depei Qian

Although GPUs have become indispensable in data centers, meeting Quality of Service (QoS) targets under task consolidation on GPUs is extremely challenging. Previous works mostly rely on static task or resource scheduling and cannot handle QoS violations at runtime. In addition, existing works fail to exploit the computing characteristics of batch tasks, and thus miss opportunities to reduce power consumption while improving GPU utilization. To address these problems, we propose a new runtime mechanism, SMQoS, that dynamically adjusts resource allocation at runtime to meet the QoS of latency-sensitive (LS) tasks and determines the optimal resource allocation for batch tasks to improve GPU utilization and power efficiency. We implement the proposed mechanism on both a simulator (SMQoS) and real GPU hardware (RH-SMQoS). The experimental results show that both SMQoS and RH-SMQoS achieve better QoS for LS tasks and higher throughput for batch tasks than state-of-the-art works. With a hardware extension, SMQoS can further reduce power consumption by power gating idle computing resources.
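
The idea of adjusting resource allocation at runtime to keep an LS task within its QoS target can be sketched as a simple feedback step. This is not the SMQoS policy, which the abstract does not describe: the function `rebalance`, the notion of generic "compute units", and the parameters `step`, `min_ls`, and `slack` are hypothetical and exist only to show the shape of such a controller.

```python
def rebalance(ls_units, measured_latency_ms, target_latency_ms,
              total_units=80, step=4, min_ls=4, slack=0.8):
    """Return the number of compute units to reserve for the LS task.

    Hypothetical policy for illustration:
    - latency above target: shift `step` units from batch work to LS
    - latency well below target (under slack * target): return `step`
      units to batch work to raise utilization
    - otherwise keep the current split
    """
    if measured_latency_ms > target_latency_ms:
        return min(total_units, ls_units + step)
    if measured_latency_ms < slack * target_latency_ms:
        return max(min_ls, ls_units - step)
    return ls_units

# One control step per measurement window; batch tasks get the remainder.
ls_units = rebalance(40, measured_latency_ms=12.0, target_latency_ms=10.0)
batch_units = 80 - ls_units
```

The `slack` band avoids oscillating between two allocations when latency sits just under the target; any units left unassigned under such a scheme would be the natural candidates for the power gating the abstract mentions.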

Citations: 1