
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops: Latest Publications

NVIDIA Grace Superchip Early Evaluation for HPC Applications
Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani
Arm-based systems in HPC have been a reality for more than a decade. However, a new chip entering the market always implies challenges, not only at the ISA level, but also with regard to SoC integration, the memory subsystem, board integration, node interconnection, and finally the OS and all layers of the system software (compiler and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate their research to a new HPC system, we evaluated five complex scientific applications on engineering-sample nodes of the NVIDIA Grace CPU Superchip and the NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results, showing a speed-up between 1.3× and 4.28× for all codes when compared to the current generation of MareNostrum 4 powered by Intel Skylake CPUs.
{"title":"NVIDIA Grace Superchip Early Evaluation for HPC Applications","authors":"Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani","doi":"10.1145/3636480.3637284","DOIUrl":"https://doi.org/10.1145/3636480.3637284","url":null,"abstract":"Arm-based system in HPC are a reality since more than a decade. However, when a new chip enters the market always implies challenges, not only at ISA level, but also with regards to the SoC integration, the memory subsystem, the board integration, the node interconnection, and finally the OS and all layers of the system software (compiler and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate its scientific research to a new HPC system, we evaluated five complex scientific applications on engineering sample nodes of NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results showing a speed-up between 1.3 × and 4.28 × for all codes when compared to the current generation of MareNostrum 4 powered by Intel Skylake CPUs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling
Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku
Intel oneAPI is a programming framework that supports various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply code written in a single language, DPC++, across this heterogeneous programming environment. In practice, however, it is not easy to target different accelerators, especially non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment that uses single-source DPC++ programming to handle true multi-hetero acceleration, driving an NVIDIA GPU and an Intel FPGA simultaneously. In this paper, we show how this is done and what kinds of applications can be targeted.
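As a rough illustration of the single-source model described in this abstract, the following SYCL/DPC++ sketch (ours, not the authors' code) lists the devices the runtime can see and offloads the same kernel to a GPU queue and an accelerator (FPGA) queue. Device availability, the selector choices, and the half-and-half work split are all assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // List every device the SYCL runtime can see (CPUs, GPUs, FPGA emulator, ...).
  for (const auto& dev : sycl::device::get_devices())
    std::cout << "device: " << dev.get_info<sycl::info::device::name>() << "\n";

  // One queue per accelerator class; SYCL 2020 exposes FPGA-like devices
  // through accelerator_selector_v. Both constructors throw if no matching
  // device is present.
  sycl::queue gpu_q{sycl::gpu_selector_v};
  sycl::queue fpga_q{sycl::accelerator_selector_v};

  constexpr size_t n = 1 << 20;
  std::vector<float> x(n, 1.0f);
  {
    // Give each device its own half of the data so the two kernels can run
    // concurrently instead of being serialized by buffer dependencies.
    sycl::buffer<float> lo(x.data(), sycl::range<1>(n / 2));
    sycl::buffer<float> hi(x.data() + n / 2, sycl::range<1>(n / 2));

    auto scale = [](sycl::queue& q, sycl::buffer<float>& b) {
      q.submit([&](sycl::handler& h) {
        sycl::accessor v(b, h, sycl::read_write);
        h.parallel_for(sycl::range<1>(v.size()),
                       [=](sycl::id<1> i) { v[i] *= 2.0f; });
      });
    };
    scale(gpu_q, lo);   // same DPC++ source, two different backends
    scale(fpga_q, hi);
    gpu_q.wait();
    fpga_q.wait();
  }  // buffer destructors copy the results back into x
  return 0;
}
```

Targeting an NVIDIA GPU from oneAPI additionally requires a CUDA-enabled backend plugin, which is part of the environment construction the paper describes.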
{"title":"Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling","authors":"Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku","doi":"10.1145/3636480.3637220","DOIUrl":"https://doi.org/10.1145/3636480.3637220","url":null,"abstract":"Intel oneAPI is a programming framework that accepts various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply their code written in a single language, DPC++, to this heterogeneous programming environment. However, in practice, it is not easy to apply to different accelerators, especially for non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment set to utilize the single DPC++ programming to handle true multi-hetero acceleration including NVIDIA GPU and Intel FPGA simultaneously. In this paper, we will show how this is done and what kind of applications can be targeted.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"3 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MPI-Adapter2: An Automatic ABI Translation Library Builder for MPI Application Binary Portability
Shinji Sumimoto, Toshihiro Hanawa, Kengo Nakajima
This paper proposes an automatic MPI ABI (Application Binary Interface) translation library builder named MPI-Adapter2. Container-based job environments are becoming widespread in computer centers. However, when a user runs a container image at another computer center, a container holding an MPI binary may not work because of differences in the ABI of the MPI libraries. MPI-Adapter2 builds MPI ABI translation libraries automatically from the MPI libraries themselves. It can build translation libraries not only between different MPI implementations, such as Open MPI, MPICH, and Intel MPI, but also between different versions of the same implementation. We implemented and evaluated MPI-Adapter2 across several versions of Intel MPI, MPICH, MVAPICH, and Open MPI using the NAS parallel benchmarks and pHEAT-3D, and found that it worked correctly except when running an Open MPI ver. 4 binary on Open MPI ver. 2 for the IS benchmark of the NAS parallel benchmarks, because of a difference in MPI object size. We also evaluated the pHEAT-3D binary compiled with Open MPI ver. 5 using MPI-Adapter2 on up to 1,024 processes across 128 nodes. The performance overhead of MPI-Adapter2 relative to native Intel MPI execution was 1.3%.
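To make the ABI gap concrete, the hypothetical, heavily simplified shim below shows the kind of translation such a builder must generate: MPICH-family ABIs encode MPI_Comm as a 32-bit integer (MPI_COMM_WORLD is 0x44000000), whereas Open MPI's MPI_Comm is a pointer into the library. Apart from those documented handle conventions and the real symbols ompi_mpi_comm_world and PMPI_Barrier, every name here is invented for the sketch; MPI-Adapter2 generates such mappings automatically and for the full API, not just one call.

```cpp
// Hypothetical ABI-translation shim: exports an MPICH-style MPI_Barrier and
// forwards it to a host Open MPI library, translating the handle on the way.
#include <dlfcn.h>
#include <cstdint>
#include <unordered_map>

using mpich_comm = std::int32_t;  // application-side handle (an int)
using ompi_comm  = void*;         // host-side handle (a pointer)
using ompi_barrier_fn = int (*)(ompi_comm);

static ompi_comm translate(mpich_comm c) {
  // A real translator fills this table for all predefined handles at
  // MPI_Init time and extends it whenever a communicator is created.
  // Open MPI's MPI_COMM_WORLD is the address of ompi_mpi_comm_world.
  static const std::unordered_map<mpich_comm, ompi_comm> table = {
      {0x44000000, dlsym(RTLD_DEFAULT, "ompi_mpi_comm_world")},
  };
  auto it = table.find(c);
  return it == table.end() ? nullptr : it->second;
}

// Exported with the MPICH-side signature; resolves the host library's
// profiling entry point once and forwards every call through it.
extern "C" int MPI_Barrier(mpich_comm comm) {
  static auto real =
      reinterpret_cast<ompi_barrier_fn>(dlsym(RTLD_NEXT, "PMPI_Barrier"));
  return real(translate(comm));
}
```

The layout differences this sketch glosses over (integer vs. pointer handles, differing struct sizes for objects such as MPI_Status) are exactly the kind of mismatch behind the Open MPI ver. 4 on ver. 2 failure reported above.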
{"title":"MPI-Adapter2: An Automatic ABI Translation Library Builder for MPI Application Binary Portability","authors":"Shinji Sumimoto, Toshihiro Hanawa, Kengo Nakajima","doi":"10.1145/3636480.3637219","DOIUrl":"https://doi.org/10.1145/3636480.3637219","url":null,"abstract":"This paper proposes an automatic MPI ABI (Application Binary Interface) translation library builder named MPI-Adapter2. The container-based job environment is becoming widespread in computer centers. However, when a user uses the container image in another computer center, the container with MPI binary may not work because of the difference in the ABI of MPI libraries. The MPI-Adapter2 enables to building of MPI ABI translation libraries automatically from MPI libraries. MPI-Adapter2 can build MPI ABI translation libraries not only between different MPI implementations, such as Open MPI, MPICH, and Intel MPI but also between different versions of MPI implementation. We have implemented and evaluated MPI-Adapter2 among several versions of Intel MPI, MPICH, MVAPICH, and Open MPI using NAS parallel benchmarks and pHEAT-3D, and found that MPI-Adapter2 worked fine except for Open MPI ver. 4 binary on Open MPI ver. 2 on IS of NAS parallel benchmarks, because of the difference in MPI object size. We also evaluated the pHEAT-3D binary compiled by Open MPI ver.5 using MPI-Adapter2 up to 1024 processes with 128 nodes. The performance overhead between MPI-Adapter2 and Intel native evaluation was 1.3%.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"2 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Error-Energy Tradeoff in Molecular and Molecular-Continuum Fluid Simulations
Amartya Das Sharma, Ruben Horn, Philipp Neumann
Energy consumption plays a crucial role when designing simulation studies. In this work, we take a step towards modelling the relationship between statistical error and energy consumption for molecular and molecular-continuum flow simulations. After revisiting statistical error analysis and run-time complexities for molecular dynamics (MD) simulations, we verify the respective relationships in stand-alone short-range MD simulations. We then extend the analysis to coupled molecular-continuum simulations, including the multi-instance (i.e., MD ensemble averaging) case, and additionally analyse the impact of noise filters. Our findings suggest that Gauss filters can reduce the statistical error to a similar degree as doubling the number of MD instances would. We further use regression to derive an analytical model that predicts energy consumption on our HPC cluster HSUper for achieving simulation results at a prescribed statistical error (or, equivalently, gain in signal-to-noise ratio). All simulations were carried out using the MD software ls1 mardyn and the molecular-continuum coupling tool MaMiCo. However, the derived models are easily transferable to other pieces of software and other HPC platforms.
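The ensemble-averaging results rest on standard statistics: the standard error of a mean over M independent MD instances scales as 1/sqrt(M), so a filter that halves the variance buys about as much accuracy as doubling the instance count, at a very different energy price. The short self-contained simulation below (not the paper's code) checks that scaling empirically.

```cpp
// Empirical check of the 1/sqrt(M) scaling of the ensemble-mean standard
// error that underlies the instances-vs-filter tradeoff discussed above.
#include <cmath>
#include <cstdio>
#include <random>

int main() {
  std::mt19937 rng(42);
  std::normal_distribution<double> noise(0.0, 1.0);  // unit per-instance noise

  for (int m : {1, 2, 4, 8, 16, 32}) {
    const int trials = 20000;  // independent ensembles per instance count
    double sq = 0.0;
    for (int t = 0; t < trials; ++t) {
      double mean = 0.0;
      for (int i = 0; i < m; ++i) mean += noise(rng);
      mean /= m;
      sq += mean * mean;  // true mean is 0, so this accumulates the variance
    }
    std::printf("M=%2d  std.err=%.4f  (theory 1/sqrt(M)=%.4f)\n",
                m, std::sqrt(sq / trials), 1.0 / std::sqrt((double)m));
  }
  return 0;
}
```

In an energy model where cost grows linearly with M, this scaling means halving the statistical error costs roughly four times the energy, which is why variance-reducing filters are attractive.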
{"title":"The Error-Energy Tradeoff in Molecular and Molecular-Continuum Fluid Simulations","authors":"Amartya Das Sharma, Ruben Horn, Philipp Neumann","doi":"10.1145/3636480.3636486","DOIUrl":"https://doi.org/10.1145/3636480.3636486","url":null,"abstract":"Energy consumption plays a crucial role when designing simulation studies. In this work, we take a step towards modelling the relationship between statistical error and energy consumption for molecular and molecular-continuum flow simulations. After revisiting statistical error analysis and run time complexities for molecular dynamics (MD) simulations, we verify the respective relationships in stand-alone short-range MD simulations. We then extend the analysis to coupled molecular-continuum simulations, including the multi-instance (i.e., MD ensemble averaging) case, and additionally analyse the impact of noise filters. Our findings suggest that Gauss filters can reduce the statistical error to a similar degree as doubling the number of MD instances would. We further use regression to derive an analytical energy consumption model that predicts energy consumption on our HPC-cluster HSUper, to achieve simulation results at a prescribed statistical error (or gain in signal-to-noise ratio, respectively). All simulations were carried out using the MD software ls1 mardyn and the molecular-continuum coupling tool MaMiCo. However, the derived models are easily transferable to other pieces of software and other HPC platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Design and Preliminary Evaluation of OpenACC Compiler for FPGA with OpenCL and Stream Processing DSL
Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato
FPGAs have emerged as attractive computing devices in the post-Moore era because of their power efficiency and reconfigurability, even for future high-performance computing. We have designed an OpenACC compiler for FPGAs that generates kernel code using a stream-processing Domain Specific Language (DSL) called SPGen, together with OpenCL. Although FPGA programming has recently improved dramatically thanks to High-Level Synthesis (HLS) frameworks such as OpenCL and HLS C, it is still too difficult for HPC application developers, and directive-based programming models such as OpenACC should be supported for FPGAs as well. OpenCL can be used as a portable intermediate code for OpenACC on FPGAs. However, the hardware generated from OpenCL is not easy to understand and therefore requires expert knowledge. SPGen is a DSL framework for generating stream-processing HDL modules from the description of a dataflow graph. The advantage of our approach is that code generation with SPGen enables more comprehensive low-level optimization in the OpenACC compiler. Preliminary evaluation results show that, for some kernels, the proposed method, which translates OpenACC C code into OpenCL and SPGen codes, can perform lower-level optimization more explicitly than the OpenCL-only method, which translates OpenACC C code into OpenCL code alone. We also observed that the proposed method may consume more resources. However, the implementations of both methods are preliminary, and we believe improved code generation will address problems such as high resource consumption.
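For readers unfamiliar with the translation path, the toy example below pairs a directive-annotated loop of the kind such a compiler consumes with a hand-written OpenCL kernel resembling a plausible translation target; neither is the paper's actual generated code.

```cpp
// Illustration only: OpenACC input on top, a plausible OpenCL output below.

// Input: the application programmer writes an ordinary loop plus directives.
void saxpy(int n, float a, const float* x, float* y) {
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// A straightforward NDRange OpenCL translation. On an FPGA, a deeply
// pipelined single-work-item loop (or a dataflow description, as SPGen
// produces) is usually a better fit than this GPU-style form, which is the
// motivation for the SPGen backend.
static const char* kSaxpyKernel = R"CLC(
__kernel void saxpy(const int n, const float a,
                    __global const float* x, __global float* y) {
  int i = get_global_id(0);
  if (i < n) y[i] = a * x[i] + y[i];
}
)CLC";
```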
{"title":"Design and Preliminary Evaluation of OpenACC Compiler for FPGA with OpenCL and Stream Processing DSL","authors":"Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato","doi":"10.1145/3373271.3373274","DOIUrl":"https://doi.org/10.1145/3373271.3373274","url":null,"abstract":"FPGA has emerged as one of the attractive computing devices in the post-Moore era because of its power efficiency and reconfigurability, even for future high-performance computing. We have designed an OpenACC compiler for FPGA to generate the kernel code by using stream processing Domain Specific Language (DSL) called SPGen, with OpenCL. Although, recently, the programming for FPGA has been improved dramatically by High-Level Synthesis (HLS) frameworks such as OpenCL and HLS C, yet it is still too difficult for HPC application developers, and the directive-based programming models such as OpenACC should be supported even for FPGA. OpenCL can be used as a portable intermediate code for OpenACC for FPGA. However, the generation of hardware from OpenCL is not easy to understand and therefore requires expert knowledge. SPGen is a DSL framework for generating stream processing HDL modules from the description of a dataflow graph. The advantage of our approach is that the code generation with SPGen enables more comprehensive low-level optimization in the OpenACC compiler. The preliminary evaluation results show that, for some kernels, the proposed method, which translates the OpenACC C code into OpenCL and SPGen codes, can perform optimization in the lower level more explicitly than the OpenCL-only method, which translates the OpenACC C code into the OpenCL code only. We also observed that more resources might be consumed in the proposed method. However, implementations of both methods are preliminary. We believe improving code generation will fix the problems such as high resource consumption.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124333703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
The Analysis of Inter-Process Interference on a Hybrid Memory System
S. Imamura, Eiji Yoshida
Persistent memory (PM) is an emerging memory device that has a larger capacity and lower cost per gigabyte than conventional DRAM. Intel has released its first PM product, called Optane™ DC Persistent Memory, but its performance is several times lower than that of DRAM. It will therefore be used in combination with DRAM to configure hybrid memory systems that combine the high performance of DRAM with the large capacity of PM. In this paper, we evaluate and analyze the performance interference between various types of processes executed concurrently on a real server platform with a hybrid memory system. Through evaluation with a synthetic benchmark, we show that interference on the hybrid memory system differs significantly from that on a conventional DRAM-only memory system.
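A minimal probe in the spirit of the paper's synthetic benchmark could look as follows (our sketch, not the authors' benchmark): two threads stream over separate large buffers while one of them is timed; run it with and without the interfering thread and compare. Binding one buffer to persistent memory, e.g. via memkind or numactl, is required for the actual hybrid case and is omitted here.

```cpp
// Toy interference probe: measure effective read bandwidth of one streaming
// thread while a second thread streams over another buffer.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

static double stream_sum(const std::vector<double>& buf, int reps) {
  double s = 0.0;
  for (int r = 0; r < reps; ++r)
    s += std::accumulate(buf.begin(), buf.end(), 0.0);
  return s;
}

int main() {
  const size_t n = size_t(1) << 26;          // 512 MiB per buffer
  const int reps = 8;
  std::vector<double> a(n, 1.0), b(n, 2.0);  // both in DRAM in this sketch

  double other_sum = 0.0;
  auto t0 = std::chrono::steady_clock::now();
  std::thread interferer([&] { other_sum = stream_sum(b, reps); });
  double s = stream_sum(a, reps);            // the measured stream
  auto t1 = std::chrono::steady_clock::now();
  interferer.join();

  double secs = std::chrono::duration<double>(t1 - t0).count();
  std::printf("checksum=%g  ~%.2f GB/s effective read bandwidth\n",
              s + other_sum, double(reps) * n * sizeof(double) / secs / 1e9);
  return 0;
}
```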
{"title":"The Analysis of Inter-Process Interference on a Hybrid Memory System","authors":"S. Imamura, Eiji Yoshida","doi":"10.1145/3373271.3373272","DOIUrl":"https://doi.org/10.1145/3373271.3373272","url":null,"abstract":"Persistent memory (PM) is an emerging memory device that has a larger capacity and lower cost per gigabyte than conventional DRAM. Intel has released a first PM product called Optane™ DC Persistent Memory, but its performance is several times lower than that of DRAM. Therefore, it will be used in combination with DRAM to configure hybrid memory systems that can obtain both the high performance of DRAM and large capacity of PM. In this paper, we evaluate and analyze the performance interference between various types of processes that are concurrently executed on a real server platform having a hybrid memory system. Through the evaluation with a synthetic benchmark, we show that the interference on the hybrid memory system is significantly different from that on a conventional DRAM-only memory system.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Parallel Multigrid Method on Multicore/Manycore Clusters
K. Nakajima
The parallel multigrid method is expected to be a useful algorithm in the exascale era because of its scalability. It is widely known that the overhead of the coarse grid solver in the parallel multigrid method is significant if the number of MPI processes is O(10^4) or larger. The author previously proposed the hCGA to avoid this overhead. Recently, the author proposed the AM-hCGA, a further optimized version of the hCGA, and evaluated its performance on the Oakforest-PACS system (OFP) with IHK/McKernel at JCAHPC using up to 2,048 nodes of Intel Xeon Phi (Knights Landing). In the present work, the developed method is also implemented on the Oakbridge-CX system (OBCX) at the University of Tokyo using up to 1,024 nodes (2,048 sockets) of Intel Xeon Platinum 8280 (Cascade Lake). Performance in weak and strong scaling is evaluated for an application simulating 3D groundwater flow through heterogeneous porous media (pGW3D-FVM). Both the hCGA and the AM-hCGA provide excellent performance on OFP and OBCX at larger node counts; in particular, excellent strong-scaling performance was achieved on OBCX.
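The coarse-grid bottleneck and the aggregation idea behind the hCGA can be sketched schematically: below some level of the multigrid hierarchy, the remaining unknowns are gathered onto a small subset of ranks so the coarse solve stops paying collective-latency costs that grow with the full process count. The fragment below is a generic illustration of that pattern, not the author's implementation; the group size of 64 and all names are placeholders.

```cpp
// Schematic coarse-grid aggregation: shrink the communicator before the
// coarse-level solve instead of running it on all P ranks.
#include <mpi.h>
#include <vector>

void coarse_solve(MPI_Comm world, std::vector<double>& coarse_unknowns) {
  int rank;
  MPI_Comm_rank(world, &rank);

  const int agg = 64;  // placeholder: ranks kept active on coarse levels
  int color = (rank < agg) ? 0 : MPI_UNDEFINED;

  MPI_Comm coarse_comm;  // becomes MPI_COMM_NULL on ranks with MPI_UNDEFINED
  MPI_Comm_split(world, color, rank, &coarse_comm);

  if (coarse_comm != MPI_COMM_NULL) {
    // ... redistribute coarse_unknowns onto the small group, run the coarse
    // cycles inside coarse_comm, scatter the correction back (elided) ...
    MPI_Comm_free(&coarse_comm);
  }
  (void)coarse_unknowns;  // data movement elided in this sketch
  MPI_Barrier(world);     // all ranks rejoin for prolongation on fine levels
}
```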
{"title":"Parallel Multigrid Method on Multicore/Manycore Clusters","authors":"K. Nakajima","doi":"10.1145/3373271.3373273","DOIUrl":"https://doi.org/10.1145/3373271.3373273","url":null,"abstract":"Parallel multigrid method is expected to be a useful algorithm in exascale era because of its scalability. It is widely known that overhead of coarse grid solver in parallel multigrid method is significant, if the number of MPI processes is O(104) or larger. The author proposed the hCGA for avoiding such overhead. Recently, the AM-hCGA, further optimized version of the hCGA, was proposed by the author, and its performance was evaluated on the Oakforest-PACS system (OFP) with IHK/McKernel at JCAHPC using up to 2,048 nodes of Intel Xeon Phi (Knights Landing). In the present work, developed method is also implemented to the Oakbridge-CX system (OBCX) at the University of Tokyo using up to 1,024 nodes (2,048 sockets) of Intel Xeon Platinum 8280 (Cascade Lake). Performance in weak and strong scaling are evaluated for application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM). The hCGA and the AM-hCGA provide excellent performance on both of OFP and OBCX with larger number of nodes. Especially, it achieved excellent performance in strong scaling on OBCX.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"82 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125909921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku
Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have improved drastically in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, FPGA vendors have developed and offered OpenCL toolchains that reduce the programming effort required for FPGA development. These improvements open the possibility of a concept that offloads, on the fly, computational loads at which CPUs and GPUs perform poorly compared to FPGAs, while moving data with low latency. We consider this concept key to improving the performance of heterogeneous supercomputers that use accelerators such as GPUs. In this paper, we propose an approach for GPU-FPGA accelerated computing within the OpenCL programming framework, based on an OpenCL-enabled GPU-FPGA DMA method and an FPGA-to-FPGA communication method. Experimental results demonstrate that our proposed method enables GPUs and FPGAs to work together across different nodes.
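For contrast with the proposed low-latency paths, the sketch below (assumptions throughout: platform ordering, no error handling) shows the baseline way of driving a GPU and an FPGA from one process with plain OpenCL, staging data through host memory. It is exactly this host round trip that an OpenCL-enabled GPU-FPGA DMA method removes.

```cpp
// Baseline GPU->host->FPGA data path with two OpenCL platforms in one process.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <vector>

int main() {
  cl_platform_id plats[4];
  cl_uint nplat = 0;
  clGetPlatformIDs(4, plats, &nplat);

  // Assumption: platform 0 exposes the GPU, platform 1 the FPGA board.
  cl_device_id gpu, fpga;
  clGetDeviceIDs(plats[0], CL_DEVICE_TYPE_GPU, 1, &gpu, nullptr);
  clGetDeviceIDs(plats[1], CL_DEVICE_TYPE_ACCELERATOR, 1, &fpga, nullptr);

  cl_context gctx = clCreateContext(nullptr, 1, &gpu, nullptr, nullptr, nullptr);
  cl_context fctx = clCreateContext(nullptr, 1, &fpga, nullptr, nullptr, nullptr);
  cl_command_queue gq = clCreateCommandQueue(gctx, gpu, 0, nullptr);
  cl_command_queue fq = clCreateCommandQueue(fctx, fpga, 0, nullptr);

  const size_t bytes = size_t(1) << 20;
  std::vector<char> staging(bytes);  // the host bounce buffer
  cl_mem gbuf = clCreateBuffer(gctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);
  cl_mem fbuf = clCreateBuffer(fctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);

  // Two blocking copies through host memory: the latency this work attacks.
  clEnqueueReadBuffer(gq, gbuf, CL_TRUE, 0, bytes, staging.data(),
                      0, nullptr, nullptr);
  clEnqueueWriteBuffer(fq, fbuf, CL_TRUE, 0, bytes, staging.data(),
                       0, nullptr, nullptr);

  clReleaseMemObject(gbuf);   clReleaseMemObject(fbuf);
  clReleaseCommandQueue(gq);  clReleaseCommandQueue(fq);
  clReleaseContext(gctx);     clReleaseContext(fctx);
  return 0;
}
```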
{"title":"OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku","doi":"10.1145/3373271.3373275","DOIUrl":"https://doi.org/10.1145/3373271.3373275","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have drastically improved in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL that reduce the amount of programming effort required have been developed and offered by FPGA vendors. These improvements reveal the possibility of implementing a concept that enables on-the-fly offloading of computational loads at which CPUs/GPUs perform poorly compared to FPGAs while moving data with low latency. We think that this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU. In this paper, we propose an approach for GPU--FPGA accelerated computing with the OpenCL programming framework that is based on the OpenCL-enabled GPU--FPGA DMA method and the FPGA-to-FPGA communication method. The experimental results demonstrate that our proposed method can enable GPUs and FPGAs to work together over different nodes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115160620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops
{"title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","authors":"","doi":"10.1145/3373271","DOIUrl":"https://doi.org/10.1145/3373271","url":null,"abstract":"","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"204 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133941326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1