NVIDIA Grace Superchip Early Evaluation for HPC Applications
Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani
Arm-based systems have been a reality in HPC for more than a decade. However, a new chip entering the market always implies challenges, not only at the ISA level, but also with regard to SoC integration, the memory subsystem, board integration, node interconnection, and finally the OS and all layers of the system software (compilers and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate their scientific research to a new HPC system, we evaluated five complex scientific applications on engineering-sample nodes of the NVIDIA Grace CPU Superchip and the NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results showing a speed-up between 1.3× and 4.28× for all codes when compared to the current generation of MareNostrum 4 powered by Intel Skylake CPUs.
{"title":"NVIDIA Grace Superchip Early Evaluation for HPC Applications","authors":"Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani","doi":"10.1145/3636480.3637284","DOIUrl":"https://doi.org/10.1145/3636480.3637284","url":null,"abstract":"Arm-based system in HPC are a reality since more than a decade. However, when a new chip enters the market always implies challenges, not only at ISA level, but also with regards to the SoC integration, the memory subsystem, the board integration, the node interconnection, and finally the OS and all layers of the system software (compiler and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate its scientific research to a new HPC system, we evaluated five complex scientific applications on engineering sample nodes of NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results showing a speed-up between 1.3 × and 4.28 × for all codes when compared to the current generation of MareNostrum 4 powered by Intel Skylake CPUs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling
Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku
Intel oneAPI is a programming framework that supports various accelerators, such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply code written in a single language, DPC++, across this heterogeneous programming environment. In practice, however, it is not easy to target different accelerators, especially non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment in which a single DPC++ program drives true multi-hetero acceleration, including an NVIDIA GPU and an Intel FPGA simultaneously. In this paper, we show how this is done and what kinds of applications can be targeted.
{"title":"Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling","authors":"Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku","doi":"10.1145/3636480.3637220","DOIUrl":"https://doi.org/10.1145/3636480.3637220","url":null,"abstract":"Intel oneAPI is a programming framework that accepts various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply their code written in a single language, DPC++, to this heterogeneous programming environment. However, in practice, it is not easy to apply to different accelerators, especially for non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment set to utilize the single DPC++ programming to handle true multi-hetero acceleration including NVIDIA GPU and Intel FPGA simultaneously. In this paper, we will show how this is done and what kind of applications can be targeted.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"3 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI-Adapter2: An Automatic ABI Translation Library Builder for MPI Application Binary Portability
Shinji Sumimoto, Toshihiro Hanawa, Kengo Nakajima
This paper proposes an automatic MPI ABI (Application Binary Interface) translation library builder named MPI-Adapter2. Container-based job environments are becoming widespread in computer centers. However, when a user takes a container image to another computer center, a container holding an MPI binary may not work because of differences in the ABI of the MPI libraries. MPI-Adapter2 builds MPI ABI translation libraries automatically from the MPI libraries themselves, not only between different MPI implementations, such as Open MPI, MPICH, and Intel MPI, but also between different versions of the same implementation. We implemented and evaluated MPI-Adapter2 among several versions of Intel MPI, MPICH, MVAPICH, and Open MPI using the NAS Parallel Benchmarks and pHEAT-3D, and found that MPI-Adapter2 worked correctly except for running an Open MPI ver. 4 binary on Open MPI ver. 2 for the IS benchmark of the NAS Parallel Benchmarks, because of a difference in MPI object size. We also evaluated the pHEAT-3D binary compiled with Open MPI ver. 5 using MPI-Adapter2 on up to 1,024 processes across 128 nodes. The performance overhead of MPI-Adapter2 relative to the Intel native run was 1.3%.
{"title":"MPI-Adapter2: An Automatic ABI Translation Library Builder for MPI Application Binary Portability","authors":"Shinji Sumimoto, Toshihiro Hanawa, Kengo Nakajima","doi":"10.1145/3636480.3637219","DOIUrl":"https://doi.org/10.1145/3636480.3637219","url":null,"abstract":"This paper proposes an automatic MPI ABI (Application Binary Interface) translation library builder named MPI-Adapter2. The container-based job environment is becoming widespread in computer centers. However, when a user uses the container image in another computer center, the container with MPI binary may not work because of the difference in the ABI of MPI libraries. The MPI-Adapter2 enables to building of MPI ABI translation libraries automatically from MPI libraries. MPI-Adapter2 can build MPI ABI translation libraries not only between different MPI implementations, such as Open MPI, MPICH, and Intel MPI but also between different versions of MPI implementation. We have implemented and evaluated MPI-Adapter2 among several versions of Intel MPI, MPICH, MVAPICH, and Open MPI using NAS parallel benchmarks and pHEAT-3D, and found that MPI-Adapter2 worked fine except for Open MPI ver. 4 binary on Open MPI ver. 2 on IS of NAS parallel benchmarks, because of the difference in MPI object size. We also evaluated the pHEAT-3D binary compiled by Open MPI ver.5 using MPI-Adapter2 up to 1024 processes with 128 nodes. The performance overhead between MPI-Adapter2 and Intel native evaluation was 1.3%.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"2 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Error-Energy Tradeoff in Molecular and Molecular-Continuum Fluid Simulations
Amartya Das Sharma, Ruben Horn, Philipp Neumann
Energy consumption plays a crucial role when designing simulation studies. In this work, we take a step towards modelling the relationship between statistical error and energy consumption for molecular and molecular-continuum flow simulations. After revisiting statistical error analysis and run-time complexities for molecular dynamics (MD) simulations, we verify the respective relationships in stand-alone short-range MD simulations. We then extend the analysis to coupled molecular-continuum simulations, including the multi-instance (i.e., MD ensemble averaging) case, and additionally analyse the impact of noise filters. Our findings suggest that Gauss filters can reduce the statistical error to a similar degree as doubling the number of MD instances would. We further use regression to derive an analytical energy consumption model that predicts energy consumption on our HPC cluster HSUper to achieve simulation results at a prescribed statistical error (or gain in signal-to-noise ratio, respectively). All simulations were carried out using the MD software ls1 mardyn and the molecular-continuum coupling tool MaMiCo. However, the derived models are easily transferable to other pieces of software and other HPC platforms.
{"title":"The Error-Energy Tradeoff in Molecular and Molecular-Continuum Fluid Simulations","authors":"Amartya Das Sharma, Ruben Horn, Philipp Neumann","doi":"10.1145/3636480.3636486","DOIUrl":"https://doi.org/10.1145/3636480.3636486","url":null,"abstract":"Energy consumption plays a crucial role when designing simulation studies. In this work, we take a step towards modelling the relationship between statistical error and energy consumption for molecular and molecular-continuum flow simulations. After revisiting statistical error analysis and run time complexities for molecular dynamics (MD) simulations, we verify the respective relationships in stand-alone short-range MD simulations. We then extend the analysis to coupled molecular-continuum simulations, including the multi-instance (i.e., MD ensemble averaging) case, and additionally analyse the impact of noise filters. Our findings suggest that Gauss filters can reduce the statistical error to a similar degree as doubling the number of MD instances would. We further use regression to derive an analytical energy consumption model that predicts energy consumption on our HPC-cluster HSUper, to achieve simulation results at a prescribed statistical error (or gain in signal-to-noise ratio, respectively). All simulations were carried out using the MD software ls1 mardyn and the molecular-continuum coupling tool MaMiCo. However, the derived models are easily transferable to other pieces of software and other HPC platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Preliminary Evaluation of OpenACC Compiler for FPGA with OpenCL and Stream Processing DSL
Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato
FPGAs have emerged as attractive computing devices in the post-Moore era, even for future high-performance computing, because of their power efficiency and reconfigurability. We have designed an OpenACC compiler for FPGAs that generates kernel code using a stream-processing Domain Specific Language (DSL) called SPGen, together with OpenCL. Although programming for FPGAs has recently improved dramatically thanks to High-Level Synthesis (HLS) frameworks such as OpenCL and HLS C, it is still too difficult for HPC application developers, so directive-based programming models such as OpenACC should be supported for FPGAs as well. OpenCL can serve as portable intermediate code for OpenACC on FPGAs; however, the hardware generated from OpenCL is not easy to understand and therefore requires expert knowledge. SPGen is a DSL framework for generating stream-processing HDL modules from the description of a dataflow graph. The advantage of our approach is that code generation with SPGen enables more comprehensive low-level optimization in the OpenACC compiler. Preliminary evaluation results show that, for some kernels, the proposed method, which translates OpenACC C code into OpenCL and SPGen codes, can perform lower-level optimization more explicitly than the OpenCL-only method, which translates the OpenACC C code into OpenCL code alone. We also observed that the proposed method may consume more resources. However, both implementations are preliminary, and we believe improved code generation will address problems such as high resource consumption.
{"title":"Design and Preliminary Evaluation of OpenACC Compiler for FPGA with OpenCL and Stream Processing DSL","authors":"Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato","doi":"10.1145/3373271.3373274","DOIUrl":"https://doi.org/10.1145/3373271.3373274","url":null,"abstract":"FPGA has emerged as one of the attractive computing devices in the post-Moore era because of its power efficiency and reconfigurability, even for future high-performance computing. We have designed an OpenACC compiler for FPGA to generate the kernel code by using stream processing Domain Specific Language (DSL) called SPGen, with OpenCL. Although, recently, the programming for FPGA has been improved dramatically by High-Level Synthesis (HLS) frameworks such as OpenCL and HLS C, yet it is still too difficult for HPC application developers, and the directive-based programming models such as OpenACC should be supported even for FPGA. OpenCL can be used as a portable intermediate code for OpenACC for FPGA. However, the generation of hardware from OpenCL is not easy to understand and therefore requires expert knowledge. SPGen is a DSL framework for generating stream processing HDL modules from the description of a dataflow graph. The advantage of our approach is that the code generation with SPGen enables more comprehensive low-level optimization in the OpenACC compiler. The preliminary evaluation results show that, for some kernels, the proposed method, which translates the OpenACC C code into OpenCL and SPGen codes, can perform optimization in the lower level more explicitly than the OpenCL-only method, which translates the OpenACC C code into the OpenCL code only. We also observed that more resources might be consumed in the proposed method. However, implementations of both methods are preliminary. We believe improving code generation will fix the problems such as high resource consumption.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124333703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Analysis of Inter-Process Interference on a Hybrid Memory System
S. Imamura, Eiji Yoshida
Persistent memory (PM) is an emerging memory device that offers larger capacity and lower cost per gigabyte than conventional DRAM. Intel has released the first PM product, called Optane™ DC Persistent Memory, but its performance is several times lower than that of DRAM. It will therefore be used in combination with DRAM to configure hybrid memory systems that combine the high performance of DRAM with the large capacity of PM. In this paper, we evaluate and analyze the performance interference between various types of processes executed concurrently on a real server platform with a hybrid memory system. Through an evaluation with a synthetic benchmark, we show that the interference on the hybrid memory system differs significantly from that on a conventional DRAM-only memory system.
{"title":"The Analysis of Inter-Process Interference on a Hybrid Memory System","authors":"S. Imamura, Eiji Yoshida","doi":"10.1145/3373271.3373272","DOIUrl":"https://doi.org/10.1145/3373271.3373272","url":null,"abstract":"Persistent memory (PM) is an emerging memory device that has a larger capacity and lower cost per gigabyte than conventional DRAM. Intel has released a first PM product called Optane™ DC Persistent Memory, but its performance is several times lower than that of DRAM. Therefore, it will be used in combination with DRAM to configure hybrid memory systems that can obtain both the high performance of DRAM and large capacity of PM. In this paper, we evaluate and analyze the performance interference between various types of processes that are concurrently executed on a real server platform having a hybrid memory system. Through the evaluation with a synthetic benchmark, we show that the interference on the hybrid memory system is significantly different from that on a conventional DRAM-only memory system.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Multigrid Method on Multicore/Manycore Clusters
K. Nakajima
The parallel multigrid method is expected to be a useful algorithm in the exascale era because of its scalability. It is widely known that the overhead of the coarse-grid solver in the parallel multigrid method becomes significant when the number of MPI processes reaches O(10^4) or larger. The author previously proposed the hCGA to avoid this overhead. Recently, the author proposed the AM-hCGA, a further optimized version of the hCGA, and evaluated its performance on the Oakforest-PACS system (OFP) with IHK/McKernel at JCAHPC using up to 2,048 nodes of Intel Xeon Phi (Knights Landing). In the present work, the developed method is also implemented on the Oakbridge-CX system (OBCX) at the University of Tokyo using up to 1,024 nodes (2,048 sockets) of Intel Xeon Platinum 8280 (Cascade Lake). Weak- and strong-scaling performance is evaluated for an application simulating 3D groundwater flow through heterogeneous porous media (pGW3D-FVM). The hCGA and the AM-hCGA provide excellent performance on both OFP and OBCX at larger node counts; in particular, they achieve excellent strong-scaling performance on OBCX.
{"title":"Parallel Multigrid Method on Multicore/Manycore Clusters","authors":"K. Nakajima","doi":"10.1145/3373271.3373273","DOIUrl":"https://doi.org/10.1145/3373271.3373273","url":null,"abstract":"Parallel multigrid method is expected to be a useful algorithm in exascale era because of its scalability. It is widely known that overhead of coarse grid solver in parallel multigrid method is significant, if the number of MPI processes is O(104) or larger. The author proposed the hCGA for avoiding such overhead. Recently, the AM-hCGA, further optimized version of the hCGA, was proposed by the author, and its performance was evaluated on the Oakforest-PACS system (OFP) with IHK/McKernel at JCAHPC using up to 2,048 nodes of Intel Xeon Phi (Knights Landing). In the present work, developed method is also implemented to the Oakbridge-CX system (OBCX) at the University of Tokyo using up to 1,024 nodes (2,048 sockets) of Intel Xeon Platinum 8280 (Cascade Lake). Performance in weak and strong scaling are evaluated for application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM). The hCGA and the AM-hCGA provide excellent performance on both of OFP and OBCX with larger number of nodes. Especially, it achieved excellent performance in strong scaling on OBCX.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"82 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125909921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku
Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have improved drastically in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, FPGA vendors have developed and offered toolchains for FPGA development in OpenCL that reduce the programming effort required. These improvements make it possible to realize a concept that offloads, on the fly, computational loads at which CPUs/GPUs perform poorly compared to FPGAs, while moving data with low latency. We believe this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU. In this paper, we propose an approach for GPU-FPGA accelerated computing with the OpenCL programming framework, based on an OpenCL-enabled GPU-FPGA DMA method and an FPGA-to-FPGA communication method. The experimental results demonstrate that our proposed method enables GPUs and FPGAs to work together across different nodes.
{"title":"OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku","doi":"10.1145/3373271.3373275","DOIUrl":"https://doi.org/10.1145/3373271.3373275","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have drastically improved in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL that reduce the amount of programming effort required have been developed and offered by FPGA vendors. These improvements reveal the possibility of implementing a concept that enables on-the-fly offloading of computational loads at which CPUs/GPUs perform poorly compared to FPGAs while moving data with low latency. We think that this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU. In this paper, we propose an approach for GPU--FPGA accelerated computing with the OpenCL programming framework that is based on the OpenCL-enabled GPU--FPGA DMA method and the FPGA-to-FPGA communication method. The experimental results demonstrate that our proposed method can enable GPUs and FPGAs to work together over different nodes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115160620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","authors":"","doi":"10.1145/3373271","DOIUrl":"https://doi.org/10.1145/3373271","url":null,"abstract":"","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"204 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133941326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}