将SU3_Bench微基准测试移植到Intel Arria 10和Xilinx Alveo U280 fpga上的经验

International Workshop on OpenCL Pub Date : 2021-04-27 DOI:10.1145/3456669.3456671

D. Doerfler, Farzad Fatollahi-Fard, Colin MacLean, T. Nguyen, Samuel Williams, N. Wright, Marco Siracusa

{"title":"将SU3_Bench微基准测试移植到Intel Arria 10和Xilinx Alveo U280 fpga上的经验","authors":"D. Doerfler, Farzad Fatollahi-Fard, Colin MacLean, T. Nguyen, Samuel Williams, N. Wright, Marco Siracusa","doi":"10.1145/3456669.3456671","DOIUrl":null,"url":null,"abstract":"In this study we investigate the implications of porting a common computational kernel used in high performance computing, which has been optimized for efficient execution on general purpose graphics processing units (GPUs), to a field programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance, and portability, aspects associated for two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains and to evaluate the level of effort needed to port a GPU optimized code to a FPGA, and the effectiveness of the respective toolchains. We did find the toolchains to be relatively easy to use, and it was possible to get correctness with little effort, but there was significant effort needed to get relatively good performance. We found that FPGAs perform best when using single work item kernels, as opposed to the nominal multiple work item NDRange kernel used for CPUs and GPUs. In addition, other source code changes were necessary, and in particular the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiencies.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"39 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs\",\"authors\":\"D. Doerfler, Farzad Fatollahi-Fard, Colin MacLean, T. Nguyen, Samuel Williams, N. Wright, Marco Siracusa\",\"doi\":\"10.1145/3456669.3456671\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study we investigate the implications of porting a common computational kernel used in high performance computing, which has been optimized for efficient execution on general purpose graphics processing units (GPUs), to a field programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance, and portability, aspects associated for two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains and to evaluate the level of effort needed to port a GPU optimized code to a FPGA, and the effectiveness of the respective toolchains. We did find the toolchains to be relatively easy to use, and it was possible to get correctness with little effort, but there was significant effort needed to get relatively good performance. We found that FPGAs perform best when using single work item kernels, as opposed to the nominal multiple work item NDRange kernel used for CPUs and GPUs. In addition, other source code changes were necessary, and in particular the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiencies.\",\"PeriodicalId\":73497,\"journal\":{\"name\":\"International Workshop on OpenCL\",\"volume\":\"39 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Workshop on OpenCL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3456669.3456671\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3456669.3456671","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在本研究中，我们研究了将用于高性能计算的通用计算内核移植到现场可编程门阵列(FPGA)的含义，该内核已经过优化，可在通用图形处理单元(gpu)上高效执行。特别地，我们使用了基于晶格量子色动力学应用中常用的矩阵-矩阵乘法核的基准。微基准测试基于OpenCL编程语言。我们评估了两种fpga的性能和可移植性，英特尔Arria 10和Xilinx Alveo U280。本研究的目的不是比较两种FPGA，而是评估它们各自的OpenCL工具链，并评估将GPU优化代码移植到FPGA所需的工作量，以及各自工具链的有效性。我们确实发现工具链相对容易使用，并且可以用很少的努力获得正确性，但是需要大量的努力才能获得相对较好的性能。我们发现fpga在使用单个工作项内核时表现最好，而不是用于cpu和gpu的标称多工作项nrange内核。此外，其他源代码更改是必要的，特别是FPGA架构中缺少本地缓存可能需要对代码进行大量重写。英特尔Arria 10实现的性能是其最大持续带宽的47.6%，而Xilinx Alveo U280实现了35.2%。GPU架构显示出75%到90%的架构效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs

In this study we investigate the implications of porting a common computational kernel used in high performance computing, which has been optimized for efficient execution on general purpose graphics processing units (GPUs), to a field programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance, and portability, aspects associated for two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains and to evaluate the level of effort needed to port a GPU optimized code to a FPGA, and the effectiveness of the respective toolchains. We did find the toolchains to be relatively easy to use, and it was possible to get correctness with little effort, but there was significant effort needed to get relatively good performance. We found that FPGAs perform best when using single work item kernels, as opposed to the nominal multiple work item NDRange kernel used for CPUs and GPUs. In addition, other source code changes were necessary, and in particular the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiencies.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Workshop on OpenCL

自引率

0.00%

发文量

期刊最新文献

Improving Performance Portability of the Procedurally Generated High Energy Physics Event Generator MadGraph Using SYCL Acceleration of Quantum Transport Simulations with OpenCL CodePin: An Instrumentation-Based Debug Tool of SYCLomatic An Efficient Approach to Resolving Stack Overflow of SYCL Kernel on Intel® CPUs Ray Tracer based lidar simulation using SYCL