
Latest Publications: International Workshop on OpenCL

Toward Evaluating High-Level Synthesis Portability and Performance between Intel and Xilinx FPGAs
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456699
A. Cabrera, Aaron R. Young, Jacob Lambert, Zhili Xiao, Amy An, Seyong Lee, Zheming Jin, Jungwon Kim, J. Buhler, R. Chamberlain, J. Vetter
Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance because traditional gains enabled by Moore’s law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but field-programmable gate arrays (FPGAs) are gaining traction. FPGAs are beneficial because they allow hardware specific to a particular application to be created. However, they are notoriously difficult to program. To this end, two of the main FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher level languages to design FPGA hardware. Although Xilinx kernels can be designed by using C/C++, both Intel and Xilinx support the use of OpenCL C to architect FPGA hardware. However, not much is known about the portability and performance between these two device families other than the fact that it is theoretically possible to synthesize a kernel meant for Intel to Xilinx and vice versa. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies, depending on the construct. Once the minimum amount of modifications is made to create synthesizable hardware for the Xilinx platform, more nontrivial work is needed to improve performance. However, we find that constructs that are known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform’s hardware compiler.
Citations: 2
Profiling Heterogeneous Computing Performance with VTune Profiler
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456678
V. Tsymbal, Alexandr Kurylev
Programming heterogeneous platforms requires a deep understanding of the system architecture at all levels, which helps application designers achieve the best data and work decomposition between the CPU and accelerating hardware such as GPUs. However, in many cases applications are converted from a conventional CPU programming language like C++, or from accelerator-friendly but still low-level languages like OpenCL, and the main problem is to determine which parts of the application would benefit from being offloaded to the GPU. Another is to estimate how much performance one might gain from acceleration on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g., the data transfer tax, task initialization overhead, memory latency, and bandwidth limits. To take those constraints into account, software developers need tooling that collects the right information and produces recommendations for the best design and optimization decisions. In this presentation we introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology for heterogeneous application performance profiling supported by these analyses. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps identify how the CPU interacts with GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU Utilization, Hottest GPU Computing Tasks, task instance counts and timing, kernel Data Transfer Size, SIMD Width measurements, GPU Execution Unit (EU) thread occupancy, and Memory Utilization. Together, these metrics provide a systematic picture of how effectively tasks were offloaded and executed on GPUs. The GPU Hotspots analysis examines the efficiency of computing tasks or kernels running on GPU EUs and interacting with the GPU memory subsystem. Inefficiencies caused by compute kernel implementation or compiler issues result in idling EUs or increased latencies when fetching data from memory sources into EU registers, which ultimately degrades performance. Given the complexity of the GPU memory subsystem (L1 and L2 caches, Shared Local Memory, L3 cache, GPU DRAM, CPU LLC and DRAM), analyzing data access inefficiencies is even more problematic. The GPU Hotspots analysis addresses these problems by presenting a visualization of the current GPU memory hierarchy, detailed data transfer tracing between different memory agents, memory bandwidth measurements, and barrier and atomics analysis. In addition, VTune analyzes each compute kernel at the source level, providing performance metrics against source lines or assembly instructions. Memory latency metrics help pinpoint the least efficient data accesses at the source-line level, complemented by a GPU instruction-count analysis that clarifies the instruction mix of a compiler-generated kernel. GPU analysis in VTune was developed for the OpenCL language and runtime, but the latest SYCL language with its Data Parallel C++ extensions and the Level Zero runtime are supported as well, running on all Intel GPUs from Gen9 HD Graphics to Intel Iris Xe Graphics (a discrete GPU card). The session will present performance analysis results for different GPU architectures. VTune Profiler for GPUs is a newly extended toolset under active development alongside Intel's new accelerator architectures; new features and analysis concepts continue to appear in the tool to meet the needs of software architects and developers.
Citations: 1
International Workshop on OpenCL
Pub Date : 2021-01-01 DOI: 10.1145/3456669
Citations: 1
IWOCL '20: International Workshop on OpenCL, Virtual Event / Munich, Germany, April 27-29, 2020
Pub Date : 2020-01-01 DOI: 10.1145/3388333
Citations: 0
Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664670
M. Abdelfattah, A. Hagiescu, Deshanand P. Singh
Hardware implementation of lossless data compression is important for optimizing the capacity/cost/power of storage devices in data centers, as well as communication channels in high-speed networks. In this work we use the Open Computing Language (OpenCL) to implement high-speed data compression (Gzip) on a field-programmable gate array (FPGA). We show how we make use of a heavily pipelined custom hardware implementation to achieve a high throughput of ~3 GB/s with a compression ratio of more than 2x on standard compression benchmarks. When compared against a highly tuned CPU implementation, the performance-per-watt of our OpenCL FPGA implementation is 12x better and the compression ratio is on par. Additionally, we compare our implementation to a hand-coded commercial implementation of Gzip to quantify the gap between a high-level language like OpenCL and a hardware description language like Verilog. OpenCL performance is 5.3% lower than Verilog's, and the design uses 2% more logic and 25% more of the FPGA's available memory resources, but the productivity gains are significant.
Citations: 114
Evaluation of a performance portable lattice Boltzmann code using OpenCL
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664668
Simon McIntosh-Smith, Dan Curran
With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel's Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area --- structured grid codes --- and investigated techniques exploiting OpenCL to enable performance portability across a diverse range of high-end many-core architectures. In particular we have chosen to investigate 3D lattice Boltzmann codes (D3Q19 BGK). We have developed an OpenCL version of this code in order to provide cross-platform functional portability, and compared the performance of this OpenCL version to optimized native versions on each target platform, including hybrid OpenMP/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for 3D lattice Boltzmann codes, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.
Citations: 17
Generating OpenCL C kernels from OpenACC
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664675
T. Vanderbruggen, John Cavazos
Hardware accelerators are now a common way to improve the performance of compute nodes. This performance improvement has a cost: applications need to be rewritten to take advantage of the new hardware. OpenACC is a set of compiler directives for targeting hardware accelerators with minimal modification of the original application. In this paper, we present the generation of OpenCL C kernels from OpenACC-annotated codes. We introduce a method to produce multiple kernels for each OpenACC compute region. We evaluate these kernels on different hardware accelerators (NVidia GPU, Intel MIC). Finally, we show that the produced kernels give different performance on different accelerators. Hence this method produces a tuning space in which we can search for the best kernel version for a given accelerator.
Citations: 4
KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664673
E. Bardsley, A. Donaldson, John Wickerson
GPUVerify is a static analysis tool for verifying that GPU kernels are free from data races and barrier divergence. It is intended as an automatic tool, but its usability is impaired by the fact that the user must explicitly supply the kernel source code, the number of work-items and work-groups, and preconditions on key kernel arguments. Extracting this information from non-trivial OpenCL applications is laborious and error-prone. We describe an extension to GPUVerify, called KernelInterceptor, that automates the extraction of this information from a given OpenCL application. After the application is recompiled with an additional header file included and linked against an additional library, KernelInterceptor is able to detect each dynamic kernel launch and record the values of the various parameters in a series of log files. GPUVerify can then be invoked to examine these log files and verify each kernel instance. We explain how the interception mechanism works, and comment on the extent to which it improves the usability of GPUVerify.
Citations: 6
clMAGMA: high performance dense linear algebra with OpenCL
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664667
Chongxiao Cao, J. Dongarra, Peng Du, M. Gates, P. Luszczek, S. Tomov
This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates various optimizations, and in general provides the DLA functionality of the popular LAPACK library on heterogeneous architectures. The LAPACK compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portable performance. High performance is obtained through the use of the high-performance OpenCL BLAS, hardware- and OpenCL-specific tuning, and a hybridization methodology, where we split the algorithm into computational tasks of various granularities. Execution of those tasks is efficiently scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
Citations: 32
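The hybridization methodology in the clMAGMA abstract above splits a factorization into tasks of different granularities: small, latency-sensitive panel factorizations run on the CPU, while large, bandwidth- and compute-bound trailing-matrix updates run on the accelerator. The sketch below is a minimal illustration of that decomposition for a right-looking blocked Cholesky factorization — not clMAGMA's actual scheduler; the function name `hybrid_cholesky_schedule` is hypothetical, and the task labels borrow LAPACK/BLAS routine names (POTF2, TRSM, SYRK).

```python
def hybrid_cholesky_schedule(n, nb):
    """Yield (task, device) pairs for a right-looking blocked Cholesky
    factorization of an n-by-n matrix with block size nb.

    Small panel factorizations are latency-bound and go to the CPU;
    the large trailing-submatrix updates are bandwidth/compute-bound
    and go to the GPU (an OpenCL device in clMAGMA's setting).
    """
    for k in range(0, n, nb):
        # Factor the nb-by-nb diagonal panel: small task, CPU.
        yield (("POTF2", k), "cpu")
        if k + nb < n:
            # Triangular solve producing the panel's column block: GPU.
            yield (("TRSM", k), "gpu")
            # Symmetric rank-nb update of the trailing submatrix: GPU.
            yield (("SYRK", k), "gpu")
```

A real scheduler would additionally overlap the CPU panel with the previous GPU update and keep the trailing matrix resident on the device to minimize data movement, as the abstract describes.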
Performance portability study of linear algebra kernels in OpenCL OpenCL中线性代数核的性能可移植性研究
Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664674
K. Rupp, Philippe Tillet, F. Rudolf, J. Weinbub, T. Grasser, A. Jüngel
The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. As a consequence, it is demonstrated that the optimization of a single kernel is often sufficient to obtain good performance for a large class of more complicated operations.
Citations: 7
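A work-size study of the kind described in the abstract above amounts to sweeping a memory-bandwidth-limited kernel over candidate work-group sizes on each device and keeping the fastest configuration. The sketch below is a hypothetical harness, not the paper's benchmark code: an illustrative OpenCL C AXPY kernel string plus host-side helpers (`pick_work_size`, `effective_bandwidth_gbs`, both invented names) for selecting a work size from measured timings and converting them to effective bandwidth.

```python
# Illustrative memory-bandwidth-limited OpenCL C kernel (double-precision AXPY).
# A real harness would build this source and enqueue it via the OpenCL runtime
# with each candidate local work size.
AXPY_KERNEL = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void daxpy(const double alpha,
                    __global const double *x,
                    __global double *y,
                    const int n)
{
    int i = get_global_id(0);
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}
"""

def effective_bandwidth_gbs(bytes_moved, seconds):
    """Effective memory bandwidth in GB/s.

    For AXPY on n doubles, bytes_moved = 3 * 8 * n
    (read x, read y, write y).
    """
    return bytes_moved / seconds / 1e9

def pick_work_size(timings):
    """timings maps a candidate work-group size to its measured runtime
    in seconds; return the size with the lowest runtime."""
    return min(timings, key=timings.get)
```

The paper's finding — that a few work-size configurations perform well across kernels and hardware generations — suggests such a sweep need only be run once per device family rather than per kernel.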