Toward Evaluating High-Level Synthesis Portability and Performance between Intel and Xilinx FPGAs
A. Cabrera, Aaron R. Young, Jacob Lambert, Zhili Xiao, Amy An, Seyong Lee, Zheming Jin, Jungwon Kim, J. Buhler, R. Chamberlain, J. Vetter
Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance because the traditional gains enabled by Moore's law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but field-programmable gate arrays (FPGAs) are gaining traction. FPGAs are beneficial because they allow hardware specific to a particular application to be created. However, they are notoriously difficult to program. To this end, two of the main FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher-level languages to design FPGA hardware. Although Xilinx kernels can be designed using C/C++, both Intel and Xilinx support the use of OpenCL C to architect FPGA hardware. However, little is known about the portability and performance between these two device families beyond the fact that it is theoretically possible to synthesize a kernel written for Intel on a Xilinx device and vice versa. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the modifications necessary to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies depending on the construct. Once the minimal modifications are made to create synthesizable hardware for the Xilinx platform, more nontrivial work is needed to improve performance. However, we find that constructs known to be performant on an FPGA should improve performance regardless of the platform; the difficulty lies in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform's hardware compiler.
{"title":"Toward Evaluating High-Level Synthesis Portability and Performance between Intel and Xilinx FPGAs","authors":"A. Cabrera, Aaron R. Young, Jacob Lambert, Zhili Xiao, Amy An, Seyong Lee, Zheming Jin, Jungwon Kim, J. Buhler, R. Chamberlain, J. Vetter","doi":"10.1145/3456669.3456699","DOIUrl":"https://doi.org/10.1145/3456669.3456699","url":null,"abstract":"Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance because traditional gains enabled by Moore’s law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but field-programmable gate arrays (FPGAs) are gaining traction. FPGAs are beneficial because they allow hardware specific to a particular application to be created. However, they are notoriously difficult to program. To this end, two of the main FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher level languages to design FPGA hardware. Although Xilinx kernels can be designed by using C/C++, both Intel and Xilinx support the use of OpenCL C to architect FPGA hardware. However, not much is known about the portability and performance between these two device families other than the fact that it is theoretically possible to synthesize a kernel meant for Intel to Xilinx and vice versa. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies, depending on the construct. Once the minimum amount of modifications is made to create synthesizable hardware for the Xilinx platform, more nontrivial work is needed to improve performance. However, we find that constructs that are known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform’s hardware compiler.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84897389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Profiling Heterogeneous Computing Performance with VTune Profiler
V. Tsymbal, Alexandr Kurylev
Programming heterogeneous platforms requires a deep understanding of the system architecture at all levels, which helps application designers choose the best data and work decomposition between the CPU and accelerating hardware such as GPUs. In many cases, however, applications are being converted from a conventional CPU programming language such as C++, or from an accelerator-friendly but still low-level language such as OpenCL, and the main problem is determining which parts of the application would benefit from being offloaded to the GPU. Another problem is estimating how much performance one might gain from acceleration on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g., data transfer costs, task initialization overhead, and memory latency and bandwidth limitations. To account for these constraints, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions. In this presentation we introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology for heterogeneous application performance profiling supported by these analyses. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps identify how the CPU interacts with GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU utilization, hottest GPU computing tasks, task instance counts and timing, kernel data transfer sizes, SIMD width measurements, GPU execution unit (EU) thread occupancy, and memory utilization. Together, these metrics give a systematic picture of how effectively tasks were offloaded to and executed on GPUs. The GPU Hotspots analysis examines the efficiency of computing tasks or kernels running on GPU EUs and their interaction with the GPU memory subsystem. Inefficiencies caused by compute kernel implementation or compiler issues result in idle EUs or increased latencies when fetching data from memory into EU registers, eventually degrading performance. Given the complexity of the GPU memory subsystem (L1 and L2 caches, shared local memory, L3 cache, GPU DRAM, CPU LLC and DRAM), analyzing data access inefficiencies is even more problematic. The GPU Hotspots analysis addresses these problems by presenting a visualization of the current GPU memory hierarchy, detailed tracing of data transfers between memory agents, memory bandwidth measurements, and barrier and atomics analysis. In addition, VTune analyzes each compute kernel at the source level, providing performance metrics against source lines or assembly instructions.
Memory latency metrics help pinpoint the most inefficient data accesses at the source-line level, complemented by a GPU instruction count analysis with an instruction-set breakdown of the compiler-generated kernel. GPU analysis in VTune was originally developed for the OpenCL language and runtime, but the latest SYCL language and its Data Parallel C++ extensions, as well as the Level Zero runtime, are also supported, running on all Intel GPUs from Gen9 HD Graphics to Intel Iris Xe Graphics (a discrete GPU card). The session will present performance analysis results on different GPU architectures. VTune Profiler for GPUs is a newly extended toolset under active development alongside Intel's new accelerator architectures; new features and analysis concepts continually appear in the tool to meet the needs of software architects and developers.
{"title":"Profiling Heterogeneous Computing Performance with VTune Profiler","authors":"V. Tsymbal, Alexandr Kurylev","doi":"10.1145/3456669.3456678","DOIUrl":"https://doi.org/10.1145/3456669.3456678","url":null,"abstract":"Programming of heterogeneous platforms requires deep understanding of system architecture on all levels, which help applications design to leveraging the best data and work decomposition between CPU and an accelerating hardware like GPUs. However, in many cases the applications are being converted form a conventional CPU programming language like C++, or from accelerator friendly but still low level languages like OpenCL, and the main problem is to determine which part of the application is leveraging from being offloaded to GPU. Another problem is to estimate, how much performance increase one might gain due to the accelerating in the particular GP GPU device. Each platform has its unique limitations that are affecting performance of offloaded computing tasks, e.g. data transfer tax, task initialization overhead, memory latency and bandwidth limitations. In order to take into account those constraints, software developers need tooling for collecting right information and producing recommendations to make the best design and optimization decisions. In this presentation we will introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology of heterogeneous applications performance profiling supported by the analyses. VTune Profiler is a well-known tool for performance characterization on CPUs, now it includes GPU Offload Analysis and GPU Hotspots Analysis of applications written on most offloading models with OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps to identify how CPU is interacting with GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU Utilization, Hottest GPU Computing Tasks, Tasks instance count and timing, kernel Data Transfer Size, SIMD Width measurements, GPU Execution Units (EU) threads occupancy, and Memory Utilization. All together the metrics are providing a systematic picture on how effectively tasks were offloaded and executed on GPUs. The GPU Hotspots analysis is intended to examine computing tasks or kernels efficiency running on GPU EUs and interacting with GPU memory subsystem. Inefficiencies that are conditioned by compute kernels implementation or compiler issues are resulting in idling of EUs or increased latencies in data fetching from memory sources to EU registers, which is eventually leading to performance degradation. Due to complexity of GPU memory subsystem (L1, L2 Caches, Shared Local Memory, L3 Cache, GPU DRAM, CPU LLC and DRAM), analyzing data access inefficiencies is even more problematic. The GPU Hotspots analysis is addressing those problems by presenting a visualization of a current GPU Memory Hierarchy Diagram, detailed data transfer tracing between different memory agents, memory bandwidth measurements, barriers and atomics analysis. In addition, VTune is analyzing each compute kernel on a source level, providing performance metrics against source lines or assembly instructions. 
","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84364109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"International Workshop on OpenCL","authors":"","doi":"10.1145/3456669","DOIUrl":"https://doi.org/10.1145/3456669","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91220179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IWOCL '20: International Workshop on OpenCL, Virtual Event / Munich, Germany, April 27-29, 2020","authors":"","doi":"10.1145/3388333","DOIUrl":"https://doi.org/10.1145/3388333","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76080501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL
M. Abdelfattah, A. Hagiescu, Deshanand P. Singh
Hardware implementation of lossless data compression is important for optimizing the capacity/cost/power of storage devices in data centers, as well as communication channels in high-speed networks. In this work we use the Open Computing Language (OpenCL) to implement high-speed data compression (Gzip) on a field-programmable gate array (FPGA). We show how we use a heavily pipelined custom hardware implementation to achieve a high throughput of ~3 GB/s with a compression ratio of more than 2x on standard compression benchmarks. When compared against a highly tuned CPU implementation, the performance-per-watt of our OpenCL FPGA implementation is 12x better and the compression ratio is on par. Additionally, we compare our implementation to a hand-coded commercial implementation of Gzip to quantify the gap between a high-level language like OpenCL and a hardware description language like Verilog. OpenCL performance is 5.3% lower than Verilog, and area is 2% more logic and 25% more of the FPGA's available memory resources, but the productivity gains are significant.
{"title":"Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL","authors":"M. Abdelfattah, A. Hagiescu, Deshanand P. Singh","doi":"10.1145/2664666.2664670","DOIUrl":"https://doi.org/10.1145/2664666.2664670","url":null,"abstract":"Hardware implementation of lossless data compression is important for optimizing the capacity/cost/power of storage devices in data centers, as well as communication channels in high-speed networks. In this work we use the Open Computing Language (OpenCL) to implement high-speed data compression (Gzip) on a field-programmable gate-arrays (FPGA). We show how we make use of a heavily-pipelined custom hardware implementation to achieve the high throughput of ~3 GB/s with more than 2x compression ratio over standard compression benchmarks. When compared against a highly-tuned CPU implementation, the performance-per-watt of our OpenCL FPGA implementation is 12x better and compression ratio is on-par. Additionally, we compare our implementation to a hand-coded commercial implementation of Gzip to quantify the gap between a high-level language like OpenCL, and a hardware description language like Verilog. OpenCL performance is 5.3% lower than Verilog, and area is 2% more logic and 25% more of the FPGA's available memory resources but the productivity gains are significant.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"201 1","pages":"4:1-4:9"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73207777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of a performance portable lattice Boltzmann code using OpenCL
Simon McIntosh-Smith, Dan Curran
With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel's Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area --- structured grid codes --- and investigated techniques exploiting OpenCL to enable performance portability across a diverse range of high-end many-core architectures. In particular we have chosen to investigate 3D lattice Boltzmann codes (D3Q19 BGK). We have developed an OpenCL version of this code in order to provide cross-platform functional portability, and compared the performance of this OpenCL version to optimized native versions on each target platform, including hybrid OpenMP/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for 3D lattice Boltzmann codes, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.
{"title":"Evaluation of a performance portable lattice Boltzmann code using OpenCL","authors":"Simon McIntosh-Smith, Dan Curran","doi":"10.1145/2664666.2664668","DOIUrl":"https://doi.org/10.1145/2664666.2664668","url":null,"abstract":"With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel's Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area --- structured grid codes --- and investigated techniques exploiting OpenCL to enable performance portability across a diverse range of high-end many-core architectures. In particular we have chosen to investigate 3D lattice Boltzmann codes (D3Q19 BGK). We have developed an OpenCL version of this code in order to provide cross-platform functional portability, and compared the performance of this OpenCL version to optimized native versions on each target platform, including hybrid OpenMP/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for 3D lattice Boltzmann codes, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"28 1","pages":"2:1-2:12"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88601693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating OpenCL C kernels from OpenACC
T. Vanderbruggen, John Cavazos
Hardware accelerators are now a common way to improve the performance of compute nodes. This performance improvement has a cost: applications need to be rewritten to take advantage of the new hardware. OpenACC is a set of compiler directives to target hardware accelerators with minimal modification of the original application. In this paper, we present the generation of OpenCL C kernels from OpenACC-annotated codes. We introduce a method to produce multiple kernels for each OpenACC compute region. We evaluate these kernels on different hardware accelerators (NVIDIA GPU, Intel MIC). Finally, we show that the produced kernels perform differently on different accelerators. Hence this method produces a tuning space in which we can search for the best kernel version for a given accelerator.
{"title":"Generating OpenCL C kernels from OpenACC","authors":"T. Vanderbruggen, John Cavazos","doi":"10.1145/2664666.2664675","DOIUrl":"https://doi.org/10.1145/2664666.2664675","url":null,"abstract":"Hardware accelerators are now a common way to improve the performances of compute nodes. This performance improvement has a cost: applications need to be rewritten to take advantage of the new hardware. OpenACC is a set of compiler directives to target hardware accelerators with minimal modification of the original application. In this paper, we present the generation of OpenCL C kernels from OpenACC annotated codes. We introduce a method to produce multiple kernels for each OpenACC compute region.\u0000 We evaluate these kernels on different hardware accelerators (NVidia GPU, Intel MIC). Finally, we show that the produced kernels give different performances for different accelerators. Hence this method produces a tuning space in which we can search for the best kernel version for a given accelerator.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"42 1","pages":"9:1-9:10"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88777392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters
E. Bardsley, A. Donaldson, John Wickerson
GPUVerify is a static analysis tool for verifying that GPU kernels are free from data races and barrier divergence. It is intended as an automatic tool, but its usability is impaired by the fact that the user must explicitly supply the kernel source code, the number of work-items and work-groups, and preconditions on key kernel arguments. Extracting this information from non-trivial OpenCL applications is laborious and error-prone. We describe an extension to GPUVerify, called KernelInterceptor, that automates the extraction of this information from a given OpenCL application. After the application is recompiled with an additional header file included and linked against an additional library, KernelInterceptor is able to detect each dynamic kernel launch and record the values of the various parameters in a series of log files. GPUVerify can then be invoked to examine these log files and verify each kernel instance. We explain how the interception mechanism works and comment on the extent to which it improves the usability of GPUVerify.
{"title":"KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters","authors":"E. Bardsley, A. Donaldson, John Wickerson","doi":"10.1145/2664666.2664673","DOIUrl":"https://doi.org/10.1145/2664666.2664673","url":null,"abstract":"GPUVerify is a static analysis tool for verifying that GPU kernels are free from data races and barrier divergence. It is intended as an automatic tool, but its usability is impaired by the fact that the user must explicitly supply the kernel source code, the number of work items and work groups, and preconditions on key kernel arguments. Extracting this information from non-trivial OpenCL applications is laborious and error-prone.\u0000 We describe an extension to GPUVerify, called KernelInterceptor, that automates the extraction of this information from a given OpenCL application. After recompiling the application having included an additional header file, and linking with an additional library, KernelInterceptor is able to detect each dynamic kernel launch and record the values of the various parameters in a series of log files. GPUVerify can then be invoked to examine these log files and verify each kernel instance. We explain how the interception mechanism works, and comment on the extent to which it improves the usability of GPUVerify.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"108 1","pages":"7:1-7:5"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77321939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
clMAGMA: high performance dense linear algebra with OpenCL
Chongxiao Cao, J. Dongarra, Peng Du, M. Gates, P. Luszczek, S. Tomov
This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates various optimizations, and in general provides the DLA functionality of the popular LAPACK library on heterogeneous architectures. The LAPACK compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portable performance. High performance is obtained through the use of the high-performance OpenCL BLAS, hardware- and OpenCL-specific tuning, and a hybridization methodology, where we split the algorithm into computational tasks of various granularities. Execution of those tasks is efficiently scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
{"title":"clMAGMA: high performance dense linear algebra with OpenCL","authors":"Chongxiao Cao, J. Dongarra, Peng Du, M. Gates, P. Luszczek, S. Tomov","doi":"10.1145/2664666.2664667","DOIUrl":"https://doi.org/10.1145/2664666.2664667","url":null,"abstract":"This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates various optimizations, and in general provides the DLA functionality of the popular LAPACK library on heterogeneous architectures. The LAPACK compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portable performance. High performance is obtained through the use of the high-performance OpenCL BLAS, hardware- and OpenCL-specific tuning, and a hybridization methodology, where we split the algorithm into computational tasks of various granularities. Execution of those tasks is efficiently scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"4 1","pages":"1:1-1:9"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86114507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance portability study of linear algebra kernels in OpenCL
K. Rupp, Philippe Tillet, F. Rudolf, J. Weinbub, T. Grasser, A. Jüngel
The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. As a consequence, it is demonstrated that the optimization of a single kernel is often sufficient to obtain good performance for a large class of more complicated operations.
{"title":"Performance portability study of linear algebra kernels in OpenCL","authors":"K. Rupp, Philippe Tillet, F. Rudolf, J. Weinbub, T. Grasser, A. Jüngel","doi":"10.1145/2664666.2664674","DOIUrl":"https://doi.org/10.1145/2664666.2664674","url":null,"abstract":"The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. As a consequence, it is demonstrated that the optimization of a single kernel is often sufficient to obtain good performance for a large class of more complicated operations.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"1 1","pages":"8:1-8:11"},"PeriodicalIF":0.0,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74907290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}