Sylkan: Towards a Vulkan Compute Target Platform for SYCL
Peter Thoman, Daniel Gogl, T. Fahringer. https://doi.org/10.1145/3456669.3456683

SYCL is a modern high-level C++ programming interface that excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and it is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door for compliant alternative implementations. In this paper, we discuss the opportunities and challenges of mapping SYCL to Vulkan, a low-level explicit programming model for GPUs. This includes an analysis of the potential semantic mismatches between the two standards, as well as approaches to work around some of these issues. Additionally, we present Sylkan, a prototype research implementation of a SYCL compiler and runtime targeting Vulkan. To evaluate our prototype qualitatively and quantitatively, we chose a variety of functional tests as well as three performance benchmarks. For the functional tests, we discuss and categorize the failures of the current prototype, noting which semantic mismatch or missing implementation feature causes each of them. For the performance benchmarks, we compare execution times against an OpenCL-based SYCL implementation and a native Vulkan version of each benchmark on two hardware platforms.
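The kind of code such an implementation must handle can be illustrated with a minimal, generic SYCL 2020 program (not taken from the paper): a Vulkan-targeting toolchain like Sylkan has to compile the kernel lambda to a SPIR-V compute shader and map the queue and buffers onto Vulkan queues and memory objects.

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
        sycl::queue q;  // on a Vulkan backend, backed by a Vulkan device/queue
        {
            sycl::buffer<float> ba(a.data(), sycl::range<1>(a.size()));
            sycl::buffer<float> bb(b.data(), sycl::range<1>(b.size()));
            sycl::buffer<float> bc(c.data(), sycl::range<1>(c.size()));
            q.submit([&](sycl::handler& h) {
                sycl::accessor va(ba, h, sycl::read_only);
                sycl::accessor vb(bb, h, sycl::read_only);
                sycl::accessor vc(bc, h, sycl::write_only);
                // this lambda is what a Vulkan target must lower to a
                // SPIR-V compute shader
                h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
                    vc[i] = va[i] + vb[i];
                });
            });
        }  // buffer destruction synchronizes results back to the host vectors
        return 0;
    }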
{"title":"Sylkan: Towards a Vulkan Compute Target Platform for SYCL","authors":"Peter Thoman, Daniel Gogl, T. Fahringer","doi":"10.1145/3456669.3456683","DOIUrl":"https://doi.org/10.1145/3456669.3456683","url":null,"abstract":"SYCL is a modern high-level C++ programming interface which excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door for compliant alternative implementations. In this paper, we discuss the opportunities and challenges of mapping SYCL to Vulkan, a low-level explicit programming model for GPUs. This includes an analysis of the potential semantic mismatch between each respective standard, as well as approaches to work around some of these issues. Additionally, we present a prototype research implementation of Sylkan, a SYCL compiler and runtime targeting Vulkan. In order to evaluate our prototype qualitatively and quantitatively, we chose a variety of functional tests as well as three performance benchmarks. For the functional tests, we discuss and categorize the failures of the current prototype, noting which semantic mismatch or missing implementation causes them. For the performance benchmarks, we compare execution times against a OpenCL-based SYCL implementation and a native Vulkan version of each benchmark, on two hardware platforms.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86089096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experiences Supporting DPC++ in AMReX
Sravani Konda, Dunni Aribuki, Weiqun Zhang, K. Gott, C. Lishka. https://doi.org/10.1145/3456669.3456673

AMReX is a software framework for massively parallel, block-structured adaptive mesh refinement (AMR) applications. AMReX is developed as part of the United States Department of Energy's Exascale Computing Project (ECP). Besides AMR capabilities, AMReX also provides a parallel programming framework for numerous applications, including six ECP projects, and it implements several backends for CPU-GPU heterogeneous computing. In this talk, we present our experiences supporting DPC++, a language based on the SYCL specification, as a backend for AMReX. We will demonstrate how AMReX provides an abstraction layer for its users so that they can write performance-portable code for a variety of heterogeneous platforms. We will discuss key DPC++ features that allow AMReX to implement these abstractions, as well as our contributions to the oneAPI specification and Intel's implementation. We will also highlight some features missing in SYCL/DPC++ that limit its efficiency, and our future plans.
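As a rough illustration of the abstraction-layer idea described in the talk, the sketch below shows a ParallelFor-style wrapper over SYCL; the names and signatures are invented for illustration and are not AMReX's actual API. User code written against such a wrapper stays the same when the backend maps it to DPC++/SYCL, CUDA, HIP, or OpenMP.

    #include <sycl/sycl.hpp>

    // Hypothetical portability layer: the application passes a (i,j,k) loop
    // body once; this SYCL backend flattens it onto a 3D kernel launch.
    template <typename F>
    void parallel_for_3d(sycl::queue& q, int nx, int ny, int nz, F f) {
        q.parallel_for(sycl::range<3>(nz, ny, nx), [=](sycl::item<3> it) {
            // recover cell indices from the launch coordinates
            int k = it.get_id(0), j = it.get_id(1), i = it.get_id(2);
            f(i, j, k);
        }).wait();
    }

    // Usage (illustrative): the loop body is backend-agnostic C++.
    // parallel_for_3d(q, nx, ny, nz, [=](int i, int j, int k) {
    //     phi_new[idx(i, j, k)] = phi_old[idx(i, j, k)];
    // });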
{"title":"Experiences Supporting DPC++ in AMReX","authors":"Sravani Konda, Dunni Aribuki, Weiqun Zhang, K. Gott, C. Lishka","doi":"10.1145/3456669.3456673","DOIUrl":"https://doi.org/10.1145/3456669.3456673","url":null,"abstract":"AMReX is a software framework for massively parallel, block-structured adaptive mesh refinement (AMR) applications. AMReX is developed as part of the United States Department Of Energy’s Exascale Computing Project (ECP). Besides AMR capabilities, AMReX also provides a parallel programming framework for numerous applications including six ECP projects, and it implements several backends for CPU-GPU heterogeneous computing. In this talk, we present our experiences supporting DPC++, a language based on the SYCL specification as a backend for AMReX. We will demonstrate how AMReX provides an abstraction layer for its users so that they can write performance portable code for a variety of heterogeneous platforms. We will discuss key DPC++ features that allow AMReX to implement the abstractions and our contributions to the oneAPI specification and Intel’s implementation. We will also highlight some features missing in SYCL/DPC++ that limits its efficiency and our future plans.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81253870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experiences With Adding SYCL Support to GROMACS
Andrey Alekseenko, Szilárd Páll, E. Lindahl. https://doi.org/10.1145/3456669.3456690

GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to the laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed for both portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsics abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform to the established GROMACS codebase, and we share considerations made in porting and optimization. While OpenCL offers the benefit of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus does not receive primary support or tuning efforts. SYCL alleviates many of these issues by employing a single-source model based on modern standard C++. In addition to being the primary platform for Intel GPUs, the possibility of targeting AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level, hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of the GPU acceleration code in GROMACS.
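The single-source advantage mentioned above can be illustrated with a small generic example (not GROMACS code): the kernel is ordinary templated C++17 living in the same translation unit as the host code, whereas the OpenCL equivalent would require a separate C99 kernel string per instantiated type.

    #include <sycl/sycl.hpp>

    // Host and device code share one source file, one type system, and one
    // template instantiation mechanism.
    template <typename T>
    void scale(sycl::queue& q, T* x, T alpha, size_t n) {  // x: USM device memory
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            x[i] *= alpha;  // with OpenCL, this body would live in a separate
                            // C99 kernel string, duplicated for each type T
        }).wait();
    }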
{"title":"Experiences With Adding SYCL Support to GROMACS","authors":"Andrey Alekseenko, Szilárd Páll, E. Lindahl","doi":"10.1145/3456669.3456690","DOIUrl":"https://doi.org/10.1145/3456669.3456690","url":null,"abstract":"GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider the GROMACS’s task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents a challenge of balancing performance (low-level and hardware-specific code) and maintainability (more generalization and code-reuse). We will discuss the limitations of the existing codebase and interoperability layers with regards to adding the new platform; the compute performance and latency comparisons; code quality considerations; and the issues we encountered with SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of GPU acceleration code in GROMACS.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81528769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL
Marcel Breyer, Gregor Daiß, D. Pflüger. https://doi.org/10.1145/3456669.3456692

In the age of data collection, machine learning algorithms have to cope efficiently with vast data sets. This requires scalable algorithms and efficient implementations that can handle heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN algorithm assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones; however, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. One such algorithm is Locality-Sensitive Hashing (LSH), which uses hash tables together with locality-sensitive hash functions to reduce the number of data points that have to be examined to compute the k-NN. To the best of our knowledge, no distributed LSH version supporting multiple GPUs from different vendors has been available so far, despite the fact that k-NNs are frequently employed. Therefore, we have developed such a library. It provides the first hardware-independent, yet efficient and distributed, implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs using the Message Passing Interface (MPI), enabling the use of both shared and distributed memory systems. We have tested and compared different parameter combinations for two locality-sensitive hash function implementations. Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations (ComputeCpp, hipSYCL, and DPC++) to target different hardware architectures without significant performance differences.
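A minimal host-side sketch of the LSH scheme the abstract describes is shown below; it is a generic illustration (plain C++, a single hash table, random-projection hash functions), not the authors' library or its API.

    #include <cmath>
    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    using Point = std::vector<float>;

    // Combine several random-projection hashes h(x) = floor((w.x + b) / r)
    // into one bucket key; nearby points are likely to share a bucket.
    std::size_t hash_point(const Point& x, const std::vector<Point>& w,
                           const std::vector<float>& b, float r) {
        std::size_t key = 0;
        for (std::size_t i = 0; i < w.size(); ++i) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < x.size(); ++d) dot += w[i][d] * x[d];
            long long bucket =
                static_cast<long long>(std::floor((dot + b[i]) / r));
            key = key * 1315423911u + static_cast<std::size_t>(bucket);
        }
        return key;
    }

    // k-NN candidates for a query are only the points sharing its bucket;
    // just these are ranked by exact distance, avoiding the quadratic
    // all-pairs comparison.
    std::vector<int> candidates(
        const Point& q,
        const std::unordered_map<std::size_t, std::vector<int>>& table,
        const std::vector<Point>& w, const std::vector<float>& b, float r) {
        auto it = table.find(hash_point(q, w, b, r));
        return it == table.end() ? std::vector<int>{} : it->second;
    }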
{"title":"Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL","authors":"Marcel Breyer, Gregor Daiß, D. Pflüger","doi":"10.1145/3456669.3456692","DOIUrl":"https://doi.org/10.1145/3456669.3456692","url":null,"abstract":"In the age of data collection, machine learning algorithms have to be able to efficiently cope with vast data sets. This requires scalable algorithms and efficient implementations that can cope with heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. Such an algorithm is the Locality-Sensitive Hashing (LSH) algorithm, which uses hash tables together with locality-sensitive hash functions to reduce the data points that have to be examined to compute the k-NN. To the best of our knowledge, there is no distributed LSH version supporting multiple GPUs from different vendors available so far despite the fact that k-NNs are frequently employed. Therefore, we have developed the library. It provides the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, which is an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs using the Message Passing Interface (MPI) to enable the usage of both shared and distributed memory systems. We have tested different parameter combinations for two locality-sensitive hash function implementations, which we compare. Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations—ComputeCpp, hipSYCL, and DPC++—to target different hardware architectures without significant performance differences.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84888522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
T. Sabino, M. Goli. https://doi.org/10.1145/3456669.3456694

Presented in 1979, BLAS is, to this day, the de-facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in various domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted for a broad range of hardware, from HPC to embedded systems and specialized AI accelerators. While BLAS routines were originally implemented for CPUs, with the emergence of GPGPU they had to be rewritten to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware available, each with a different memory hierarchy, different cache line sizes, different memory access patterns required for performance, different numbers of registers, and different types of memory interconnect, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world. Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric, tile-based TRSM routine for SYCL-BLAS, employing a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve up to 2.6x speedup on an Intel GPU, 7x on an AMD GPU, and up to 3.4x speedup on an ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.
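The formulation the paper builds on can be sketched as a right-looking blocked algorithm in which everything except the small diagonal solves is expressed as GEMM updates; the plain-C++ reference below is illustrative only, with the tuned, tile-parametrized GEMM of SYCL-BLAS replaced by naive loops.

    #include <cstddef>

    // Solve L * X = B in place (L: n x n lower triangular, B/X: n x m,
    // row-major). bs is the tile size that would be tuned per device.
    void trsm_blocked(const float* L, float* B, std::size_t n, std::size_t m,
                      std::size_t bs) {
        for (std::size_t i = 0; i < n; i += bs) {
            std::size_t ib = (i + bs < n) ? bs : n - i;
            // 1) direct forward substitution on the ib x ib diagonal block:
            //    B[i:i+ib, :] <- inv(L[i:i+ib, i:i+ib]) * B[i:i+ib, :]
            for (std::size_t r = 0; r < ib; ++r)
                for (std::size_t c = 0; c < m; ++c) {
                    float s = B[(i + r) * m + c];
                    for (std::size_t k = 0; k < r; ++k)
                        s -= L[(i + r) * n + (i + k)] * B[(i + k) * m + c];
                    B[(i + r) * m + c] = s / L[(i + r) * n + (i + r)];
                }
            // 2) trailing update B[i+ib:n, :] -= L[i+ib:n, i:i+ib] * B[i:i+ib, :];
            //    this is exactly a GEMM, so a tuned GEMM kernel can be
            //    reused here instead of these naive loops.
            for (std::size_t r = i + ib; r < n; ++r)
                for (std::size_t c = 0; c < m; ++c)
                    for (std::size_t k = 0; k < ib; ++k)
                        B[r * m + c] -= L[r * n + (i + k)] * B[(i + k) * m + c];
        }
    }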
{"title":"Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL","authors":"T. Sabino, M. Goli","doi":"10.1145/3456669.3456694","DOIUrl":"https://doi.org/10.1145/3456669.3456694","url":null,"abstract":"Presented in 1979, BLAS is, to this day, the de-facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in various domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, machine learning and adopted for a broad range of hardware from HPC to embedded systems and AI specialized accelerators. While originally BLAS routines have been implemented for CPU, with the emergence of GPGPU, BLAS routines had to be re-written to exploit the provided extensive computational power. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware available, each with a new kind of memory hierarchy, different cache line sizes, and various memory access patterns required for performance, with different number of registers and different type of memory connections, performance portability of BLAS routine across various platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world. Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS by employing a formulation that leverages a highly optimized GEMM routine already provided in SYCL-BLAS. Our results shows that we can achieve up to 2.6x speedup on Intel GPU, 7x on AMD GPU and up to 3.4x speedup on ARM GPU compared with the highly optimized clBLAST and clBLAS libraries by tuning the tile size per-device without reimplementing the kernel.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75768674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs
Tobias Baumann, M. Noack, T. Steinke. https://doi.org/10.1145/3456669.3456698

The Portable Computing Language (PoCL) is a vendor-independent open-source OpenCL implementation that aims to support a variety of compute devices in a single platform. Evaluating PoCL against the Intel OpenCL implementation reveals significant performance drawbacks of PoCL on Intel CPUs, which power 92% of the TOP500 list. Using a selection of benchmarks, we identify and analyse performance issues in PoCL, with a focus on scheduling and vectorisation. We propose a new CPU device driver based on Intel Threading Building Blocks (TBB), and we evaluate LLVM with respect to automatic compiler vectorisation across work-items in PoCL. Using the TBB driver, it is possible to narrow the gap to Intel OpenCL and even outperform it by a factor of up to 1.3× in our proxy application benchmark with a manual vectorisation strategy.
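A conceptual sketch of such a TBB-based driver is given below (not PoCL's actual code): each OpenCL work-group becomes a unit of work for the TBB scheduler, which handles load balancing across CPU cores, while work-items within a group are left to LLVM's automatic vectorisation.

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <cstddef>
    #include <functional>

    // Dispatch all work-groups of an NDRange onto TBB worker threads.
    void run_ndrange(std::size_t num_groups,
                     const std::function<void(std::size_t)>& exec_work_group) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, num_groups),
            [&](const tbb::blocked_range<std::size_t>& r) {
                // each chunk executes a batch of work-groups; within a group,
                // the compiled kernel loops over work-items and is expected
                // to be vectorised across them by LLVM
                for (std::size_t g = r.begin(); g != r.end(); ++g)
                    exec_work_group(g);
            });
    }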
{"title":"Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs","authors":"Tobias Baumann, M. Noack, T. Steinke","doi":"10.1145/3456669.3456698","DOIUrl":"https://doi.org/10.1145/3456669.3456698","url":null,"abstract":"The Portable Computing Language (PoCL) is a vendor independent open-source OpenCL implementation that aims to support a variety of compute devices in a single platform. Evaluating PoCL versus the Intel OpenCL implementation reveals significant performance drawbacks of PoCL on Intel CPUs – which run 92 % of the TOP500 list. Using a selection of benchmarks, we identify and analyse performance issues in PoCL with a focus on scheduling and vectorisation. We propose a new CPU device-driver based on Intel Threading Building Blocks (TBB), and evaluate LLVM with respect to automatic compiler vectorisation across work-items in PoCL. Using the TBB driver, it is possible to narrow the gap to Intel OpenCL and even outperform it by a factor of up to 1.3 × in our proxy application benchmark with a manual vectorisation strategy.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81278940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabling OpenCL and SYCL for RISC-V processors
Rod Burns, Colin Davidson, Aidan Dodds. https://doi.org/10.1145/3456669.3456687

RISC-V International is a non-profit, member-managed organization that is gaining momentum in the processor space, with more than 900 members. One of the organization's goals is to build an open software platform, providing software developers an easy way to harness the familiar benefits already available on CPUs and GPUs. Today, system-on-chip manufacturers are building specialist accelerator processors based on the RISC-V architecture, taking advantage of the vector extensions that match the compute performance mostly seen on GPUs today. The availability of a familiar and well-defined programming model is an absolute requirement for successfully bringing these new processors to market. This presentation will dive into the details of Codeplay's work in partnership with NSI-TEXE and Kyoto Microcomputer, describing the components needed to integrate OpenCL and SYCL onto RISC-V using multiple simulators. This project forms part of Japan's New Energy and Industrial Technology Development Organisation ("NEDO") project to build a powerful supercomputer. While Codeplay has previously enabled OpenCL for a variety of processor architectures, a number of technical challenges are involved in delivering a generic integration that can be used by multiple RISC-V based systems, and the solution required a change in approach. By adding to the existing LLVM back-end for RISC-V and creating an integration layer that plugs into OpenCL, we have built a common base architecture for a range of RISC-V processors from different companies. This presentation will explain how Codeplay's current driver interface works and how it has been adapted to integrate with multiple RISC-V targets, in particular the riscvOVPsim and Spike RISC-V ISA simulators. We will also talk about some of the RISC-V extensions that are available, and how these can help to expose features specific to the RISC-V architecture through OpenCL.
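One standard mechanism for that last point is OpenCL's device extension string: RISC-V-specific capabilities would surface as extension names that applications query with clGetDeviceInfo, as in the generic sketch below (the extension name in the usage comment is hypothetical).

    #include <CL/cl.h>
    #include <string>
    #include <vector>

    // Check whether a device advertises a given OpenCL extension.
    bool has_extension(cl_device_id dev, const std::string& name) {
        size_t size = 0;
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, nullptr, &size);
        std::vector<char> ext(size);
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, size, ext.data(), nullptr);
        return std::string(ext.begin(), ext.end()).find(name)
               != std::string::npos;
    }

    // Usage (hypothetical extension name, for illustration only):
    // if (has_extension(dev, "cl_riscv_vector")) { /* use vector features */ }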
{"title":"Enabling OpenCL and SYCL for RISC-V processors","authors":"Rod Burns, Colin Davidson, Aidan Dodds","doi":"10.1145/3456669.3456687","DOIUrl":"https://doi.org/10.1145/3456669.3456687","url":null,"abstract":"RISC-V is a non-profit, member managed organization and is gaining momentum in the processor space, with more than 900 members. One of the goals of the organization is to build an open software platform, providing software developers an easy way to harness the familiar benefits already available on CPUs and GPUs. Today, system-on-chip manufacturers are building specialist accelerator processors based on the RISC-V architecture, taking advantage of the Vectorized extensions that match compute performance mostly seen on GPUs today. The availability of a familiar and well defined programming model is an absolute requirement if expecting to successfully bring these new processors to market. This presentation will dive into the details of Codeplay’s work in partnership with NSI-TEXE and Kyoto Microcomputer, describing the components needed to integrate OpenCL and SYCL onto RISC-V using multiple simulators. This project forms part of Japan’s New Energy and Industrial Technology Development Organisation (“NEDO”) project to build a powerful supercomputer. While Codeplay has previously enabled OpenCL for a variety processor architectures, there are a number of technical challenges involved in delivering a generic integration that can be used by multiple RISC-V based systems, and the solution required a change in approach. By adding to the existing LLVM back-end for RISC-V, and creating an integration layer that plugs into OpenCL, we have built a common base architecture for a range of RISC-V processors from different companies. This presentation will explain how Codeplay’s current driver interface works, and how it has been adapted to integrate with multiple RISC-V targets, in particular the riscvOVPsim and Spike RISC-V ISA simulators. We will also talk about some of the RISC-V extensions that are available, and how these can help to to expose features specific to the RISC-V architecture through OpenCL.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77197573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Executing Graphs with OpenCL
Erik Tomusk. https://doi.org/10.1145/3456669.3456681

For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries have made graph programming accessible even to novice programmers. Before, a beginner programmer may have talked about writing a number-guessing game; today, the programmer will describe training an off-the-shelf neural network (a type of graph) for handwriting recognition. There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing larger and increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these large, heterogeneous accelerators. Does OpenCL have the expressive power required to describe an execution graph to accelerator hardware, or does OpenCL serialize graphs and execute them sequentially? This technical presentation argues for the former: OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and OpenCL is sufficiently powerful to execute that graph on a graph accelerator. The OpenCL API is designed around the concept of the user enqueuing commands onto the front of a command-queue. Commands include executing kernels (i.e., functions) and reading, writing, and copying data buffers. The OpenCL device driver removes commands from the back of a command-queue, sets up data transfers to and from the accelerator device, and schedules kernels to execute on the device. The command-queue abstraction can encode execution graphs in one of two ways, depending on whether the command-queue is an in-order command-queue or an out-of-order command-queue. An in-order command-queue guarantees that the effects of the enqueued commands will be as if the commands were executed in the order in which they were enqueued. However, the OpenCL device driver is allowed to reorder commands, provided that reordering does not affect the output. For example, if two kernels do not have a data dependency between them, then they can be executed in reverse order or even in parallel, if the driver and hardware support it. An out-of-order command-queue does not guarantee that commands will appear to have been executed in the order in which they were enqueued. Instead, it is the OpenCL API user's responsibility to attach events and event wait lists to commands. When a command finishes executing, it triggers its attached event, and when all the events in a command's event wait list have triggered, that command is allowed to execute. Both types of command-queues are capable of describing execution graphs. For in-order command-queues, the graph …
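The out-of-order encoding described above can be made concrete with a short host-code sketch: a diamond-shaped graph (kernel A feeds B and C, which both feed D) expressed purely through events and wait lists. Context, device, and kernels with their arguments are assumed to have been set up elsewhere, and error handling is omitted for brevity.

    #include <CL/cl.h>

    void enqueue_diamond(cl_context ctx, cl_device_id dev,
                         cl_kernel A, cl_kernel B, cl_kernel C, cl_kernel D) {
        // out-of-order queue: execution order is constrained only by events
        cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
        cl_command_queue q =
            clCreateCommandQueueWithProperties(ctx, dev, props, NULL);

        size_t gws = 1024;
        cl_event evA, evB, evC, evD;
        clEnqueueNDRangeKernel(q, A, 1, NULL, &gws, NULL, 0, NULL, &evA);
        // B and C each wait only on A, so the driver may run them in parallel
        clEnqueueNDRangeKernel(q, B, 1, NULL, &gws, NULL, 1, &evA, &evB);
        clEnqueueNDRangeKernel(q, C, 1, NULL, &gws, NULL, 1, &evA, &evC);
        // D joins both branches through a two-entry event wait list
        cl_event waitBC[2] = {evB, evC};
        clEnqueueNDRangeKernel(q, D, 1, NULL, &gws, NULL, 2, waitBC, &evD);
        clWaitForEvents(1, &evD);

        clReleaseEvent(evA); clReleaseEvent(evB);
        clReleaseEvent(evC); clReleaseEvent(evD);
        clReleaseCommandQueue(q);
    }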
{"title":"Executing Graphs with OpenCL","authors":"Erik Tomusk","doi":"10.1145/3456669.3456681","DOIUrl":"https://doi.org/10.1145/3456669.3456681","url":null,"abstract":"For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries has made graph programming accessible to even novice programmers. Before, a beginner programmer may have talked about writing a number-guessing game; today the programmer will describe training an off-the-shelf neural network—a type of graph—for handwriting recognition. There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing larger and increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these large, heterogeneous accelerators. Does OpenCL have the expressive power required to describe an execution graph to accelerator hardware, or does OpenCL serialize graphs and execute them sequentially? This technical presentation argues that it is the former: OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and OpenCL is sufficiently powerful to execute that graph on a graph accelerator. The OpenCL API is designed around the concept of the user enqueuing commands onto the front of a command-queue. Commands include executing kernels (i.e., functions), and reading, writing, and copying data buffers. The OpenCL device driver removes commands from the back of a command-queue, sets up data transfers to and from the accelerator device, and schedules kernels to execute on the device. The command-queue abstraction can encode execution graphs in one of two ways, depending on whether the command-queue is an in-order command-queue or an out-of-order command-queue. An in-order command-queue guarantees that the effects of the enqueued commands will be as if the commands were executed in the order in which they were enqueued. However, the OpenCL device driver is allowed to reorder commands, provided that reordering does not affect the output. For example, if two kernels do not have a data dependency between them, then they can be executed in reverse order or even in parallel, if the driver and hardware support it. An out-of-order command-queue does not guarantee that commands will appear to have been executed in the order in which they were enqueued. Instead, it is the OpenCL API user’s responsibility to attach events and event wait lists to commands. When a command finishes executing, it triggers its attached event, and when all the events in a command’s event wait list have triggered, then that command is allowed to execute. Both types of command-queues are capable of describing execution graphs. 
For in-order command-queues, the gra","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77169458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SYCL, DPC++, XPUs, oneAPI
J. Reinders. https://doi.org/10.1145/3456669.3456719

James will share his passion for getting to a world of heterogeneous computing where software tooling (compilers, frameworks, libraries, etc.) has an "XPU view" of the world that spans vendors and devices. In this world, James advocates that we all be free to write our programs to use whatever XPUs we want, get full access to all XPU capabilities, and be comfortable trusting our ability to do this without extra risk to performance or stability. James will discuss how SYCL, DPC++, XPUs, and oneAPI are all important on the journey to make this vision a reality. James invites all conference attendees to join in and help guide Intel's enthusiasm so that we all succeed together. Note: James co-authored the first (and, for now, only) book that teaches SYCL 2020 programming.
{"title":"SYCL, DPC++, XPUs, oneAPI","authors":"J. Reinders","doi":"10.1145/3456669.3456719","DOIUrl":"https://doi.org/10.1145/3456669.3456719","url":null,"abstract":"James will share his passion for getting to a world of heterogeneous computing where software tooling (compilers, frameworks, libraries, etc.) all have an “XPU view” of the world that spans vendors and devices. In this world, James advocates that we all be free to write our programs to use whatever XPUs we want, get full access to all XPU capabilities, and be comfortable trusting our ability to do this without extra risk to performance or stability. James will discuss how SYCL, DPC++, XPUs, and oneAPI all are important on our journey to make this vision a reality. James invites all conference attendees to join in and help guide Intel’s enthusiasm to help us all succeed together. Note: James co-authored the first (and only for now) book that teaches SYCL 2020 programming.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83852669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FAST: A framework for high-performance medical image computing and visualization
E. Smistad. https://doi.org/10.1145/3456669.3456717

Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real time and need to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three-dimensional volumes with sizes of up to 512 × 512 × 800 voxels. In digital pathology, whole-slide microscopy images can have an extreme size of up to 200,000 × 100,000 pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing, and visualization methods to handle medical image data. The development of FAST started in 2014; the goal was to create an open-source framework that made GPU and parallel processing of medical images easy and portable. While popular image processing libraries such as the Visualization Toolkit (VTK), the Insight Toolkit (ITK), and OpenCV already existed, their GPU processing capabilities were implemented ad hoc and often implied copying data back and forth between the GPU and CPU. It was therefore decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at its very core. One of the design goals was to remove from the developer the burden of moving data back and forth between different processors and memory spaces. Instead, the developer requests access to the data on a given processor, and FAST copies and updates data as needed. Now, seven years later, FAST version 3.2 is released; it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real time from ultrasound scanners, web cameras, and Intel's RealSense depth camera, and it can read many different formats from disk, including medical formats such as DICOM and MetaImage as well as huge microscopy images stored as tiled image pyramids. FAST uses a processing pipeline concept, meaning that you first define a pipeline as multiple processing and visualization steps, then initiate the processing by executing the pipeline. The advantage of this is that it is easy to change data sources and processing steps: the same pipeline used to process an ultrasound image on disk can be used to process a real-time stream of ultrasound images. Today, FAST pipelines can be created with C++, with Python 3, and even without any programming, using simple text files. The pipeline approach also opens up possibilities for load balancing and tuning based on analyzing pipelines as computational graphs, although this has not yet been implemented. In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but they have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA's TensorRT, Intel's OpenVINO, and Google's TensorFlow. This removes the burden of the user …
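The pipeline concept can be illustrated with a small generic sketch (plain C++, not FAST's actual classes): steps are declared and wired first, and nothing executes until the pipeline is run, which is why the data source can be swapped without touching the downstream steps.

    #include <functional>
    #include <utility>
    #include <vector>

    struct Image { std::vector<float> pixels; };

    // Deferred-execution pipeline: declaring steps does no work; run()
    // pulls data from the source and pushes it through each step in order.
    class Pipeline {
        std::function<Image()> source_;
        std::vector<std::function<Image(Image)>> steps_;
    public:
        void setSource(std::function<Image()> s) { source_ = std::move(s); }
        void addStep(std::function<Image(Image)> f) {
            steps_.push_back(std::move(f));
        }
        Image run() {  // execution happens only here
            Image img = source_();
            for (auto& step : steps_) img = step(std::move(img));
            return img;
        }
    };

    // Usage: the same smoothing/segmentation steps work whether setSource
    // reads an image file from disk or pulls the next frame from a live
    // ultrasound stream.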
{"title":"FAST: A framework for high-performance medical image computing and visualization","authors":"E. Smistad","doi":"10.1145/3456669.3456717","DOIUrl":"https://doi.org/10.1145/3456669.3456717","url":null,"abstract":"Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real-time and needs to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three dimensional volumes with sizes up to 512 × 512 × 800 voxels. In digital pathology, whole slide microscopy images can have an extreme image size of up to 200, 000 × 100, 000 pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing and visualization methods to handle medical image data. The development of FAST started in 2014, the goal was to create an open-source framework which made GPU and parallel processing of medical images easy and portable. While there existed popular image processing libraries such as the visualization toolkit (VTK), insight toolkit (ITK) and OpenCV, the GPU processing capabilities were still implemented ad-hoc and often implied copying data back and forth from the GPU and CPU. Thus it was decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at the very core. One of the design goals was to remove the burden of moving data back and forth from different processors and memory spaces from the developer. Instead, the developer requests access to the data on a given processor, and FAST will copy and update data as needed. Now, seven years later FAST version 3.2 is released, it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real-time from ultrasound scanners, webcameras, Intel’s RealSense depth camera, and read many different formats from disk including medical formats such as DICOM, Metaimage and huge microscopy images stored as tiled image pyramids. FAST uses a processing pipeline concept, meaning that you define a pipeline as multiple processing and visualization steps first, then initiate the processing by executing the pipeline. The advantages of this is that it’s easy to change data sources and processing steps. The same pipeline used to process an ultrasound image on disk, can be used to process a real-time stream of ultrasound images. Today FAST pipelines can be created with C++, Python 3 and even without any programming using simple text files. The pipeline approach also opens up possibilities for load balancing and tuning based on analyzing the pipeline as computational graphs, although this has not yet been implemented. In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. 
This removes the burden of the us","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89270950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}