SYCLops: A SYCL Specific LLVM to MLIR Converter
Alexandre Singer, Frank Gao, Kai-Ting Amy Wang
DOI: 10.1145/3529538.3529992

There is a growing need for higher-level abstractions for device kernels in heterogeneous environments, and the multi-level nature of the MLIR infrastructure perfectly addresses this requirement. As SYCL begins to gain industry adoption for heterogeneous applications and MLIR continues to develop, we present SYCLops: a converter capable of translating SYCL-specific LLVM IR to MLIR. This will allow for both target- and application-specific optimizations within the same framework, exploiting opportunities for improvement present at different levels.

OpenCL Command-buffer Extension: Design and Implementation
Ewan W. Crawford, J. Frankland
DOI: 10.1145/3529538.3529979

OpenCL allows a programmer to offload a sequence of commands to a heterogeneous accelerator, such as a GPU. For embedded devices the overhead of building a command sequence can be expensive, and many applications require the same pipeline of commands to be repeatedly enqueued in a loop; in computer vision, for example, the same command sequence is used to process different image inputs. In OpenCL, command recording is tied to submission: a clEnqueueCommand API invocation both creates a command and schedules it for execution, meaning that for groups of commands enqueued in a loop the cost of building the command sequence is unnecessarily incurred over and over again. An alternative OpenCL API mechanism for defining the command list would remove this overhead from repeated command sequences, regardless of the target OpenCL device. The cl_khr_command_buffer [2] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer, which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are only incurred once. Additionally, optimization opportunities are introduced at the point of finalization, after which no more commands can be recorded and the command-buffer is made ready for execution. After finalization, the command-buffer can be asynchronously dispatched with minimal runtime overhead. This separation of concerns enables pipelined workflows common in machine learning applications by eliminating the latency of having to wait on the host to construct commands again for a similar workload. In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design. This includes a comparison with the Vulkan command-buffer abstraction [4], which shows that this approach is successful in the real world. The design decision to introduce new entry-points, rather than reuse existing command-queue entry-points with begin/end markers, is also covered, as is the reasoning for omitting host-side synchronization mechanisms from the new entry-points. The intended layering of future extensions on top of cl_khr_command_buffer is another topic raised, along with why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer applicable to a wide range of vendors. Plans for the upcoming extensions layered on top will also be outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue, and provide mutability of the command-buffer between submissions. The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta [1], Codeplay's OpenCL implementation, and how this fed

Untangling Modern Parallel Programming Models
M. Kinsner, Ben Ashbaugh, James C. Brodman, G. Lueck, S. Pennycook, Roland Schulz
DOI: 10.1145/3529538.3529987

Modern hardware is increasingly rich in diversity, including CPUs, GPUs, FPGAs and more, with new and novel architectures constantly emerging. To differentiate between these devices, each is typically built around architectures optimized for certain classes of application or certain patterns of parallelism. Numerous computational cores, varying levels of hardware vectorization, and other degrees of architectural freedom exist across the many hardware options. The need to efficiently utilize diverse hardware has led to the emergence of a wide variety of programming models, execution models, and languages, and has simultaneously led to a complex landscape of confused and often conflicting terminology and abstractions. This reality makes it challenging for developers to comprehend and then choose a programming model that fits with their applications and mental model, particularly when more than one target architecture or vendor is of interest. This talk strives to untangle the landscape of modern parallel programming models, to help developers understand how the models and options relate to each other, and to frame how to think about their specific algorithms when expressing them in code. Although experienced developers typically understand much of the terminology and the relationships between models, a holistic presentation of the material is of strong value, as evidenced by feedback from parallel programming experts that have seen previews of this presentation. To begin, a brief overview will be presented to frame parallel programming and offload compute programming models, followed by a characterization of the Single Program Multiple Data (SPMD) abstract model and the power it exhibits when mapping to multiple classes of architecture. We will discuss how fundamental design decisions within a compiler impact the mapping from source code to an underlying programming model, highlighting that the same code can be lowered to multiple models. This is particularly relevant in the presence of vector data types, which permit multiple interpretations and are a common cause of confusion. A core element of the presentation is a decomposition of how the programming model and the design assumptions of a compiler are ideally understood together by developers, to streamline the creation and tuning of performant code. SPMD and explicit Single Instruction Multiple Data (SIMD) programming models will be discussed relative to the Khronos OpenCL and SYCL standards, as well as to OpenMP and CUDA, with the aim of clarifying the concepts and models for developers working in specific languages. The talk will conclude with an overview of an experimental extension to SYCL that proposes a mechanism for mixing SPMD and explicit SIMD programming styles with clear semantics and boundaries in code. The talk will show that providing clear points of transition with clear semantics can enable expert tuning at the granularity of a single line of code, without breaking the SPMD programming a
{"title":"Untangling Modern Parallel Programming Models","authors":"M. Kinsner, Ben Ashbaugh, James C. Brodman, G. Lueck, S. Pennycook, Roland Schulz","doi":"10.1145/3529538.3529987","DOIUrl":"https://doi.org/10.1145/3529538.3529987","url":null,"abstract":"Modern hardware is increasingly rich in diversity, including CPUs, GPUs, FPGAs and more, with new and novel architectures constantly emerging. To provide differentiation between these devices, each is typically built around architectures optimized for some classes of application or some patterns of parallelism. Numerous computational cores, varying levels of hardware vectorization, and other degrees of architectural freedom exist across the many hardware options. The need to efficiently utilize diverse hardware has led to emergence of a wide variety of programming models, execution models, and languages, and has simultaneously led to a complex landscape of confused and often conflicting terminology and abstractions. This reality makes it challenging for developers to comprehend and then choose a programming model that fits with their applications and mental model, particularly when more than one target architecture or vendor is of interest. This talk strives to untangle the landscape of modern parallel programming models, to help developers understand how the models and options relate to each other, and to frame how to think about their specific algorithms when expressing them in code. Although experienced developers typically understand much of the terminology and the relationships between models, a holistic presentation of the material is of strong value, as evidenced by feedback from parallel programming experts that have seen previews of this presentation. To begin, a brief overview will be presented to frame parallel programming and offload compute programming models, followed by characterization of the Single Program Multiple Data (SPMD) abstract model and the power it exhibits when mapping to multiple classes of architecture. We will discuss how fundamental design decisions within a compiler impact the mapping from source code to an underlying programming model, highlighting that the same code can be lowered to multiple models. This is particularly relevant in the presence of vector data types, which permit multiple interpretations and are a common cause of confusion. A core element of the presentation is decomposition of how programming model and design assumptions of a compiler are ideally understood concurrently by developers to streamline the creation and tuning of performant code. SPMD and explicit Single Instruction Multiple Data (SIMD) programming models will be discussed relative to the Khronos OpenCL and SYCL standards, as well as to OpenMP and CUDA, with the aim of clarifying the concepts and models for developers working in specific languages. The talk will conclude with an overview of an experimental extension to SYCL that proposes a mechanism for mixing SPMD and explicit SIMD programming styles with clear semantics and boundaries in code. 
The talk will show that providing clear points of transition with clear semantics can enable expert tuning at the granularity of a single line of code, without breaking the SPMD programming a","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82088295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
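
As a concrete illustration of one point above (our own sketch, not material from the talk): the same vector addition can be written in pure SPMD style, where parallelism comes only from the work-item grid, or with SYCL vector data types, whose lowering to SIMD hardware depends on the compiler's design assumptions. Pointers are assumed to be USM device allocations and n a multiple of 4.

    #include <sycl/sycl.hpp>

    void add_spmd_vs_vec(sycl::queue& q, const float* a, const float* b,
                         float* c, size_t n) {
        // SPMD: scalar per-work-item code; the mapping to threads and SIMD
        // lanes is chosen by the compiler and runtime.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            c[i] = a[i] + b[i];
        }).wait();

        // Vector data types: the same computation with sycl::float4. Whether
        // this becomes one SIMD instruction per work-item or is re-scalarized
        // is a compiler decision -- the ambiguity the talk highlights.
        q.parallel_for(sycl::range<1>{n / 4}, [=](sycl::id<1> i) {
            const size_t base = 4 * i[0];
            sycl::float4 va{a[base], a[base + 1], a[base + 2], a[base + 3]};
            sycl::float4 vb{b[base], b[base + 1], b[base + 2], b[base + 3]};
            const sycl::float4 vc = va + vb;
            for (int k = 0; k < 4; ++k) c[base + k] = vc[k];
        }).wait();
    }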

Tutorial: Application Development with SYCL
Rod Burns, R. Keryell, Igor Vorobtsov, Aksel Alpay, Hugh Delaney, P. Zuzek
DOI: 10.1145/3529538.3530000

Parallel programming with heterogeneous architectures has gained a reputation for being difficult, but is it really? Modern C++ has come a long way toward making parallel programming easier, and the SYCL programming model means heterogeneous programming using C++ is now more accessible than ever. SYCL uses modern standard C++, and it is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model (without vendor lock-in) is essential for modern software developers. There are multiple implementations of SYCL available, including open source projects, and in this tutorial you will join instructors who are developing some of these, alongside experienced developers from academic institutions implementing complex SYCL code bases. This tutorial will provide a way for developers to gain expertise with SYCL in a practical environment focused more on writing code than on PowerPoint. Attendees will gain a background in how the designers of the SYCL standard have addressed heterogeneous programming in C++ through industry collaboration. SYCL has gained widespread support in recent years and is available on Exascale systems, desktops, embedded systems, FPGAs, and automotive platforms. Regardless of the particular constructs of the future, the material in this course will prove timeless. This course will start by teaching the fundamentals of heterogeneous parallelism using SYCL. It will also teach you how to make use of modern C++ and the SYCL programming model to build parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. The course will teach you how to apply some common GPU optimizations.
{"title":"Tutorial: Application Development with SYCL","authors":"Rod Burns, R. Keryell, Igor Vorobtsov, Aksel Alpay, Hugh Delaney, P. Zuzek","doi":"10.1145/3529538.3530000","DOIUrl":"https://doi.org/10.1145/3529538.3530000","url":null,"abstract":"Parallel programming with heterogeneous architectures has gained a reputation for being difficult, but is it really? Modern C++ has come a long way to making parallel programming easier, and the SYCL programming model means heterogeneous programming using C++ is now more accessible than ever. SYCL uses modern standard C++, and it’s a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model (without vendor lock-in) is essential for modern software developers. There are multiple implementations of SYCL available including open source projects, and in this tutorial you will join instructors who are developing some of these alongside experienced developers from academic institutions implementing complex SYCL code bases. This tutorial will provide a way for developers to gain expertise with SYCL in a practical environment focused more on writing code than Powerpoint. Attendees will gain a background of how the designers of the SYCL standard have addressed heterogeneous programming in C++ through industry collaboration. SYCL has gained widespread support in recent years and is available on Exascale systems, desktops, embedded systems, FPGAs, and automotive platforms. Regardless of the particular constructs in the future - the material in this course will prove timeless. This course will start by teaching the fundamentals of heterogeneous parallelism using SYCL. It will also teach you how to make use of modern C++ and the SYCL programming model to build parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. The course will teach you how to apply some common GPU optimizations.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89975374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

C++OpenCL4TVM: Support C++OpenCL Kernel for TVM NN Operators
Po-Yao Chang, Tai-Liang Chen, Yu-Tse Huang, Meng-Shiun Yu, Jenq-Kuen Lee
DOI: 10.1145/3529538.3530001

In an era of artificial intelligence (AI), OpenCL serves as one of the AI frameworks' back-ends, notably for the tensor virtual machine (TVM), which focuses on the inference side of neural networks. After optimizing a computational graph, TVM traverses the internal representation, Tensor-level IR (TIR), of each neural network (NN) operator, generating an OpenCL kernel for each of them. In this work, we make TVM generate C++ for OpenCL, compile it to a SPIR-V binary, and consume it with clCreateProgramWithIL inside TVM, after we transform it by adding C++ for_each [2] and providing unseq as its argument. We also bumped into an llvm-spirv issue along the way. Finally, we found a workaround and proceeded to runnable TVM-generated C++ for OpenCL kernels.
{"title":"C++OpenCL4TVM: Support C++OpenCL Kernel for TVM NN Operators","authors":"Po-Yao Chang, Tai-Liang Chen, Yu-Tse Huang, Meng-Shiun Yu, Jenq-Kuen Lee","doi":"10.1145/3529538.3530001","DOIUrl":"https://doi.org/10.1145/3529538.3530001","url":null,"abstract":"In an era of artificial intelligence (AI), OpenCL serves as one of the AI frameworks’ back-ends, notably, the tensor virtual machine (TVM), which focuses on the inference side of neural networks. After optimizing a computational graph, TVM traverses the internal representations, Tensor-level IR (TIR), of each neural network (NN) operator generating OpenCL kernels for each one of them. In this work, we make TVM generate C++ for OpenCL, compile it to SPIR-V binary, and consume it with clCreateProgramWithIL inside TVM after we transform it by adding C[2]++ for_each and providing unseq as its argument. We also bumped into an llvm-spirv issue along the way. Finally, we found a workaround and proceeded to runnable TVM-generated C++ for OpenCL kernels.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"126 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87758620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Combined scientific CFD simulation and interactive raytracing with OpenCL
Moritz Lehmann
DOI: 10.1145/3529538.3529542

One of the main uses for OpenCL is (scientific) compute applications where graphical rendering is done externally, after the simulation has finished. However, separating simulation and rendering has many disadvantages, especially the extreme slowdown caused by copying simulation data from device to host, and the need to store raw data on the hard drive, taking up hundreds of gigabytes, just to visualize preliminary results. A much faster approach is to implement both simulation and rendering in OpenCL. The rendering kernels have direct read-only access to the raw simulation data that resides in ultra-fast GPU memory. This eliminates all PCIe data transfer except camera parameters and finished frames, allowing for interactive visualization of simulation results in real time while the simulation is running. This is an invaluable tool for rapid prototyping. Although OpenCL does not have existing functionality for graphical rendering, being a general compute language it allows for implementing an entire graphics engine, such that no data has to be moved to the CPU during rendering. On top of that, specific low-level optimizations make this OpenCL graphics engine outperform any existing rendering solution for this scenario, enabling drawing billions of lines per second and fluid raytracing in real time on even non-RTX GPUs. This combination of simulation and rendering in OpenCL is demonstrated with the software FluidX3D [3] - a lattice Boltzmann method (LBM) fluid dynamics solver. The first part will briefly introduce the numerical method for simulating fluid flow in a physically accurate manner. After introducing the LBM, the optimizations to make it run at peak efficiency are discussed: being a memory-bound algorithm, coalesced memory access is key. This is achieved through a structure-of-arrays data layout as well as the one-step-pull scheme, a particular variant of the LBM streaming step. One-step-pull leverages the fact that the misaligned read penalty is much smaller than the misaligned write penalty on almost all GPUs. Roofline analysis shows that with these optimizations, the LBM runs at 100% efficiency on the fastest data-center and gaming GPUs [5]. To simulate free surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [4]. This extremely efficient VoF-LBM GPU implementation allows covering new ground in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere when the spray droplets are generated during drop impact [6]. At the same power consumption, with existing CPU-parallelized codes, compute time would have been several years, whilst with FluidX3D it was about a week. The second part will focus on real time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but line
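
To see why the data layout matters for coalescence (a schematic of our own, assuming the common D3Q19 velocity set with q = 19 values per cell): with one work-item per lattice cell, adjacent work-items only touch adjacent addresses under a structure-of-arrays layout, which is the access pattern GPUs coalesce.

    #include <cstddef>

    constexpr std::size_t Q = 19;  // D3Q19 discrete velocity set (assumed here)

    // Array-of-structures: one cell's Q values are contiguous, so adjacent
    // cells (work-items) are Q floats apart -> strided, uncoalesced access.
    inline std::size_t idx_aos(std::size_t cell, std::size_t dir,
                               std::size_t /*n_cells*/) {
        return cell * Q + dir;
    }

    // Structure-of-arrays: one direction's values are contiguous across all
    // cells, so adjacent cells are 1 float apart -> coalesced access.
    inline std::size_t idx_soa(std::size_t cell, std::size_t dir,
                               std::size_t n_cells) {
        return dir * n_cells + cell;
    }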

Using interoperability mode in SYCL 2020
Aksel Alpay, T. Applencourt, Gordon Brown, R. Keryell, G. Lueck
DOI: 10.1145/3529538.3529997

SYCL is a programming standard targeting hardware platforms with a host connected to various heterogeneous accelerators. Both the host and accelerator parts of the computation are expressed in a single-source modern C++ program. While previous versions of the SYCL standard were based only on top of the OpenCL standard to control the accelerators, starting with SYCL 2020 the standard is independent from OpenCL and can target different APIs, described with the concept of backends. Some SYCL implementations can thus target various lower-level APIs, like OpenCL, CUDA, Level Zero, HIP, XRT, Vulkan, etc., possibly with different backends used at the same time in the same application. Even though the SYCL standard strives to abstract the generic principles used in heterogeneous programming with C++ classes and functions, real applications often need to use specific details of a given architecture to benefit fully from an accelerator, or need to be integrated into a wider framework, including parts implemented in other languages and other APIs for heterogeneous computing. This is possible in SYCL through the less-known but powerful concept of interoperability, which is introduced at different levels. On one hand, by accessing native backend objects from SYCL objects, it is possible to use the native API in a SYCL program, for example by calling existing optimized libraries (mathematical libraries, machine learning, video codecs, etc.) to simplify application development and reach maximum performance. In that case it is possible, for example, to get from a sycl::queue a native queue from the backend, to be used to enqueue a library function. On the other hand, it is possible to use a part of the application written in SYCL from another part of the application using another API, by using SYCL interoperability functions to construct SYCL objects like sycl::device or sycl::queue from the equivalent native objects of the lower-level API backend used in the main part of the program. Another feature of SYCL 2020 interoperability is the ability to schedule backend API operations within the SYCL task DAG using host task interoperability. In SYCL, host tasks allow the user to enqueue an arbitrary C++ function within the SYCL DAG, and host tasks have an optional interoperability handle which provides access to the native backend queue, device, and memory objects at that point in the DAG. This feature is very powerful, as it allows a SYCL application to interoperate with backend-specific libraries such as BLAS or DNN libraries. Finally, SYCL interoperability allows calling backend-specific kernel functions written in the backend kernel language, such as OpenCL or CUDA, via backend-specific functions when generating a kernel_bundle, which can then be invoked via a SYCL queue. Some implementations also go beyond the standard and provide native functions directly callable from a plain SYCL kernel. SYCL can also be used to simplify the direct use of a lower-leve
{"title":"Using interoperability mode in SYCL 2020","authors":"Aksel Alpay, T. Applencourt, Gordon Brown, R. Keryell, G. Lueck","doi":"10.1145/3529538.3529997","DOIUrl":"https://doi.org/10.1145/3529538.3529997","url":null,"abstract":"SYCL is a programming standard targeting hardware platforms with a host connected to various heterogeneous accelerators. Both the host and accelerator parts of the computation are expressed in a single-source modern C++ program. While the previous versions of the SYCL standard were based only on top of the OpenCL standard to control the accelerators, starting with SYCL 2020, the standard is independent from OpenCL and can target different API, described with the concept of backend. Some SYCL implementations can thus target today various lower-level API, like OpenCL, CUDA, Level0, HIP, XRT, Vulkan, etc. with possibly different backends used at the same time in the same application. Even if the SYCL standard thrive to abstract the generic principles used in heterogeneous programming with C++ classes and functions, real applications require often to use specific details of a given architecture to benefit fully from an accelerator or need to be into integrated into a wider framework, including parts implemented in other languages and other API for heterogeneous computing. This is possible in SYCL with a less-know but powerful concept of interoperability, which is introduced at different levels. On one hand, by accessing some native backend objects from SYCL objects, it is possible to use in a SYCL program the native API, for example by calling some existing optimized libraries like mathematical libraries, machine learning, video CODEC, etc. to simplify the application development and reach the maximum performance. In that case it is for example possible to get from a sycl::queue a native queue from the backend to be used to enqueue a library function. On the other hand, it is possible to use a part of the application written in SYCL from another part of the application using another API by using SYCL interoperability functions to constructs SYCL objects like sycl::device or sycl::queue from native equivalent objects from the lower-level API backend used in the main part of the program. Another feature of SYCL 2020 interoperability is the ability to schedule backend API operations within the SYCL task DAG using host task interoperability. In SYCL, host tasks allow the user to enqueue an arbitrary C++ function within the SYCL DAG and host tasks have an optional interoperability handle which provides access to the native backend queue, device and memory objects at that point in the DAG. This feature is very powerful as it allows a SYCL application to interoperate with backend-specific libraries such as BLAS or DNN libraries. Finally, SYCL interoperability allows for calling backend-specific kernel functions in the backend kernel language such as OpenCL or CUDA via backend-specific functions when generating a kernel_bundle, which can be invoked via a SYCL queue. Some implementations can also go beyond the standard and provide some native functions directly callable from a plain SYCL kernel. 
SYCL can also be used to simplify the direct use of a lower-leve","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88773793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
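
A minimal sketch of the two directions described above, assuming a queue backed by the OpenCL backend (error handling omitted):

    #include <sycl/sycl.hpp>
    #include <CL/cl.h>

    int main() {
        sycl::queue q;

        // SYCL -> native: obtain the backend's cl_command_queue, e.g. to pass
        // to an existing OpenCL-based library.
        cl_command_queue native_q = sycl::get_native<sycl::backend::opencl>(q);
        (void)native_q;  // unused in this sketch

        // Native -> SYCL works the other way, e.g. via
        // sycl::make_queue<sycl::backend::opencl>(...).

        // Host task interoperability: run backend API calls at a defined point
        // in the SYCL task DAG via the interop handle.
        q.submit([&](sycl::handler& h) {
            h.host_task([=](sycl::interop_handle ih) {
                cl_command_queue cq =
                    ih.get_native_queue<sycl::backend::opencl>();
                // ... enqueue library work (BLAS, DNN, codecs, ...) on cq ...
                (void)cq;
            });
        });
        q.wait();
    }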

Towards a Portable Drug Discovery Pipeline with SYCL 2020
Luigi Crisci, Majid Salimi Beni, Biagio Cosenza, Nicolò Scipione, D. Gadioli, E. Vitali, G. Palermo, A. Beccari
DOI: 10.1145/3529538.3529688

The outcome of the drug discovery process is a molecule that has a strong interaction with the target protein; domain experts expect a beneficial effect from this interaction. Virtual screening is one of the early stages of the process, and it aims at finding promising molecules to forward to later stages. We perform this task in silico to evaluate a very large chemical library in a short time frame. This activity typically comprises two compute-intensive tasks: a docking function that predicts the displacement of atoms, and a scoring function that estimates the interaction strength [6]. Dompé Farmaceutici led the development of LiGen [1, 2, 3], a molecular docking platform targeting High-Performance Computing systems. LiGen has been used for the discovery of novel treatments in the fight against viral infections and multidrug-resistant bacteria [4]. The LiGen processing pipeline includes two main components, ligen-dock and ligen-score, originally developed in OpenACC and refactored to CUDA using non-portable, target-specific optimizations [7]. In this talk, we discuss the challenges of making the LiGen docking pipeline portable among different accelerators and GPUs by porting the original codebase from CUDA to SYCL. The code has been refactored by replacing critical CUDA semantics with portable ones, and by exploiting several features from the SYCL 2020 standard [5], including sub-groups, group algorithms, and Unified Shared Memory. For comparison, we have developed two versions based on accessor-based and USM-based memory accesses, respectively. Particular effort has been spent on kernel tuning, in particular to optimize those kernels with high register pressure. The final SYCL code base, comprising more than 20 SYCL kernels, has been evaluated on several architectures including NVIDIA V100, NVIDIA A100, AMD MI100, and Intel Xeon, using both the hipSYCL and Intel DPC++ compilers. In terms of performance portability, the SYCL implementation achieves performance similar to the native CUDA version on NVIDIA V100 and AMD MI100, with minimal modification needed.
{"title":"Towards a Portable Drug Discovery Pipeline with SYCL 2020","authors":"Luigi Crisci, Majid Salimi Beni, Biagio Cosenza, Nicolò Scipione, D. Gadioli, E. Vitali, G. Palermo, A. Beccari","doi":"10.1145/3529538.3529688","DOIUrl":"https://doi.org/10.1145/3529538.3529688","url":null,"abstract":"The outcome of the drug discovery process is a molecule that has strong interaction with the target protein. Domain experts expect a beneficial effect from this interaction. The virtual screening is one of the early stages of the process and it aims at finding promising molecules to forward to later stages. We perform this task in-silico to evaluate a very large chemical library in a short time frame. This activity typically comprises two compute-intensive tasks: a docking function that predicts the displacement of atoms, and a scoring function, which estimates the interaction strength [6] Dompé Farmaceutici led the development of LiGen [1, 2, 3], a molecular docking platform targeting High-Performance Computing systems. LiGen has been used for the discovery of novel treatments in the fight against viral infections and multidrug-resistant bacteria [4]. The LiGen processing pipeline includes two main components, ligen-dock and ligen-score, originally developed in OpenACC, refactored to CUDA using non-portable target-specific optimizations [7]. In this talk, we discuss the challenges of making the LiGen docking pipeline portable among different accelerators and GPUs by porting the original codebase from CUDA to SYCL. The code has been refactored by removing critical CUDA semantics with portable ones, and by exploiting several features from the SYCL 2020 standard [5], including sub-groups, group algorithms, and Unified Shared Memory. For comparison, we have developed two versions based on, respectively, accessor and USM-based memory accesses. Particular efforts have been spent on kernel tuning, in particular to optimize those kernels with high register pressure. The final SYCL code base, comprising more than 20 SYCL kernels, has been evaluated on several architectures including NVIDIA V100, NVIDIA A100, AMD MI100 as well as Intel Xeon, and by using both HipSYCL and Intel DPC++ compiler. In terms of performance portability, the SYCL implementation achieves similar performance compared to the CUDA native version on NVIDIA V100 and AMD M100, with minimal modification needed.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":"6 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78336465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance analysis of matrix-free conjugate gradient kernels using SYCL
I. Baratta, C. Richardson, G. N. Wells
DOI: 10.1145/3529538.3529993

We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual-socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations on the CPU node. All considered kernels in the matrix-free implementation are memory-bandwidth limited, and a simple performance model is applied to estimate the asymptotic memory bandwidth and the latency. Our experiments show that in most cases the considered SYCL implementations match the asymptotic performance of the reference implementations. However, for smaller but practically relevant problem sizes, latency is observed to have a significant impact on performance. In some cases the SYCL latency is reasonably close to that of the reference (cuBLAS/MKL) implementations, but in other cases it is more than one order of magnitude greater. In particular, SYCL built-in reductions on the GPU, and all operations for one of the SYCL implementations on the CPU, exhibit high latency; this latency limits performance at problem sizes that can in some cases be representative of full application simulations, and can degrade strong-scaling performance.

Exploring SYCL SC
Verena Beckham, Ken Wenger
DOI: 10.1145/3529538.3530006

Khronos already has two safety-critical variants of its APIs: OpenGL SC for graphics and Vulkan SC for graphics and low-level compute. In addition, OpenVX 1.3 has defined a safety-critical feature set for AI/vision applications. However, in the safety space there is currently no high-level compute API for developing the complex algorithms of tomorrow. By implementing SYCL SC on top of Vulkan SC and building on safety-certified drivers and hardware, the whole stack, all the way from hardware to application, can be safety certified. SYCL SC will also unlock the ecosystem of libraries and tools that already exists for SYCL for the safety-critical domain. Frameworks, such as AI/ML frameworks, can be built on top of SYCL SC in the future to provide even higher levels of abstraction. This presentation will talk about the aims of the new standard, which are aligned with Vulkan SC, and will touch on some initial design ideas, with a focus on deterministic rather than dynamic behavior. We suggest the removal of some SYCL features that are unlikely to be used in a safety context, to facilitate the safety certification of the runtime itself, and the addition of extensions that provide functionality useful where safety is critical, such as those related to the timing of kernels. We will discuss the importance of MISRA's C++ guidelines, particularly the upcoming MISRA C++ 202x standard, for applications and hence APIs, whilst acknowledging the need to remain as compatible as possible with standard SYCL. We set up the Exploratory Forum to collect feedback from potential runtime implementers, library developers, researchers, and users on their requirements for a high-level compute API in a safety-critical context. The Exploratory Forum is open to non-Khronos members under the Khronos NDA, and we actively encourage non-members to participate. Once a wide range of requirements has been collected, the next step is the creation of a new Khronos group, which would work towards a specification. This presentation will describe the issues that organizations are facing that can be solved through the new standard, and provoke discussion on how to develop an API that will meet the needs of the safety-critical community. After the presentation we invite the audience to join the Exploratory Forum to talk about their own requirements and experiences, as well as to collaborate on developing a framework for the new standard to be defined.