
International Workshop on OpenCL: Latest Publications

SYCLops: A SYCL Specific LLVM to MLIR Converter
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529992
Alexandre Singer, Frank Gao, Kai-Ting Amy Wang
There is a growing need for higher level abstractions for device kernels in heterogeneous environments, and the multi-level nature of the MLIR infrastructure perfectly addresses this requirement. As SYCL begins to gain industry adoption for heterogeneous applications and MLIR continues to develop, we present SYCLops: a converter capable of translating SYCL specific LLVM IR to MLIR. This will allow for both target and application specific optimizations within the same framework to exploit opportunities for improvement present at different levels.
Citations: 2
OpenCL Command-buffer Extension: Design and Implementation
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529979
Ewan W. Crawford, J. Frankland
OpenCL1 allows a programmer to offload a sequence of commands to a heterogeneous accelerator, such as a GPU. For embedded devices the overhead of building a command sequence can be expensive, and many applications require the same pipeline of commands to be repeatedly enqueued in a loop: in computer vision, for example, the same command sequence is used to process different image inputs. In OpenCL, command recording is tied to submission: a clEnqueueCommand API invocation will both create a command and schedule it for execution, meaning that for groups of commands enqueued in a loop the cost of building the command sequence is unnecessarily incurred over and over again. An alternative OpenCL API mechanism for defining the command list would remove this overhead from repeated command sequences, regardless of the target OpenCL device. The cl_khr_command_buffer[2] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer, which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are only incurred once. Additionally, optimization opportunities are introduced at the point of finalization, after which no more commands can be recorded and the command-buffer is made ready for execution. After finalization, the command-buffer can be asynchronously dispatched with minimal runtime overhead. This separation of concerns enables pipelined workflows common in machine learning applications by eliminating the latency of having to wait on the host to construct commands again for a similar workload. In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design.
This includes a comparison with the Vulkan2 command-buffer abstraction[4], which shows that this approach is successful in the real world. The design decision to introduce new entry-points, rather than reusing existing command-queue entry-points with begin/end markers, is also covered, as is the reason mechanisms for host-side synchronization were omitted from the new entry-points. The intended layering of future extensions on top of cl_khr_command_buffer is another topic raised, along with why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer applicable to a wide range of vendors. Plans for the upcoming extensions layered on top will also be outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue and provide mutability of the command-buffer between submissions. The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta3[1], Codeplay's OpenCL implementation, and how this fed back into the extension specification, for example, the simultaneous-use capability that allows several instances of a command-buffer to be submitted at once. We give a high-level overview of how command-buffers are implemented in ComputeAorta through Codeplay's proprietary ComputeMux API, using the same machinery as regular command-queues, and detail some common pitfalls vendors may face when implementing command-buffers versus regular OpenCL commands.
Citations: 2
Untangling Modern Parallel Programming Models
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529987
M. Kinsner, Ben Ashbaugh, James C. Brodman, G. Lueck, S. Pennycook, Roland Schulz
Modern hardware is increasingly rich in diversity, including CPUs, GPUs, FPGAs and more, with new and novel architectures constantly emerging. To differentiate these devices, each is typically built around architectures optimized for some classes of application or some patterns of parallelism. Numerous computational cores, varying levels of hardware vectorization, and other degrees of architectural freedom exist across the many hardware options. The need to efficiently utilize diverse hardware has led to the emergence of a wide variety of programming models, execution models, and languages, and has simultaneously led to a complex landscape of confused and often conflicting terminology and abstractions. This reality makes it challenging for developers to comprehend and then choose a programming model that fits with their applications and mental model, particularly when more than one target architecture or vendor is of interest. This talk strives to untangle the landscape of modern parallel programming models, to help developers understand how the models and options relate to each other, and to frame how to think about their specific algorithms when expressing them in code. Although experienced developers typically understand much of the terminology and the relationships between models, a holistic presentation of the material is of strong value, as evidenced by feedback from parallel programming experts who have seen previews of this presentation. To begin, a brief overview will be presented to frame parallel programming and offload compute programming models, followed by characterization of the Single Program Multiple Data (SPMD) abstract model and the power it exhibits when mapping to multiple classes of architecture. We will discuss how fundamental design decisions within a compiler impact the mapping from source code to an underlying programming model, highlighting that the same code can be lowered to multiple models.
This is particularly relevant in the presence of vector data types, which permit multiple interpretations and are a common cause of confusion. A core element of the presentation is a decomposition of how the programming model and the design assumptions of a compiler are ideally understood concurrently by developers to streamline the creation and tuning of performant code. SPMD and explicit Single Instruction Multiple Data (SIMD) programming models will be discussed relative to the Khronos OpenCL and SYCL standards, as well as to OpenMP and CUDA, with the aim of clarifying the concepts and models for developers working in specific languages. The talk will conclude with an overview of an experimental extension to SYCL that proposes a mechanism for mixing SPMD and explicit SIMD programming styles with clear semantics and boundaries in code. The talk will show that providing clear points of transition with clear semantics can enable expert tuning at the granularity of a single line of code, without breaking the SPMD programming abstraction used by the rest of the kernel. Parallel programming models such as SPMD and SIMD are central to modern heterogeneous compute architectures. Combined with the decisions made during the implementation of a specific compiler, developers face a complex task in understanding how these concepts and hardware mappings interact. This talk describes the most common programming models exposed through SYCL, OpenCL, OpenMP and CUDA, aiming to clear up misconceptions and confusion about software-to-hardware mapping. Attendees will come away with a holistic view of how SPMD and SIMD-like programming models fit together, and how they relate to the code many of us write every day.
Citations: 0
Tutorial: Application Development with SYCL
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530000
Rod Burns, R. Keryell, Igor Vorobtsov, Aksel Alpay, Hugh Delaney, P. Zuzek
Parallel programming with heterogeneous architectures has gained a reputation for being difficult, but is it really? Modern C++ has come a long way toward making parallel programming easier, and the SYCL programming model means heterogeneous programming using C++ is now more accessible than ever. SYCL uses modern standard C++, and it is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, FPGAs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model (without vendor lock-in) is essential for modern software developers. There are multiple implementations of SYCL available, including open source projects, and in this tutorial you will join instructors who are developing some of these, alongside experienced developers from academic institutions implementing complex SYCL code bases. This tutorial will provide a way for developers to gain expertise with SYCL in a practical environment focused more on writing code than PowerPoint. Attendees will gain a background in how the designers of the SYCL standard have addressed heterogeneous programming in C++ through industry collaboration. SYCL has gained widespread support in recent years and is available on Exascale systems, desktops, embedded systems, FPGAs, and automotive platforms. Regardless of the particular constructs of the future, the material in this course will prove timeless. This course will start by teaching the fundamentals of heterogeneous parallelism using SYCL. It will also teach you how to make use of modern C++ and the SYCL programming model to build parallel algorithms for heterogeneous devices. Most of the programming focus will be on GPUs, but some time will be spent applying the techniques to simple FPGA examples. The course will teach you how to apply some common GPU optimizations.
Citations: 0
C++OpenCL4TVM: Support C++OpenCL Kernel for TVM NN Operators
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530001
Po-Yao Chang, Tai-Liang Chen, Yu-Tse Huang, Meng-Shiun Yu, Jenq-Kuen Lee
In an era of artificial intelligence (AI), OpenCL serves as one of the AI frameworks' back-ends, notably for the tensor virtual machine (TVM), which focuses on the inference side of neural networks. After optimizing a computational graph, TVM traverses the internal representation, Tensor-level IR (TIR), of each neural network (NN) operator, generating an OpenCL kernel for each of them. In this work, we make TVM generate C++ for OpenCL, compile it to a SPIR-V binary, and consume it with clCreateProgramWithIL inside TVM, after transforming it by adding C++ for_each[2] and providing unseq as its argument. We also ran into an llvm-spirv issue along the way. Finally, we found a workaround and arrived at runnable TVM-generated C++ for OpenCL kernels.
Citations: 0
Combined scientific CFD simulation and interactive raytracing with OpenCL
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529542
Moritz Lehmann
One of the main uses for OpenCL is (scientific) compute applications where graphical rendering is done externally, after the simulation has finished. However, separating simulation and rendering has many disadvantages, especially the extreme slowdown caused by copying simulation data from device to host, and the need to store raw data on the hard drive, taking up hundreds of gigabytes, just to visualize preliminary results. A much faster approach is to implement both simulation and rendering in OpenCL. The rendering kernels have direct read-only access to the raw simulation data that resides in ultra-fast GPU memory. This eliminates all PCIe data transfer except for camera parameters and finished frames, allowing for interactive visualization of simulation results in real time while the simulation is running. This is an invaluable tool for rapid prototyping. Although OpenCL does not have existing functionality for graphical rendering, being a general compute language, it allows for implementing an entire graphics engine, such that no data has to be moved to the CPU during rendering. On top of that, specific low-level optimizations make this OpenCL graphics engine outperform any existing rendering solution for this scenario, enabling drawing billions of lines per second and fluid raytracing in real time on even non-RTX GPUs. This combination of simulation and rendering in OpenCL is demonstrated with the software FluidX3D [3], a lattice Boltzmann method (LBM) fluid dynamics solver. The first part will briefly introduce the numerical method for simulating fluid flow in a physically accurate manner. After introducing the LBM, the optimizations that make it run at peak efficiency are discussed: being a memory-bound algorithm, coalesced memory access is key. This is achieved through an array-of-structures data layout as well as the one-step-pull scheme, a certain variant of the LBM streaming step.
One-step-pull leverages the fact that the misaligned read penalty is much smaller than the misaligned write penalty on almost all GPUs. Roofline analysis shows that with these optimizations, the LBM runs at 100% efficiency on the fastest data-center and gaming GPUs [5]. To simulate free surface flows, the LBM is extended with the Volume-of-Fluid (VoF) model. An efficient algorithm has been designed to vastly accelerate the challenging surface tension computation [4]. This extremely efficient VoF-LBM GPU implementation allows covering new ground in science: FluidX3D has been used to simulate more than 1600 raindrop impacts to statistically evaluate how microplastics transition from the ocean surface into the atmosphere when spray droplets are generated during drop impact [6]. At the same power consumption, with existing CPU-parallelized codes, compute time would have been several years, whilst with FluidX3D it was about a week. The second part will focus on real-time rendering with OpenCL, especially raytracing. Rasterization on the GPU is parallelized not over pixels but over lines/triangles, making runtime largely independent of screen resolution and extremely fast. Using the camera parameters, each line/triangle is transformed from 3D to 2D screen coordinates and then rasterized onto the frame (an integer array) with Bresenham's algorithm [2] and a z-buffer. The raytracing graphics are based on a combination of fast ray-grid traversal and marching cubes, leveraging the fact that the computational grid from the LBM is already an ideal acceleration structure for raytracing. The idea of raytracing is simple: through each pixel on the screen, shoot a reverse light ray out of the camera and see where it intersects a surface in the scene, then (recursively) compute reflected/refracted rays and mix the colors. If a ray does not intersect anything, its color is determined from the skybox image via UV mapping and bilinear pixel interpolation. For a mesh surface consisting of many triangles, compute time quickly becomes a problem, as every ray would have to be intersection-tested against every triangle, so an acceleration structure is required. While computer games often use bounding volume hierarchies, the LBM already provides an ideal alternative: the simulation grid. The corresponding algorithm is called ray-grid traversal: as a ray traverses the 3D grid, surface intersections only have to be checked within each traversed grid cell rather than across the entire grid. In each traversed cell, 0-5 surface triangles are generated on the fly with the marching-cubes algorithm, and ray-triangle intersections are checked with the Möller-Trumbore algorithm. Only when an intersection is found are the normals computed on the 8 grid points spanning the cell and trilinearly interpolated to the intersection coordinates; such interpolated surface normals make the raytraced surface appear perfectly smooth.
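The rasterizer described in the abstract parallelizes over lines and draws each one with Bresenham's algorithm [2]. A minimal host-side sketch of the algorithm (illustrative only; in the engine each GPU thread rasterizes one line directly into the frame buffer with a z-buffer test, rather than collecting pixels into a vector):

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Sketch of Bresenham line rasterization: returns the integer pixel
// coordinates covering the line from (x0, y0) to (x1, y1), inclusive.
std::vector<std::pair<int, int>> bresenham(int x0, int y0, int x1, int y1) {
    std::vector<std::pair<int, int>> pixels;
    int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;  // error term combining both axes
    while (true) {
        pixels.push_back({x0, y0});
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
    return pixels;
}
```

Because the work is distributed per line rather than per pixel, the cost depends on the number and length of lines, which is why the runtime is essentially independent of screen resolution.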
Citations: 1
Using interoperability mode in SYCL 2020
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529997
Aksel Alpay, T. Applencourt, Gordon Brown, R. Keryell, G. Lueck
SYCL is a programming standard targeting hardware platforms with a host connected to various heterogeneous accelerators. Both the host and accelerator parts of the computation are expressed in a single-source modern C++ program. While previous versions of the SYCL standard were based only on top of the OpenCL standard to control the accelerators, starting with SYCL 2020 the standard is independent from OpenCL and can target different APIs, described with the concept of a backend. Some SYCL implementations can thus target various lower-level APIs today, like OpenCL, CUDA, Level0, HIP, XRT, Vulkan, etc., with possibly different backends used at the same time in the same application. Even though the SYCL standard strives to abstract the generic principles used in heterogeneous programming with C++ classes and functions, real applications often require using specific details of a given architecture to benefit fully from an accelerator, or need to be integrated into a wider framework, including parts implemented in other languages and other APIs for heterogeneous computing. This is possible in SYCL with a less-known but powerful concept of interoperability, which is introduced at different levels. On one hand, by accessing some native backend objects from SYCL objects, it is possible to use the native API in a SYCL program, for example by calling some existing optimized libraries like mathematical libraries, machine learning, video CODECs, etc., to simplify the application development and reach maximum performance. In that case it is, for example, possible to get from a sycl::queue a native backend queue to be used to enqueue a library function.
On the other hand, it is possible to use a part of the application written in SYCL from another part of the application using another API, by using SYCL interoperability functions to construct SYCL objects like sycl::device or sycl::queue from equivalent native objects from the lower-level API backend used in the main part of the program. Another feature of SYCL 2020 interoperability is the ability to schedule backend API operations within the SYCL task DAG using host task interoperability. In SYCL, host tasks allow the user to enqueue an arbitrary C++ function within the SYCL DAG, and host tasks have an optional interoperability handle which provides access to the native backend queue, device and memory objects at that point in the DAG. This feature is very powerful as it allows a SYCL application to interoperate with backend-specific libraries such as BLAS or DNN libraries. Finally, SYCL interoperability allows for calling backend-specific kernel functions written in the backend kernel language, such as OpenCL or CUDA, via backend-specific functions when generating a kernel_bundle, which can be invoked via a SYCL queue. Some implementations can also go beyond the standard and provide some native functions directly callable from a plain SYCL kernel. SYCL can also be used to simplify the direct use of a lower-level API, as a higher-level C++ wrapper removing a lot of the boilerplate code required by the lower-level API. Since the interoperability mode of sycl::buffer and sycl::accessor can be used, code using a native API can benefit from the implicit data-dependency task graph and the automatic overlap of computation and implicit communication provided by the SYCL programming model. Having all these interoperability modes in SYCL allows leveraging other existing interoperability modes and building complex interoperability paths between several frameworks or standards inside a single application. For example, in HPC, a SYCL application can interoperate with an OpenMP library through a common backend to use parallelism in a cooperative way, or can use an OpenCL backend for high-performance graphics rendering with Vulkan through OpenCL-Vulkan interoperability. A multimedia application can use a SYCL-OpenCL-OpenGL-DX12 path for image processing on native images.
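To make the two directions of interoperability concrete, here is an illustrative sketch only (it assumes a SYCL 2020 implementation with an OpenCL backend and will not compile without one; the actual library call is left as a placeholder) of obtaining a native queue with sycl::get_native and of scheduling native work inside the SYCL DAG with a host task and its interop handle:

```cpp
// Illustrative sketch, not from any particular application.
#include <sycl/sycl.hpp>

void call_native_library(sycl::queue& q) {
  // Direction 1: get the native backend object behind a SYCL object,
  // e.g. the OpenCL command queue, to hand to an existing native library.
  cl_command_queue native_q = sycl::get_native<sycl::backend::opencl>(q);
  (void)native_q;  // e.g. pass to the clEnqueue... calls of an OpenCL library

  // Direction 2: schedule native work inside the SYCL task DAG with a
  // host task; the interop handle exposes the backend queue and memory
  // objects at exactly that point in the DAG, so data dependencies are
  // still tracked by the SYCL runtime.
  sycl::buffer<float, 1> buf{sycl::range<1>{1024}};
  q.submit([&](sycl::handler& h) {
    sycl::accessor acc{buf, h, sycl::read_write};
    h.host_task([=](sycl::interop_handle ih) {
      auto cl_q   = ih.get_native_queue<sycl::backend::opencl>();
      auto cl_mem = ih.get_native_mem<sycl::backend::opencl>(acc);
      // ... call a backend-specific library (BLAS, DNN, ...) here ...
      (void)cl_q; (void)cl_mem;
    });
  });
}
```

The reverse direction uses the sycl::make_queue / sycl::make_device family of functions to wrap existing native objects in SYCL objects.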
Citations: 1
Towards a Portable Drug Discovery Pipeline with SYCL 2020
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529688
Luigi Crisci, Majid Salimi Beni, Biagio Cosenza, Nicolò Scipione, D. Gadioli, E. Vitali, G. Palermo, A. Beccari
The outcome of the drug discovery process is a molecule that has strong interaction with the target protein. Domain experts expect a beneficial effect from this interaction. Virtual screening is one of the early stages of the process, and it aims at finding promising molecules to forward to later stages. We perform this task in silico to evaluate a very large chemical library in a short time frame. This activity typically comprises two compute-intensive tasks: a docking function that predicts the displacement of atoms, and a scoring function, which estimates the interaction strength [6]. Dompé Farmaceutici led the development of LiGen [1, 2, 3], a molecular docking platform targeting High-Performance Computing systems. LiGen has been used for the discovery of novel treatments in the fight against viral infections and multidrug-resistant bacteria [4]. The LiGen processing pipeline includes two main components, ligen-dock and ligen-score, originally developed in OpenACC and refactored to CUDA using non-portable target-specific optimizations [7]. In this talk, we discuss the challenges of making the LiGen docking pipeline portable among different accelerators and GPUs by porting the original codebase from CUDA to SYCL. The code has been refactored by replacing critical CUDA semantics with portable ones, and by exploiting several features from the SYCL 2020 standard [5], including sub-groups, group algorithms, and Unified Shared Memory. For comparison, we have developed two versions based on accessor-based and USM-based memory accesses, respectively. Particular effort has been spent on kernel tuning, in particular to optimize those kernels with high register pressure. The final SYCL code base, comprising more than 20 SYCL kernels, has been evaluated on several architectures including NVIDIA V100, NVIDIA A100, AMD MI100 as well as Intel Xeon, using both the hipSYCL and Intel DPC++ compilers. In terms of performance portability, the SYCL implementation achieves performance similar to the native CUDA version on NVIDIA V100 and AMD MI100, with minimal modification needed.
Citations: 3
Performance analysis of matrix-free conjugate gradient kernels using SYCL
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529993
I. Baratta, C. Richardson, G. N. Wells
We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual-socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations on the CPU node. All considered kernels in the matrix-free implementation are memory-bandwidth limited, and a simple performance model is applied to estimate the asymptotic memory bandwidth and the latency. Our experiments show that in most cases the considered SYCL implementations match the asymptotic performance of the reference implementations. However, for smaller but practically relevant problem sizes, latency is observed to have a significant impact on performance. For some cases the SYCL latency is reasonably close to the reference (cuBLAS/MKL) implementation latency, but in other cases it is more than one order of magnitude greater. In particular, SYCL built-in reductions on the GPU and all operations for one of the SYCL implementations on the CPU exhibit high latency, and this latency limits performance at problem sizes that can in some cases be representative of full application simulations, and can degrade strong scaling performance.
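The matrix-free structure under discussion — the operator is applied as a function rather than stored as a sparse matrix — can be sketched in plain host-side C++ (a simplified illustration under that assumption, not the paper's SYCL implementation):

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Matrix-free conjugate gradient: A is supplied as a function applying A*x,
// so the matrix is never stored. Each iteration is built from the same
// memory-bound kernels (operator apply, dot products, axpy updates) whose
// bandwidth and latency the paper analyses.
using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b for a symmetric positive-definite operator A, from x = 0.
Vec conjugate_gradient(const std::function<Vec(const Vec&)>& A,
                       const Vec& b, int max_iter, double tol) {
    Vec x(b.size(), 0.0), r = b, p = r;    // residual r = b - A*0 = b
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        Vec Ap = A(p);
        double alpha = rr / dot(p, Ap);    // step length along p
        for (size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;         // conjugation coefficient
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

The dot products are the reductions whose built-in SYCL implementations the paper finds to be latency-limited at small problem sizes.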
Citations: 0
Exploring SYCL SC
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530006
Verena Beckham, Ken Wenger
Khronos already has two safety-critical variants of APIs: OpenGL SC for graphics and Vulkan SC for graphics and low-level compute. In addition, OpenVX 1.3 has defined a safety-critical feature set for AI/Vision applications. However, in the safety space there is currently no high-level compute API to develop the complex algorithms of tomorrow. By implementing SYCL SC on top of Vulkan SC and building on safety certified drivers and hardware, the whole stack, all the way from hardware to application, can be safety certified. SYCL SC will also unlock the ecosystem of libraries and tools that already exist for SYCL for the safety-critical domain. Frameworks such as AI/ML frameworks can be built on top of SYCL SC in the future, to provide even higher levels of abstraction. This presentation will talk about the aims of the new standard, which are aligned with Vulkan SC. It will also touch on some initial design ideas, with a focus on deterministic rather than dynamic behavior. We suggest the removal of some SYCL features that are unlikely to be used in a safety context, to facilitate the safety certification of the runtime itself, and the addition of extensions that provide functionality that is useful in a context where safety is critical, such as related to the timing of kernels. We will discuss the importance of Misra's C++ guidelines, particularly the upcoming Misra C++ 202x standard, for applications and hence APIs, whilst acknowledging the need to remain compatible with standard SYCL as much as possible. We set up the Exploratory Forum to collect feedback from potential runtime implementers, library developers, researchers and users on what their requirements for a high-level compute API in a safety-critical context are.
Once a wide range of requirements has been collected the next step is the creation of a new Khronos group, which would work towards a specification. This presentation will describe the issues that organizations are facing that can be solved through the new standard and provoke discussion on how to develop an API that will meet the needs of the safety-critical community. After the presentation we invite the audience to join the Exploratory Forum to talk about their own requirements and experiences as well as collaborate to develop a framework for the new standard to be defined.
Citations: 1