OpenCL Command-buffer Extension: Design and Implementation

International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529979

Ewan W. Crawford, J. Frankland


Citations: 2

Abstract



OpenCL allows a programmer to offload a sequence of commands to a heterogeneous accelerator, such as a GPU. For embedded devices the overhead of building a command sequence can be expensive, and many applications require the same pipeline of commands to be repeatedly enqueued in a loop, for example in computer vision, where the same command sequence is used to process different image inputs. In OpenCL, command recording is tied to submission: a clEnqueueCommand API invocation both creates a command and schedules it for execution, so for groups of commands enqueued in a loop the cost of building the command sequence is unnecessarily incurred over and over again. An alternative OpenCL API mechanism for defining the command list would remove this overhead from repeated command sequences, regardless of the target OpenCL device.

The cl_khr_command_buffer[2] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer, which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are incurred only once. Additionally, optimization opportunities are introduced at the point of finalization, after which no more commands can be recorded and the command-buffer is made ready for execution. After finalization, the command-buffer can be asynchronously dispatched with minimal runtime overhead. This separation of concerns enables the pipelined workflows common in machine learning applications by eliminating the latency of waiting on the host to construct commands again for a similar workload.

In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design. This includes a comparison with the Vulkan command-buffer abstraction[4], which shows that this approach is successful in the real world. We also cover the design decision to introduce new entry-points rather than reuse existing command-queue entry-points with begin/end markers, as well as why mechanisms for host-side synchronization were omitted from the new entry-points. Another topic is the intended layering of future extensions on top of cl_khr_command_buffer, and why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer that is applicable to a wide range of vendors. Plans for the upcoming extensions layered on top are also outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue and provide mutability of the command-buffer between submissions.

The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta[1], Codeplay’s OpenCL implementation, and how this fed back into the extension specification, for example when implementing the simultaneous-use capability that allows more than one submission of a command-buffer instance to be in flight at once. We provide a high-level overview of how command-buffers in ComputeAorta are implemented using the same machinery as regular command enqueues via Codeplay’s proprietary ComputeMux API, and detail some of the common pitfalls and gotchas a vendor may face when implementing command-buffers versus regular OpenCL commands.
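The record-once, finalize, dispatch-many lifecycle described in the abstract can be illustrated with a toy host-side model in C. This is only a sketch under stated assumptions: it does not call the OpenCL API, it runs commands synchronously on the host rather than asynchronously on a device, and every name in it (toy_command_buffer, toy_record, toy_finalize, toy_enqueue, scale_by_two, add_one) is hypothetical. In the real extension the corresponding steps go through entry-points such as clFinalizeCommandBufferKHR and clEnqueueCommandBufferKHR.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names below are hypothetical, not OpenCL. */
#define TOY_MAX_COMMANDS 8

/* A "command" is modelled as a function applied to an input array. */
typedef void (*toy_command_fn)(int *data, size_t n);

typedef struct {
    toy_command_fn commands[TOY_MAX_COMMANDS]; /* recorded sequence */
    size_t count;                              /* commands recorded so far */
    int finalized;                             /* nonzero once recording is closed */
} toy_command_buffer;

/* Two example commands standing in for kernels. */
void scale_by_two(int *data, size_t n) {
    for (size_t i = 0; i < n; ++i) data[i] *= 2;
}

void add_one(int *data, size_t n) {
    for (size_t i = 0; i < n; ++i) data[i] += 1;
}

/* Record a command. Fails (-1) after finalization or when the buffer is full. */
int toy_record(toy_command_buffer *cb, toy_command_fn fn) {
    assert(cb != NULL && fn != NULL);
    if (cb->finalized || cb->count == TOY_MAX_COMMANDS) return -1;
    cb->commands[cb->count++] = fn;
    return 0;
}

/* Close recording. A real implementation could optimize the sequence here. */
int toy_finalize(toy_command_buffer *cb) {
    assert(cb != NULL);
    if (cb->finalized) return -1;
    cb->finalized = 1;
    return 0;
}

/* Dispatch: replay the recorded commands in order on a new input.
   Fails (-1) if the buffer has not been finalized. */
int toy_enqueue(const toy_command_buffer *cb, int *data, size_t n) {
    assert(cb != NULL && data != NULL);
    if (!cb->finalized) return -1;
    for (size_t i = 0; i < cb->count; ++i) cb->commands[i](data, n);
    return 0;
}
```

As a usage example, recording scale_by_two followed by add_one into a zero-initialized toy_command_buffer, finalizing it, and then enqueuing it on {1, 2, 3} and on {10, 20} turns the arrays into {3, 5, 7} and {21, 41}. The command sequence is built once and reused for both inputs, which mirrors the computer vision case of one pipeline applied to different images.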

