OpenCL Command-buffer Extension: Design and Implementation
Ewan W. Crawford, J. Frankland
International Workshop on OpenCL, 2022-05-10
DOI: 10.1145/3529538.3529979 (https://doi.org/10.1145/3529538.3529979)
OpenCL allows a programmer to offload a sequence of commands to a heterogeneous accelerator, such as a GPU. For embedded devices the overhead of building a command sequence can be expensive, and many applications require the same pipeline of commands to be repeatedly enqueued in a loop, for example in computer vision, where the same command sequence is used to process different image inputs. In OpenCL, command recording is tied to submission: a clEnqueueCommand API invocation both creates a command and schedules it for execution, so for groups of commands enqueued in a loop the cost of building the command sequence is incurred unnecessarily on every iteration. An alternative OpenCL API mechanism for defining the command list would remove this overhead from repeated command sequences, regardless of the target OpenCL device. The cl_khr_command_buffer [2] extension, provisionally released in November 2021 as part of OpenCL 3.0.10, provides such a solution. This extension introduces the concept of a command-buffer, which is recorded once with a graph of commands, finalized for submission, and then dispatched for execution many times. Separating command setup from dispatch means that for repetitive workloads the command recording overheads are incurred only once. Finalization, after which no more commands can be recorded and the command-buffer is made ready for execution, also introduces optimization opportunities. After finalization, the command-buffer can be dispatched asynchronously with minimal runtime overhead. This separation of concerns enables the pipelined workflows common in machine learning applications by eliminating the latency of waiting on the host to construct commands again for a similar workload.
In the first half of this technical presentation, we give an overview of the provisionally ratified command-buffer extension and dive into key points of its design. This includes a comparison with the Vulkan command-buffer abstraction [4], which shows that this approach is successful in the real world. We also cover the design decision to introduce new entry-points rather than reuse existing command-queue entry-points with begin/end markers, and explain why mechanisms for host-side synchronization were omitted from the new entry-points. Another topic is the intended layering of future extensions on top of cl_khr_command_buffer, and why it was decided to split the functionality this way: cl_khr_command_buffer is designed as the base layer that is applicable to a wide range of vendors. Plans for the upcoming extensions layered on top are outlined in broad terms; these remove the restriction tying a command-buffer to a single command-queue and provide mutability of the command-buffer between submissions.
The second half of the presentation relays our experience implementing the command-buffer extension in ComputeAorta [1], Codeplay’s OpenCL implementation, and how this fed back into the extension specification, for example when implementing the simultaneous-use capability that allows more than one submission of a command-buffer instance to be in flight at once. We provide a high-level overview of how command-buffers in ComputeAorta are implemented using the same machinery as regular command enqueues via Codeplay’s proprietary ComputeMux API, and detail some of the common pitfalls and gotchas a vendor may face when implementing command-buffers vs. regular OpenCL commands.
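The record-once, finalize, dispatch-many lifecycle described in the abstract can be illustrated with a toy model. The sketch below is plain Python and is not the OpenCL API or the authors' implementation: `ToyCommandBuffer`, its `record`, `finalize`, and `enqueue` methods, and the `scale` and `offset` commands are all hypothetical names chosen for illustration. It only shows that command construction happens once while dispatch is repeated over different inputs, and that recording is rejected after finalization.

```python
class ToyCommandBuffer:
    """Hypothetical stand-in for a command-buffer; not the OpenCL API."""

    def __init__(self):
        self._commands = []
        self._finalized = False
        self.commands_built = 0  # counts how often a command was constructed

    def record(self, command):
        # Recording is only legal before finalization.
        if self._finalized:
            raise RuntimeError("cannot record into a finalized command-buffer")
        self._commands.append(command)
        self.commands_built += 1

    def finalize(self):
        # After this point the command list is frozen and ready for dispatch.
        self._commands = tuple(self._commands)
        self._finalized = True

    def enqueue(self, buffer):
        # Dispatch replays the recorded commands without rebuilding them.
        if not self._finalized:
            raise RuntimeError("command-buffer must be finalized before dispatch")
        for command in self._commands:
            command(buffer)


def scale(buffer):
    """Toy 'kernel': double every element in place."""
    for i, value in enumerate(buffer):
        buffer[i] = value * 2


def offset(buffer):
    """Toy 'kernel': add one to every element in place."""
    for i, value in enumerate(buffer):
        buffer[i] = value + 1


def run_pipeline(frames):
    """Record the two-command pipeline once, then replay it per frame."""
    command_buffer = ToyCommandBuffer()
    command_buffer.record(scale)
    command_buffer.record(offset)
    command_buffer.finalize()
    for frame in frames:
        command_buffer.enqueue(frame)
    return command_buffer
```

In this model, processing three frames constructs two commands in total, whereas building the commands inside the loop would construct six. A real implementation would additionally handle dependencies between commands, asynchronous execution, and device-side resources, none of which this sketch attempts.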