Executing Graphs with OpenCL

Erik Tomusk
DOI: 10.1145/3456669.3456681
Published in: International Workshop on OpenCL, 27 April 2021

Abstract

For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries have made graph programming accessible to even novice programmers. In the past, a beginner programmer might have talked about writing a number-guessing game; today, that programmer will describe training an off-the-shelf neural network (a type of graph) for handwriting recognition. There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing larger and increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these large, heterogeneous accelerators. Does OpenCL have the expressive power required to describe an execution graph to accelerator hardware, or does OpenCL serialize graphs and execute them sequentially? This technical presentation argues that it is the former: OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and sufficiently powerful to execute that graph on a graph accelerator.

The OpenCL API is designed around the concept of the user enqueuing commands onto the front of a command-queue. Commands include executing kernels (i.e., functions) and reading, writing, and copying data buffers. The OpenCL device driver removes commands from the back of the command-queue, sets up data transfers to and from the accelerator device, and schedules kernels to execute on the device.
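As a rough illustration of this producer/consumer arrangement, the following pure-Python sketch models the user enqueuing commands and the driver draining them. This is a toy model, not the OpenCL C API; the class `InOrderQueue` and the command names are invented for illustration.

```python
from collections import deque

class InOrderQueue:
    """Toy model of an in-order OpenCL command-queue: the user enqueues
    commands at the front; the 'driver' pops them from the back and
    executes them in enqueue order."""

    def __init__(self):
        self._commands = deque()

    def enqueue(self, name, fn):
        # User side: a command is a kernel launch or a buffer read/write/copy.
        self._commands.append((name, fn))

    def flush(self):
        # Driver side: drain the queue; effects must be as-if in enqueue order.
        executed = []
        while self._commands:
            name, fn = self._commands.popleft()
            fn()
            executed.append(name)
        return executed

buffers = {}

def write_buffer():
    buffers["in"] = [1, 2, 3]

def run_kernel():
    # 'Kernel' with a data dependency on the write above.
    buffers["out"] = [x * 2 for x in buffers["in"]]

def read_buffer():
    buffers["host"] = buffers["out"]

q = InOrderQueue()
q.enqueue("write_buffer", write_buffer)
q.enqueue("run_kernel", run_kernel)
q.enqueue("read_buffer", read_buffer)
executed = q.flush()
print(executed)          # ['write_buffer', 'run_kernel', 'read_buffer']
print(buffers["host"])   # [2, 4, 6]
```

In a real driver, `flush` would overlap data transfers and kernel execution; the model only captures the ordering contract.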
The command-queue abstraction can encode execution graphs in one of two ways, depending on whether the command-queue is in-order or out-of-order. An in-order command-queue guarantees that the effects of the enqueued commands will be as if the commands were executed in the order in which they were enqueued. However, the OpenCL device driver is allowed to reorder commands, provided that reordering does not affect the output. For example, if two kernels have no data dependency between them, they can be executed in reverse order, or even in parallel if the driver and hardware support it. An out-of-order command-queue does not guarantee that commands will appear to have been executed in the order in which they were enqueued. Instead, it is the OpenCL API user's responsibility to attach events and event wait lists to commands. When a command finishes executing, it triggers its attached event, and when all the events in a command's event wait list have triggered, that command is allowed to execute. Both types of command-queue are capable of describing execution graphs: for in-order command-queues, the graph is implied by kernels' data dependencies; for out-of-order command-queues, the graph is explicitly defined with events.

By instrumenting Codeplay's ComputeAorta [2] OpenCL implementation, it is possible to record OpenCL API calls and to reconstruct the execution graph as seen by the OpenCL device driver. This presentation investigates the execution graphs generated by a simplified handwriting-recognition neural network implemented in TensorFlow [1] and running on top of OpenCL via SYCL. Training a neural network and using a neural network for inference produce substantially different execution graphs; both are considered. The graphs show that data dependencies, opportunities for executing kernels in parallel, and opportunities for reordering kernels are all visible to the driver.
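The event mechanics of an out-of-order queue can be sketched in plain Python. Again a toy model under stated assumptions, not the OpenCL API: `Event` stands in for `cl_event`, and `OutOfOrderQueue` and the command names are hypothetical.

```python
class Event:
    """Stands in for a cl_event: triggered when its command completes."""
    def __init__(self):
        self.triggered = False

class OutOfOrderQueue:
    """Toy model of out-of-order command-queue semantics: a command may
    run once every event in its wait list has triggered."""

    def __init__(self):
        self._pending = []

    def enqueue(self, name, fn, wait_list=()):
        ev = Event()
        self._pending.append((name, fn, tuple(wait_list), ev))
        return ev

    def flush(self):
        order = []
        while self._pending:
            # All commands whose wait lists are satisfied are ready; any of
            # them could run next (or in parallel on real hardware).
            ready = [c for c in self._pending if all(e.triggered for e in c[2])]
            assert ready, "deadlock: circular event dependency"
            cmd = ready[0]
            self._pending.remove(cmd)
            name, fn, _, ev = cmd
            fn()
            ev.triggered = True
            order.append(name)
        return order

# A diamond-shaped graph: one write feeds two independent kernels,
# whose results are combined by a final command.
q = OutOfOrderQueue()
w = q.enqueue("write", lambda: None)
a = q.enqueue("kernel_a", lambda: None, wait_list=[w])
b = q.enqueue("kernel_b", lambda: None, wait_list=[w])
q.enqueue("reduce", lambda: None, wait_list=[a, b])
order = q.flush()
print(order)  # 'write' first, 'reduce' last; kernel_a/kernel_b are unordered
```

The point of the sketch is that after `write` completes, both `kernel_a` and `kernel_b` are in the ready set; that ready set is exactly the reordering and parallelism opportunity the events expose to the driver.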
It is therefore possible for an OpenCL device driver to schedule work to a hardware accelerator that has been designed for graph execution. It is important to note that OpenCL makes it possible to expose an execution graph to a device driver, but OpenCL cannot guarantee that OpenCL API calls will form a meaningful graph. For example, if a user places many independent data arrays into one memory buffer and enqueues kernels that all operate on that single buffer, then information about the execution graph is hidden from OpenCL, and opportunities for parallel execution and kernel reordering are lost. Often, application developers do not write OpenCL code directly, but use libraries that have OpenCL backends. Consequently, it is the responsibility of library developers to ensure that the graph an application intends to execute is represented correctly at the OpenCL level.
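The buffer-packing pitfall can be illustrated with a toy dependency-inference helper. The function `dependency_edges` is hypothetical and deliberately simplified: it orders any two kernels that touch the same buffer, whereas a real driver also considers access flags and offsets.

```python
def dependency_edges(commands):
    """Infer ordering edges between kernels from shared buffers (toy model:
    any two kernels touching the same buffer must be ordered).

    commands: list of (kernel_name, set_of_buffers_touched), in enqueue order.
    """
    edges = set()
    for i, (name_i, bufs_i) in enumerate(commands):
        for name_j, bufs_j in commands[i + 1:]:
            if bufs_i & bufs_j:
                edges.add((name_i, name_j))
    return edges

# Separate buffers: the two kernels are visibly independent.
separate = [("kernel_a", {"buf_a"}), ("kernel_b", {"buf_b"})]
# Independent arrays packed into one buffer: the driver must assume a
# dependency, so the kernels are serialized.
packed = [("kernel_a", {"buf_all"}), ("kernel_b", {"buf_all"})]

print(dependency_edges(separate))  # set() -> parallel execution possible
print(dependency_edges(packed))    # {('kernel_a', 'kernel_b')} -> serialized
```

In the packed case the graph known to the application (two independent kernels) never reaches the driver, which is precisely why the abstract places the burden on library developers.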