Towards performance portability of AI models using SYCL-DNN

Muhammad Tanvir, Kumudha Narasimhan, M. Goli, Ouadie El Farouki, S. Georgiev, Isaac Ault
{"title":"使用SYCL-DNN实现AI模型的性能可移植性","authors":"Muhammad Tanvir, Kumudha Narasimhan, M. Goli, Ouadie El Farouki, S. Georgiev, Isaac Ault","doi":"10.1145/3529538.3529999","DOIUrl":null,"url":null,"abstract":"The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While the proposed ONNX as a de facto for AI model description, provides the portability of AI models across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. Several existing AI frameworks such as Tensorflow, Pytorch, ONNXRuntime provides performance portability via a dedicated backend implementations per hardware architecture. While such approach provides wider support of hardware devices, maintainability and readability remains challenging. There are many libraries and frameworks which were developed to support neural network models and we discuss some of the important ones in this section. Frameworks like Glow [18], nGraph [14] and Tensor Comprehensions [19] use a compiler-based approach to accept the neural network model and emit optimised code for a specific hardware. The neural network model is lowered into one or more intermediate representations before generating an optimised kernel. These frameworks, target a specific set of backends and targeting any new hardware requires implementing a considerable fraction of the operators. Other frameworks like Caffe [16], PyTorch [17] and TinyNN [10] provide runtime solution by integrating various vendor specific libraries or graph as a backend to support neural network models on different set of architectures. Framework like TensorFlow [11], rely on calling vendor-specific libraries or graph compilers. While embedding vendor-specific library can lead to achieving near metal performance, it can make adding and maintaining different backends quite tedious. Intel oneMKL [4] and oneDNN [7] are the optimized libraries for linear algebra subroutine and deep neural network routines for multi-core and manycore Intel systems. Recently, oneMKL and oneDNN have added support for running on Nvidia GPUs as well [15] via SYCL interoperability with third party libraries. This approach integrates the existing vendor optimised backend in SYCL to provide a unique SYCL-interface for memory management and runtime control from the user point of view while reusing the highly optimised vendor backend. ARM Compute Library [1], cuBLAS [6] and cuDNN [13], MIOpen [5] provides optimised routines for linear algebra and machine learning for ARM, Nvidia and AMD respectively. All these libraries are optimised for specific architectures, and very rarely provide portability. SYCL provides a C++-based portable parallel programming model to target various devices like CPUs, GPUs, DSPs, FPGAs, etc. SYCL programming model allows the developers to write highly parametrized kernels for a diverse hardware set in a unified setting. These kernels can then be tuned for the specified hardware. Hence, enabling SYCL backend for an AI framework (e.g. Tensorflow, Pytorch etc.) can lead to a hardware agnostic model for heterogeneous systems and also allow to reuse the existing optimized library implementations. 
Libraries like SYCL-BLAS [8] and SYCL-DNN [9] are open source and are a part of the SYCL eco-system. They can be compiled with any SYCL compiler such as ComputeCPP [2] or DPC++ [3] and run on any SYCL-enabled devices. ComputeCPP also supports SYCL RISC-V architecture [12]. Thus, making the applications using these libraries sufficiently portable. The SYCL kernels implemented in SYCL-DNN and SYCL-BLAS allow for tuning parameters such as cache size, work group size and local memory size based on the hardware we execute on. This helps reuse the existing kernels but still provide good performance on new hardware by tuning for these finer details. SYCL-DNN already supports OpenCL backend and in this paper we extend SYCL-DNN to support Nvidia and RISC-V architectures. Figure 1 shows the NN operation mapping. The results provide a detailed analysis of the performance portability of SYCL based AI frameworks on various architectures with respect to state-of-the-art optimized vendor specific libraries. The performance of SYCL-DNN is in-par on existing OpenCL backends as compared to device specific optimized libraries. We run the VGG model to understand and compare performance (Table 1). In case of Intel GPU - HD Graphics 530, SYCL-DNN provides 80% the performance of optimized oneDNN execution provider. The performance gap in this case is because of the extra graph optimizations that oneDNN performs. For Intel CPU, we used ComputeCPP 2.4 as the SYCL compiler and the latest oneDNN from github repository. We observe that SYCL-DNN performs 19% slower than oneDNN. However, tuning the matmul operation in SYCL-DNN provides a considerable speedup and SYCL-DNN performance 37% better than oneDNN. We extend SYCL-DNN to support DPC++ as one of the SYCL compilers. DPC++ provides support for CUDA backend, there by enabling running SYCL kernels on Nvidia devices. We compare the performance with optimized cuDNN libraries. We used the latest DPC++ as the SYCL compiler and cuDNN version 7.6.5. We see that untuned SYCL-DNN is almost 50% slower than cuDNN. This is because the matmul implementation in SYCL-DNN does not optimize for local memory. Further tuning and using the optimized SYCL-BLAS implementation of matmul improves the performance and we observe that SYCL-DNN comes within 90% of the performance of cuDNN. cuDNN has hand-written optimized implementation of some of the routines and hence achieves 10% more performance then SYC-DNN but, the code written in cuDNN cannot be reused on any other hardware. Further, there are no Execution providers / frameworks which provide full support for RISC-V architectures. By integrating SYCl-DNN with the Acoran compute stack, we are able to support generating RISC-V ISA. The Acoran compute stack uses ComputeCPP and ComputeAorta to enable running SYCL Kernels on RISC-V architectures. We run the VGG-16 model on the RISC-V Spike simulator. The current implementation of the simulator is single core and hence VGG-16 takes 312 seconds and ResNet-50 takes 198 seconds to complete execution. In case of VGG-16, the simulator requires 16532358513 cycles to finish execution. 
We are enabling SYCL backend for ONNXRuntime as a future work to exploit the benefit of ONXX model loader to load the ONXX model from different AI framework and also benefit from ONXX runtime graph optimisation.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Towards performance portability of AI models using SYCL-DNN\",\"authors\":\"Muhammad Tanvir, Kumudha Narasimhan, M. Goli, Ouadie El Farouki, S. Georgiev, Isaac Ault\",\"doi\":\"10.1145/3529538.3529999\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While the proposed ONNX as a de facto for AI model description, provides the portability of AI models across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. Several existing AI frameworks such as Tensorflow, Pytorch, ONNXRuntime provides performance portability via a dedicated backend implementations per hardware architecture. While such approach provides wider support of hardware devices, maintainability and readability remains challenging. There are many libraries and frameworks which were developed to support neural network models and we discuss some of the important ones in this section. Frameworks like Glow [18], nGraph [14] and Tensor Comprehensions [19] use a compiler-based approach to accept the neural network model and emit optimised code for a specific hardware. The neural network model is lowered into one or more intermediate representations before generating an optimised kernel. These frameworks, target a specific set of backends and targeting any new hardware requires implementing a considerable fraction of the operators. Other frameworks like Caffe [16], PyTorch [17] and TinyNN [10] provide runtime solution by integrating various vendor specific libraries or graph as a backend to support neural network models on different set of architectures. Framework like TensorFlow [11], rely on calling vendor-specific libraries or graph compilers. While embedding vendor-specific library can lead to achieving near metal performance, it can make adding and maintaining different backends quite tedious. Intel oneMKL [4] and oneDNN [7] are the optimized libraries for linear algebra subroutine and deep neural network routines for multi-core and manycore Intel systems. Recently, oneMKL and oneDNN have added support for running on Nvidia GPUs as well [15] via SYCL interoperability with third party libraries. This approach integrates the existing vendor optimised backend in SYCL to provide a unique SYCL-interface for memory management and runtime control from the user point of view while reusing the highly optimised vendor backend. ARM Compute Library [1], cuBLAS [6] and cuDNN [13], MIOpen [5] provides optimised routines for linear algebra and machine learning for ARM, Nvidia and AMD respectively. All these libraries are optimised for specific architectures, and very rarely provide portability. 
SYCL provides a C++-based portable parallel programming model to target various devices like CPUs, GPUs, DSPs, FPGAs, etc. SYCL programming model allows the developers to write highly parametrized kernels for a diverse hardware set in a unified setting. These kernels can then be tuned for the specified hardware. Hence, enabling SYCL backend for an AI framework (e.g. Tensorflow, Pytorch etc.) can lead to a hardware agnostic model for heterogeneous systems and also allow to reuse the existing optimized library implementations. Libraries like SYCL-BLAS [8] and SYCL-DNN [9] are open source and are a part of the SYCL eco-system. They can be compiled with any SYCL compiler such as ComputeCPP [2] or DPC++ [3] and run on any SYCL-enabled devices. ComputeCPP also supports SYCL RISC-V architecture [12]. Thus, making the applications using these libraries sufficiently portable. The SYCL kernels implemented in SYCL-DNN and SYCL-BLAS allow for tuning parameters such as cache size, work group size and local memory size based on the hardware we execute on. This helps reuse the existing kernels but still provide good performance on new hardware by tuning for these finer details. SYCL-DNN already supports OpenCL backend and in this paper we extend SYCL-DNN to support Nvidia and RISC-V architectures. Figure 1 shows the NN operation mapping. The results provide a detailed analysis of the performance portability of SYCL based AI frameworks on various architectures with respect to state-of-the-art optimized vendor specific libraries. The performance of SYCL-DNN is in-par on existing OpenCL backends as compared to device specific optimized libraries. We run the VGG model to understand and compare performance (Table 1). In case of Intel GPU - HD Graphics 530, SYCL-DNN provides 80% the performance of optimized oneDNN execution provider. The performance gap in this case is because of the extra graph optimizations that oneDNN performs. For Intel CPU, we used ComputeCPP 2.4 as the SYCL compiler and the latest oneDNN from github repository. We observe that SYCL-DNN performs 19% slower than oneDNN. However, tuning the matmul operation in SYCL-DNN provides a considerable speedup and SYCL-DNN performance 37% better than oneDNN. We extend SYCL-DNN to support DPC++ as one of the SYCL compilers. DPC++ provides support for CUDA backend, there by enabling running SYCL kernels on Nvidia devices. We compare the performance with optimized cuDNN libraries. We used the latest DPC++ as the SYCL compiler and cuDNN version 7.6.5. We see that untuned SYCL-DNN is almost 50% slower than cuDNN. This is because the matmul implementation in SYCL-DNN does not optimize for local memory. Further tuning and using the optimized SYCL-BLAS implementation of matmul improves the performance and we observe that SYCL-DNN comes within 90% of the performance of cuDNN. cuDNN has hand-written optimized implementation of some of the routines and hence achieves 10% more performance then SYC-DNN but, the code written in cuDNN cannot be reused on any other hardware. Further, there are no Execution providers / frameworks which provide full support for RISC-V architectures. By integrating SYCl-DNN with the Acoran compute stack, we are able to support generating RISC-V ISA. The Acoran compute stack uses ComputeCPP and ComputeAorta to enable running SYCL Kernels on RISC-V architectures. We run the VGG-16 model on the RISC-V Spike simulator. 
The current implementation of the simulator is single core and hence VGG-16 takes 312 seconds and ResNet-50 takes 198 seconds to complete execution. In case of VGG-16, the simulator requires 16532358513 cycles to finish execution. We are enabling SYCL backend for ONNXRuntime as a future work to exploit the benefit of ONXX model loader to load the ONXX model from different AI framework and also benefit from ONXX runtime graph optimisation.\",\"PeriodicalId\":73497,\"journal\":{\"name\":\"International Workshop on OpenCL\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Workshop on OpenCL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3529538.3529999\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3529538.3529999","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from edge devices to the cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While ONNX, proposed as a de facto standard for AI model description, provides portability of AI models across various AI frameworks, supporting DNN models on diverse hardware architectures remains challenging. Several existing AI frameworks, such as TensorFlow, PyTorch and ONNX Runtime, provide performance portability via dedicated backend implementations per hardware architecture. While this approach gives wider hardware coverage, maintainability and readability remain challenging.

Many libraries and frameworks have been developed to support neural network models, and we discuss some of the important ones here. Frameworks like Glow [18], nGraph [14] and Tensor Comprehensions [19] use a compiler-based approach that accepts a neural network model and emits optimised code for specific hardware; the model is lowered into one or more intermediate representations before an optimised kernel is generated. These frameworks target a specific set of backends, and supporting any new hardware requires reimplementing a considerable fraction of the operators. Other frameworks, like Caffe [16], PyTorch [17] and TinyNN [10], provide a runtime solution by integrating various vendor-specific libraries or graph compilers as backends to support neural network models on different architectures. Frameworks like TensorFlow [11] rely on calling vendor-specific libraries or graph compilers. While embedding a vendor-specific library can deliver near-metal performance, it makes adding and maintaining different backends quite tedious. Intel oneMKL [4] and oneDNN [7] are optimized libraries of linear algebra subroutines and deep neural network routines for multi-core and many-core Intel systems. Recently, oneMKL and oneDNN have also added support for running on Nvidia GPUs [15] via SYCL interoperability with third-party libraries; this approach wraps the existing vendor-optimised backend in SYCL, providing a single SYCL interface for memory management and runtime control from the user's point of view while reusing the highly optimised vendor backend. The ARM Compute Library [1], cuBLAS [6] and cuDNN [13], and MIOpen [5] provide optimised linear algebra and machine learning routines for ARM, Nvidia and AMD hardware, respectively. All of these libraries are optimised for specific architectures and very rarely provide portability.

SYCL provides a C++-based portable parallel programming model that targets various devices such as CPUs, GPUs, DSPs and FPGAs. The SYCL programming model allows developers to write highly parametrized kernels for a diverse set of hardware in a unified setting; these kernels can then be tuned for the specific hardware. Hence, enabling a SYCL backend for an AI framework (e.g. TensorFlow or PyTorch) can lead to a hardware-agnostic model for heterogeneous systems while reusing existing optimized library implementations. Libraries like SYCL-BLAS [8] and SYCL-DNN [9] are open source and part of the SYCL ecosystem. They can be compiled with any SYCL compiler, such as ComputeCPP [2] or DPC++ [3], and run on any SYCL-enabled device; ComputeCPP also supports the RISC-V architecture through SYCL [12]. This makes applications built on these libraries broadly portable.
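As an illustration of this "single kernel source, per-device tuning" idea, the sketch below shows a simple element-wise ReLU written in SYCL 2020 style. It is not code taken from SYCL-DNN or SYCL-BLAS; the work-group-size constant is a hypothetical stand-in for the kind of tuning parameter those libraries expose per device.

#include <sycl/sycl.hpp>
#include <vector>

// Hypothetical tuning knob: real libraries choose this per device
// (from a tuning database or benchmarking), not at compile time.
constexpr size_t kWorkGroupSize = 128;

int main() {
  constexpr size_t n = 1 << 20;
  std::vector<float> in(n, -1.0f), out(n, 0.0f);

  sycl::queue q;  // default selector: CPU, GPU, accelerator, ...
  {
    sycl::buffer<float, 1> bufIn(in.data(), sycl::range<1>(n));
    sycl::buffer<float, 1> bufOut(out.data(), sycl::range<1>(n));

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor x{bufIn, cgh, sycl::read_only};
      sycl::accessor y{bufOut, cgh, sycl::write_only, sycl::no_init};
      // The same kernel source runs on any SYCL device; only the
      // nd_range / work-group size is re-tuned per target.
      cgh.parallel_for(
          sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(kWorkGroupSize)),
          [=](sycl::nd_item<1> it) {
            const size_t i = it.get_global_id(0);
            y[i] = x[i] > 0.0f ? x[i] : 0.0f;  // ReLU
          });
    });
  }  // buffer destructors copy results back to the host vectors
  return 0;
}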
The SYCL kernels implemented in SYCL-DNN and SYCL-BLAS allow tuning of parameters such as cache size, work-group size and local-memory size based on the hardware we execute on. This helps reuse the existing kernels while still providing good performance on new hardware by tuning these finer details. SYCL-DNN already supports an OpenCL backend, and in this paper we extend SYCL-DNN to support Nvidia and RISC-V architectures. Figure 1 shows the mapping of neural network operations. The results provide a detailed analysis of the performance portability of SYCL-based AI frameworks on various architectures with respect to state-of-the-art optimized vendor-specific libraries. On existing OpenCL backends, the performance of SYCL-DNN is on par with device-specific optimized libraries.

We run the VGG model to understand and compare performance (Table 1). On the Intel GPU (HD Graphics 530), SYCL-DNN delivers 80% of the performance of the optimized oneDNN execution provider; the gap is due to the extra graph optimizations that oneDNN performs. For the Intel CPU, we used ComputeCPP 2.4 as the SYCL compiler and the latest oneDNN from the GitHub repository; here SYCL-DNN runs 19% slower than oneDNN. However, tuning the matmul operation in SYCL-DNN provides a considerable speedup, after which SYCL-DNN performs 37% better than oneDNN.

We also extend SYCL-DNN to support DPC++ as one of its SYCL compilers. DPC++ provides a CUDA backend, thereby enabling SYCL kernels to run on Nvidia devices, and we compare against the optimized cuDNN library, using the latest DPC++ as the SYCL compiler and cuDNN version 7.6.5. Untuned SYCL-DNN is almost 50% slower than cuDNN, because the matmul implementation in SYCL-DNN does not make use of local memory. Further tuning and using the optimized SYCL-BLAS implementation of matmul improves performance, and SYCL-DNN then comes within 90% of the performance of cuDNN. cuDNN has hand-written optimized implementations of some routines and hence achieves about 10% higher performance than SYCL-DNN, but code written against cuDNN cannot be reused on any other hardware.

Further, no existing execution providers or frameworks provide full support for RISC-V architectures. By integrating SYCL-DNN with the Acoran compute stack, which uses ComputeCPP and ComputeAorta to run SYCL kernels on RISC-V, we are able to generate the RISC-V ISA. We run the VGG-16 model on the RISC-V Spike simulator. The current implementation of the simulator is single-core, so VGG-16 takes 312 seconds and ResNet-50 takes 198 seconds to complete execution; VGG-16 requires 16,532,358,513 cycles to finish.

As future work, we are enabling a SYCL backend for ONNX Runtime, to exploit the ONNX model loader for loading models from different AI frameworks and to benefit from ONNX Runtime's graph optimisations.
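To make concrete what tuning the matmul for local memory means in the comparison above, the sketch below shows a tiled matrix multiplication in SYCL 2020 style that stages sub-blocks of the operands in work-group local memory. It is an illustration of the general technique, not the actual SYCL-BLAS/SYCL-DNN kernel; the fixed tile size, the square-matrix assumption and the SYCL 2020 local_accessor API are simplifications made for the example.

#include <sycl/sycl.hpp>
#include <vector>

// Illustrative tile size; SYCL-BLAS/SYCL-DNN expose this kind of parameter
// for per-device tuning rather than fixing it at compile time.
constexpr size_t TILE = 16;

// C = A * B for square N x N row-major matrices, N a multiple of TILE.
void matmul_tiled(sycl::queue& q, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C,
                  size_t N) {
  sycl::buffer<float, 2> bufA(A.data(), sycl::range<2>(N, N));
  sycl::buffer<float, 2> bufB(B.data(), sycl::range<2>(N, N));
  sycl::buffer<float, 2> bufC(C.data(), sycl::range<2>(N, N));

  q.submit([&](sycl::handler& cgh) {
    sycl::accessor a{bufA, cgh, sycl::read_only};
    sycl::accessor b{bufB, cgh, sycl::read_only};
    sycl::accessor c{bufC, cgh, sycl::write_only, sycl::no_init};

    // Work-group local (on-chip) memory holding one tile of A and one of B.
    sycl::local_accessor<float, 2> tileA(sycl::range<2>(TILE, TILE), cgh);
    sycl::local_accessor<float, 2> tileB(sycl::range<2>(TILE, TILE), cgh);

    cgh.parallel_for(
        sycl::nd_range<2>(sycl::range<2>(N, N), sycl::range<2>(TILE, TILE)),
        [=](sycl::nd_item<2> it) {
          const size_t row = it.get_global_id(0);
          const size_t col = it.get_global_id(1);
          const size_t lr = it.get_local_id(0);
          const size_t lc = it.get_local_id(1);

          float acc = 0.0f;
          for (size_t t = 0; t < N / TILE; ++t) {
            // Each work-item stages one element of each tile so that the
            // inner product below reuses data already in local memory.
            tileA[lr][lc] = a[row][t * TILE + lc];
            tileB[lr][lc] = b[t * TILE + lr][col];
            sycl::group_barrier(it.get_group());

            for (size_t k = 0; k < TILE; ++k)
              acc += tileA[lr][k] * tileB[k][lc];
            sycl::group_barrier(it.get_group());
          }
          c[row][col] = acc;
        });
  });
  q.wait();
}

Staging each tile once reduces global-memory traffic per output element by roughly a factor of TILE, which is the kind of data reuse the untuned matmul was missing on the CUDA backend.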