
International Workshop on OpenCL: Latest Publications

Optimization of Fast Fourier Transform for Qualcomm Adreno Graphics Processing Unit
Pub Date : 2024-04-08 DOI: 10.1145/3648115.3648119
Skyler Szot, Hongqiang Wang, Alexander Angus
Citations: 0
Experiences with implementing Kokkos' SYCL backend
Pub Date : 2024-04-08 DOI: 10.1145/3648115.3648118
Daniel Arndt, Damien Lebrun-Grandié, Christian Trott
Citations: 0
A source-to-source CUDA to SYCL code migration tool: Intel® DPC++ Compatibility Tool
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529562
Zhiming Wang, Yury Plyakhin, Chenwei Sun, Ziran Zhang, Z. Jiang, Andy Huang, Hao Wang
oneAPI [1] is an industry initiative creating an open, standards-based, cross-architecture programming model to simplify development for a wide range of data-centric workloads across a variety of architectures, including CPUs, GPUs, FPGAs, and other accelerators. It includes a cross-architecture compiler, Data Parallel C++ (DPC++) [2], supporting ISO C++ and Khronos Group's SYCL [3], along with advanced libraries. Intel has created a product implementation of oneAPI with the Intel oneAPI Toolkits, which help developers efficiently build, analyze, and optimize high-performance, cross-architecture applications for CPUs, GPUs, and FPGAs. SYCL [3] is an open standard from Khronos for a portable, architecture-neutral language for expressing parallelism; the SYCL specification can be implemented by anybody for any platform. To take advantage of oneAPI and SYCL, developers with applications written in another language, e.g., CUDA, seek to migrate their existing code to SYCL. Once a customer migrates their code to SYCL, they are no longer tied to a single platform and can run the code on any platform that has SYCL compiler support. The Intel® DPC++ Compatibility Tool, included in the Intel® oneAPI Base Toolkit, assists developers with source-to-source migration, e.g., migrating code written in CUDA to SYCL [3] so that it can run on multiple platforms. The tool generates human-readable and maintainable code whenever possible and provides inline comments to help developers complete their code. Typically, about 90-95% of the CUDA code in an application can be migrated by this tool; completing and verifying the final code is expected to be a manual process done by the developers. The goal of the compatibility tool is to make it as easy as possible for developers to migrate their existing CUDA codebase to SYCL, facilitating more hardware choices and access to the advantages of oneAPI and SYCL. The compatibility tool is based on LLVM/Clang [4]. It mainly contains three functional components:
• The intercept-build tool: collects the compilation options of the user's input project (build options, macro definitions, include folders, and so on) by intercepting its build process. During source-to-source migration, those compilation options are used to identify the active code paths and header-file dependencies so that a correct abstract syntax tree can be built for the input project.
• The 'dpct' binary tool: the main migration tool, which performs source-to-source migration based on compiler front-end technology. It implements a set of migration rules to map source-language elements such as types, APIs, and macros to functionally compatible elements in the target language. Where C/C++ code is identical in the source and target languages, the tool keeps that code unchanged. The tool also lets users define their own migration rules in a migration-rule description file to guide a customized migration.
• The helper header library: provides helper functions and macros to aid the migration of the input source code. These headers, written in C/C++/SYCL, are intended to become part of the migrated code generated by the compatibility tool; users can copy them and include them in the generated code as needed.
The compatibility tool helps developers migrate code written in CUDA to SYCL with appropriate performance while minimizing developer effort, and it enriches the oneAPI ecosystem by helping developers bring more applications to SYCL running on oneAPI. Intel technologies may require enabled hardware, software, or service activation. No product or component can be absolutely secure. Your costs and results may vary. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
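The rule-based migration that the 'dpct' tool performs can be illustrated with a toy sketch in Python. This is not the tool's actual rule set or output (real dpct, for instance, reverses the CUDA dimension order and emits its own helper namespaces); it only shows the idea of mapping source-language elements to functionally compatible target-language elements while passing plain C/C++ through unchanged:

```python
import re

# Toy migration rules (illustrative only, NOT dpct's real mapping):
# each rule maps a CUDA source element to a SYCL-style equivalent.
MIGRATION_RULES = [
    (r"\bthreadIdx\.x\b", "item.get_local_id(0)"),
    (r"\bblockIdx\.x\b", "item.get_group(0)"),
    (r"\bblockDim\.x\b", "item.get_local_range(0)"),
    (r"__syncthreads\(\)", "item.barrier()"),
]

def migrate(source: str) -> str:
    """Apply each rule in turn; unmatched C/C++ code passes through unchanged."""
    for pattern, replacement in MIGRATION_RULES:
        source = re.sub(pattern, replacement, source)
    return source

cuda_line = "int i = blockIdx.x * blockDim.x + threadIdx.x;"
migrated = migrate(cuda_line)
print(migrated)
# int i = item.get_group(0) * item.get_local_range(0) + item.get_local_id(0);
```

A real migration engine works on the abstract syntax tree rather than on text, which is why the compatibility tool needs the compilation options gathered by intercept-build; the textual rules above are only a conceptual stand-in.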
Citations: 5
Towards performance portability of AI models using SYCL-DNN
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529999
Muhammad Tanvir, Kumudha Narasimhan, M. Goli, Ouadie El Farouki, S. Georgiev, Isaac Ault
The wide adoption of Deep Neural Networks (DNNs) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from edge devices to the cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While ONNX, proposed as a de facto standard for AI model description, provides portability of AI models across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. Several existing AI frameworks, such as TensorFlow, PyTorch, and ONNX Runtime, provide performance portability via dedicated backend implementations per hardware architecture. While such an approach provides wider support of hardware devices, maintainability and readability remain challenging. Many libraries and frameworks have been developed to support neural network models, and we discuss some of the important ones in this section. Frameworks like Glow [18], nGraph [14], and Tensor Comprehensions [19] use a compiler-based approach to accept a neural network model and emit optimised code for specific hardware; the model is lowered into one or more intermediate representations before an optimised kernel is generated. These frameworks target a specific set of backends, and targeting any new hardware requires implementing a considerable fraction of the operators. Other frameworks, like Caffe [16], PyTorch [17], and TinyNN [10], provide a runtime solution, integrating various vendor-specific libraries or graphs as backends to support neural network models on different sets of architectures. Frameworks like TensorFlow [11] rely on calling vendor-specific libraries or graph compilers. While embedding a vendor-specific library can achieve near-metal performance, it can make adding and maintaining different backends quite tedious. Intel oneMKL [4] and oneDNN [7] are optimised libraries for linear algebra subroutines and deep neural network routines on multi-core and manycore Intel systems. Recently, oneMKL and oneDNN have added support for running on Nvidia GPUs as well [15], via SYCL interoperability with third-party libraries. This approach integrates the existing vendor-optimised backend in SYCL to provide a single SYCL interface for memory management and runtime control from the user's point of view, while reusing the highly optimised vendor backend. The ARM Compute Library [1], cuBLAS [6], cuDNN [13], and MIOpen [5] provide optimised routines for linear algebra and machine learning for ARM, Nvidia, and AMD, respectively. All these libraries are optimised for specific architectures and very rarely provide portability. SYCL provides a C++-based portable parallel programming model targeting various devices such as CPUs, GPUs, DSPs, and FPGAs. The SYCL programming model allows developers to write highly parametrized kernels for a diverse hardware set in a unified setting; these kernels can then be tuned for the specified hardware. Hence, enabling a SYCL backend for an AI framework (e.g., TensorFlow or PyTorch) can provide a hardware-agnostic model for heterogeneous systems while also allowing reuse of existing optimised library implementations. Libraries like SYCL-BLAS [8] and SYCL-DNN [9] are open source and part of the SYCL ecosystem. They can be compiled with any SYCL compiler (such as ComputeCpp [2] or DPC++ [3]) and run on any SYCL-enabled device; ComputeCpp also supports the SYCL RISC-V architecture [12], making applications that use these libraries broadly portable. The SYCL kernels implemented in SYCL-DNN and SYCL-BLAS expose tuning parameters such as cache size, work-group size, and local memory size, which can be adjusted to the hardware we execute on. This helps reuse existing kernels while still delivering good performance on new hardware by tuning these finer details. SYCL-DNN already supports an OpenCL backend; in this paper we extend SYCL-DNN to support the Nvidia and RISC-V architectures. Figure 1 shows the mapping of NN operations. The results provide a detailed analysis of the performance portability of a SYCL-based AI framework on various architectures with respect to state-of-the-art optimised vendor-specific libraries. Compared with device-specific optimised libraries, SYCL-DNN's performance on the existing OpenCL backend is moderate. We run VGG models to understand and compare performance (Table 1). On an Intel GPU (HD Graphics 530), SYCL-DNN provides 80% of the performance of the optimised oneDNN execution provider; the gap in this case is due to additional graph optimisations that oneDNN performs. For Intel CPUs, we use ComputeCpp 2.4 as the SYCL compiler and the latest oneDNN from its GitHub repository. We observe SYCL-DNN executing 19% slower than oneDNN; however, tuning the multi-threaded operations in SYCL-DNN provides a considerable speedup, with SYCL-DNN outperforming oneDNN by 37%. We extended SYCL-DNN to support DPC++ as one of the SYCL compilers; DPC++ provides CUDA backend support by running SYCL kernels on Nvidia devices. We compare performance against the optimised cuDNN library, using the latest DPC++ as the SYCL compiler and cuDNN version 7.6.5. Untuned SYCL-DNN is almost 50% slower than cuDNN, because the multi-threaded implementation in SYCL-DNN is not optimised for local memory. Further tuning, and use of the optimised SYCL-BLAS matmul implementation, improves performance: we observe SYCL-DNN within 90% of cuDNN. cuDNN has hand-written optimised implementations of some routines and hence outperforms SYCL-DNN by 10%; however, code written with cuDNN cannot be reused on any other hardware. Moreover, no execution provider or framework provides comprehensive support for the RISC-V architecture. By integrating SYCL-DNN with the Acoran compute stack, we are able to support generating the RISC-V ISA; the Acoran compute stack uses ComputeCpp and ComputeAorta to run SYCL kernels on RISC-V architectures. We ran the VGG-16 model on the RISC-V Spike simulator. The current simulator implementation is single-core, so VGG-16 takes 312 seconds and ResNet-50 takes 198 seconds to complete execution; in the case of VGG-16, the simulator requires 16,532,358,513 cycles. As future work, we are enabling a SYCL backend for ONNX Runtime, leveraging the ONNX model loader to load ONNX models from different AI frameworks and benefiting from ONNX Runtime graph optimisations.
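The per-device tuning the abstract describes (picking work-group or tile sizes empirically for each target) can be sketched generically. The following stdlib-only Python sketch is a hypothetical stand-in, not SYCL-DNN's tuning mechanism: it times a blocked matrix multiply over candidate block sizes and keeps the fastest, the same empirical-tuning idea at toy scale:

```python
import time

def blocked_matmul(A, B, n, block):
    """Blocked multiply of two n x n matrices (lists of lists).
    The block size is the tunable parameter, standing in for
    work-group / tile sizes in a real kernel."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += aik * B[k][j]
    return C

def autotune(n=64, candidates=(4, 8, 16, 32)):
    """Time each candidate block size on this machine and return the fastest,
    mirroring how a library might tune a parametrized kernel per device."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, n, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = block, elapsed
    return best
```

The winning block size depends on the machine running the sketch, which is precisely the point: one parametrized kernel, tuned per hardware instead of rewritten per hardware.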
Citations: 3
A Comparison of SYCL, OpenCL, CUDA, and OpenMP for Massively Parallel Support Vector Machine Classification on Multi-Vendor Hardware
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529980
Marcel Breyer, Alexander Van Craen, D. Pflüger
In scientific computing and Artificial Intelligence (AI), which both rely on massively parallel tasks, frameworks like the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL) are widely used to harvest the computational power of accelerator cards, in particular of Graphics Processing Units (GPUs). A few years ago, GPUs from NVIDIA were used almost exclusively for these tasks; meanwhile, AMD and Intel are increasing their shares of the GPU market. This introduces many new challenges for code development, as the prevailing CUDA code can only run on NVIDIA hardware and must be adapted or even completely rewritten to run on GPUs from AMD or Intel. In this paper, we compare the competing programming frameworks OpenMP, CUDA, OpenCL, and SYCL, paying special attention to the two SYCL implementations hipSYCL and DPC++. We investigate the frameworks with respect to their usability, performance, and performance portability on a variety of hardware platforms from different vendors, i.e., GPUs from NVIDIA, AMD, and Intel and Central Processing Units (CPUs) from AMD and Intel. Besides discussing the runtimes of these frameworks on the different hardware platforms, we also focus our comparison on the differences between the nd_range kernel formulation and the SYCL-specific hierarchical kernels. Our Parallel Least Squares Support Vector Machine (PLSSVM) library implements backends for the four previously mentioned programming frameworks for a Least Squares Support Vector Machine (LS-SVM). Using it as an example, we show which of the frameworks is best suited for a standard workload that is frequently employed in scientific computing and AI, depending on the target hardware: the most computationally intensive part of our PLSSVM library is solving a system of linear equations using the Conjugate Gradient (CG) method.
Specifically, we parallelize the implicit matrix-vector multiplication inside the CG method, a workload common in many scientific codes. The PLSSVM code, utility scripts, and documentation are all available on GitHub: https://github.com/SC-SGS/PLSSVM.
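The CG method at the core of this workload is a standard textbook algorithm. Below is a plain-Python sketch of generic CG on a small dense system, not the PLSSVM implementation (which performs the matrix-vector product implicitly and in parallel); it makes visible why the matrix-vector product is the step worth parallelizing:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A (list-of-lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual; x starts at zero, so r = b - A x = b
    p = r[:]          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        # The matrix-vector product dominates the cost per iteration --
        # this is the step PLSSVM parallelizes (and forms implicitly).
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Small SPD example: exact solution is x = [1/11, 7/11]
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

In exact arithmetic CG converges in at most n iterations; for the large implicit systems arising in LS-SVM training, the per-iteration matrix-vector product is where all four backend implementations spend their time.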
Citations: 9
Reaching even richer C++ in OpenCL kernels with use of libclcxx
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529540
Anastasia Stulova, Ishfaq Wardag
Citations: 0
Celerity: How (Well) Does the SYCL API Translate to Distributed Clusters?
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530004
Philip Salzmann, Fabian Knorr, Peter Thoman, Biagio Cosenza
Citations: 0
Interfacing SYCL and Python for XPU Programming
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529990
O. Pavlyk, Diptorup Deb
This paper introduces a new framework to help build and use SYCL-based Python native extensions. We present the core design and implementation details of the framework, including an overview of the API, a technique to support asynchronous SYCL kernel execution from Python, and a discussion of using Python extension generator tools to build SYCL-based extensions. We present details of ongoing work and demonstrate the development of a performance-portable Python native extension that relies on the SYCL-based oneMKL specification.
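The asynchronous submit-then-wait pattern mentioned in the abstract can be illustrated generically. The sketch below is a hypothetical stand-in using only the Python standard library, not the paper's actual API: a thread pool plays the role of a SYCL queue, and a future plays the role of a SYCL event:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(data):
    """Stand-in for device work submitted to a queue."""
    return [x * x for x in data]

executor = ThreadPoolExecutor()            # "queue"
event = executor.submit(kernel, [1, 2, 3, 4])  # enqueue; returns immediately
host_side = sum(range(10))                 # host work overlaps the "kernel"
result = event.result()                    # wait on the "event" for output
executor.shutdown()
```

The point of the pattern is that submission returns immediately, so host-side Python code can overlap with device execution and only synchronize when the result is actually needed.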
Citations: 0
Compiler-aided nd-range parallel-for implementations on CPU in hipSYCL
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530216
Joachim Meyer, Aksel Alpay, H. Fröning, V. Heuveline
With heterogeneous programming continuously on the rise, performance portability still needs improvement. SYCL provides the nd-range parallel-for paradigm for writing data-parallel kernels. This model allows barriers for group-local synchronization, similar to CUDA and OpenCL kernels. GPUs provide efficient means to model this, but on CPUs the necessary forward-progress guarantees require the use of many (lightweight) threads in library-only SYCL implementations, rendering the nd-range parallel-for unacceptably inefficient. By adopting two compiler-based approaches that solve this, the present work improves the performance of the nd-range parallel-for in hipSYCL on CPUs by up to multiple orders of magnitude across various CPU architectures. The two alternatives are compared with regard to their functional correctness and performance. With one of the variants upstreamed, hipSYCL is the first SYCL implementation to provide a well-performing nd-range parallel-for on CPU without requiring an available OpenCL runtime.
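Why a library-only CPU implementation needs one thread per work-item follows from barrier semantics: every work-item in a group must reach the barrier before any of them may proceed. A minimal Python illustration of this forward-progress requirement (a conceptual sketch, not hipSYCL code):

```python
import threading

GROUP_SIZE = 4
scratch = [0] * GROUP_SIZE    # stands in for group-local memory
results = [0] * GROUP_SIZE
barrier = threading.Barrier(GROUP_SIZE)

def work_item(lid):
    # Phase 1: each work-item writes its slot of "local memory".
    scratch[lid] = lid + 1
    # Group barrier: no work-item proceeds until ALL have written.
    # This is why each work-item needs its own schedulable thread here:
    # if two work-items shared one OS thread, the first would block at
    # the barrier and the second could never run -- a deadlock.
    barrier.wait()
    # Phase 2: each work-item reads data produced by the others.
    results[lid] = sum(scratch)

threads = [threading.Thread(target=work_item, args=(i,)) for i in range(GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every slot now holds the full group sum: 1 + 2 + 3 + 4 = 10.
```

The compiler-based approaches in the paper avoid exactly this cost: by restructuring the kernel around the barriers at compile time, a single CPU thread can execute many work-items without violating the forward-progress guarantee.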
{"title":"Compiler-aided nd-range parallel-for implementations on CPU in hipSYCL","authors":"Joachim Meyer, Aksel Alpay, H. Fröning, V. Heuveline","doi":"10.1145/3529538.3530216","DOIUrl":"https://doi.org/10.1145/3529538.3530216","url":null,"abstract":"With heterogeneous programming continuously on the rise, performance portability is still to be improved. SYCL provides the nd-range parallel-for paradigm for writing data-parallel kernels. This model allows barriers for group-local synchronization, similar to CUDA and OpenCL kernels. GPUs provide efficient means to model this, but on CPUs the necessary forward-progress guarantees require the use of many (lightweight) threads in library-only SYCL implementations, rendering the nd-range parallel-for unacceptably inefficient. By adopting two compiler-based approaches solving this, the present work improves the performance of the nd-range parallel-for in hipSYCL for CPUs by up to multiple orders of magnitude on various CPU architectures. The two alternatives are compared with regard to their functional correctness and performance. By upstreaming one of the variants, hipSYCL is the first SYCL implementation to provide a well performing nd-range parallel-for on CPU, without requiring an available OpenCL runtime.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74202660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
An Overview of OpenCL Vendor Extensions Supported in Qualcomm Adreno GPUs
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530002
Hongqiang Wang, Balaji Calidas
One of the key advantages of using OpenCL is its openness and flexibility, as it allows OpenCL vendors to extend the standard OpenCL features or add new features through the extension mechanism. OpenCL allows three types of extensions, the KHR extensions, the external extensions, and the vendor extensions. Vendor extensions are less restrictive than the KHR and the external extensions, which normally require multiple vendors to adopt or conformance tests to pass. This poster focuses on the vendor extensions solely available on the Adreno mobile GPUs in Qualcomm’s Snapdragon SOCs (system-on-chip). Adreno GPUs support a wide range of vendor extensions. This poster will provide a high-level overview of the extensions. More detailed descriptions and examples can be found in [1]. Note that Adreno GPUs have many tiers and generations featuring different capabilities. Generally, developers must query its availability on the device before using the extension via API calls such as clGetDeviceInfo , to avoid possible incompatibility or portability issues in future.
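The availability query mentioned above returns the device's extensions as a single space-separated string, and one pitfall is that a raw substring search can match a prefix of a longer extension name. Below is a minimal whole-token check (a sketch: the `clGetDeviceInfo` call that produces the string is omitted so the example stays self-contained, and the extension names used later are only illustrative):

```cpp
#include <string>

// Whole-token search in the space-separated extension string obtained via
// clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...). A plain substring
// search is not enough: "cl_qcom_ext" would also be found inside the
// longer name "cl_qcom_ext_host_ptr".
bool has_extension(const std::string& ext_list, const std::string& name) {
    std::size_t pos = 0;
    while ((pos = ext_list.find(name, pos)) != std::string::npos) {
        const bool starts = (pos == 0) || (ext_list[pos - 1] == ' ');
        const std::size_t end = pos + name.size();
        const bool ends = (end == ext_list.size()) || (ext_list[end] == ' ');
        if (starts && ends)
            return true;
        pos = end;  // keep scanning past this partial match
    }
    return false;
}
```

Guarding every use of a vendor extension behind such a check keeps the same binary portable across GPU tiers and generations that expose different extension sets.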
{"title":"An Overview of OpenCL Vendor Extensions Supported in Qualcomm Adreno GPUs","authors":"Hongqiang Wang, Balaji Calidas","doi":"10.1145/3529538.3530002","DOIUrl":"https://doi.org/10.1145/3529538.3530002","url":null,"abstract":"One of the key advantages of using OpenCL is its openness and flexibility, as it allows OpenCL vendors to extend the standard OpenCL features or add new features through the extension mechanism. OpenCL allows three types of extensions, the KHR extensions, the external extensions, and the vendor extensions. Vendor extensions are less restrictive than the KHR and the external extensions, which normally require multiple vendors to adopt or conformance tests to pass. This poster focuses on the vendor extensions solely available on the Adreno mobile GPUs in Qualcomm’s Snapdragon SOCs (system-on-chip). Adreno GPUs support a wide range of vendor extensions. This poster will provide a high-level overview of the extensions. More detailed descriptions and examples can be found in [1]. Note that Adreno GPUs have many tiers and generations featuring different capabilities. Generally, developers must query its availability on the device before using the extension via API calls such as clGetDeviceInfo , to avoid possible incompatibility or portability issues in future.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85231853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0