
International Workshop on OpenCL: Latest Publications

Sylkan: Towards a Vulkan Compute Target Platform for SYCL
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456683
Peter Thoman, Daniel Gogl, T. Fahringer
SYCL is a modern high-level C++ programming interface which excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door for compliant alternative implementations. In this paper, we discuss the opportunities and challenges of mapping SYCL to Vulkan, a low-level explicit programming model for GPUs. This includes an analysis of the potential semantic mismatches between the two standards, as well as approaches to work around some of these issues. Additionally, we present a prototype research implementation of Sylkan, a SYCL compiler and runtime targeting Vulkan. In order to evaluate our prototype qualitatively and quantitatively, we chose a variety of functional tests as well as three performance benchmarks. For the functional tests, we discuss and categorize the failures of the current prototype, noting which semantic mismatch or missing implementation causes them. For the performance benchmarks, we compare execution times against an OpenCL-based SYCL implementation and a native Vulkan version of each benchmark, on two hardware platforms.
Citations: 4
Experiences Supporting DPC++ in AMReX
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456673
Sravani Konda, Dunni Aribuki, Weiqun Zhang, K. Gott, C. Lishka
AMReX is a software framework for massively parallel, block-structured adaptive mesh refinement (AMR) applications. AMReX is developed as part of the United States Department of Energy’s Exascale Computing Project (ECP). Besides AMR capabilities, AMReX also provides a parallel programming framework for numerous applications including six ECP projects, and it implements several backends for CPU-GPU heterogeneous computing. In this talk, we present our experiences supporting DPC++, a language based on the SYCL specification, as a backend for AMReX. We will demonstrate how AMReX provides an abstraction layer for its users so that they can write performance-portable code for a variety of heterogeneous platforms. We will discuss key DPC++ features that allow AMReX to implement the abstractions, and our contributions to the oneAPI specification and Intel’s implementation. We will also highlight some features missing in SYCL/DPC++ that limit its efficiency, and our future plans.
Citations: 0
Experiences With Adding SYCL Support to GROMACS
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456690
Andrey Alekseenko, Szilárd Páll, E. Lindahl
GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. 
Some design differences from OpenCL, such as task flow expressed as directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS’s task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level, hardware-specific code) against maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; the compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of GPU acceleration code in GROMACS.
Citations: 3
Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456692
Marcel Breyer, Gregor Daiß, D. Pflüger
In the age of data collection, machine learning algorithms have to be able to efficiently cope with vast data sets. This requires scalable algorithms and efficient implementations that can cope with heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. One such algorithm is the Locality-Sensitive Hashing (LSH) algorithm, which uses hash tables together with locality-sensitive hash functions to reduce the number of data points that have to be examined to compute the k-NN. To the best of our knowledge, there is no distributed LSH version supporting multiple GPUs from different vendors available so far, despite the fact that k-NN is frequently employed. Therefore, we have developed a library that provides the first hardware-independent, yet efficient and distributed, implementation of the LSH algorithm suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, which is an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we distribute across multiple GPUs using the Message Passing Interface (MPI), enabling the use of both shared and distributed memory systems. We have tested different parameter combinations for two locality-sensitive hash function implementations, which we compare.
Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations—ComputeCpp, hipSYCL, and DPC++—to target different hardware architectures without significant performance differences.
Citations: 1
Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456694
T. Sabino, M. Goli
Presented in 1979, BLAS is, to this day, the de-facto standard for low-level linear algebra routines. BLAS provides essential linear algebra routines used in various domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted for a broad range of hardware, from HPC to embedded systems and specialized AI accelerators. While BLAS routines were originally implemented for CPUs, with the emergence of GPGPU they had to be re-written to exploit the extensive computational power these devices provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With such a wide range of hardware available, each with its own memory hierarchy, cache-line sizes, memory access patterns required for performance, register counts, and memory interconnects, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge of the heterogeneous programming world. Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS, employing a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we can achieve up to 2.6x speedup on an Intel GPU, 7x on an AMD GPU, and up to 3.4x speedup on an ARM GPU compared with the highly optimized clBLAST and clBLAS libraries.
Citations: 1
Performance Evaluation and Improvements of the PoCL Open-Source OpenCL Implementation on Intel CPUs
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456698
Tobias Baumann, M. Noack, T. Steinke
The Portable Computing Language (PoCL) is a vendor-independent open-source OpenCL implementation that aims to support a variety of compute devices in a single platform. Evaluating PoCL against the Intel OpenCL implementation reveals significant performance drawbacks of PoCL on Intel CPUs, which power 92% of the TOP500 list. Using a selection of benchmarks, we identify and analyse performance issues in PoCL with a focus on scheduling and vectorisation. We propose a new CPU device driver based on Intel Threading Building Blocks (TBB), and evaluate LLVM with respect to automatic compiler vectorisation across work-items in PoCL. Using the TBB driver, it is possible to narrow the gap to Intel OpenCL and even outperform it by a factor of up to 1.3× in our proxy application benchmark with a manual vectorisation strategy.
Citations: 2
Enabling OpenCL and SYCL for RISC-V processors
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456687
Rod Burns, Colin Davidson, Aidan Dodds
RISC-V is a non-profit, member-managed organization that is gaining momentum in the processor space, with more than 900 members. One of the goals of the organization is to build an open software platform, providing software developers an easy way to harness the familiar benefits already available on CPUs and GPUs. Today, system-on-chip manufacturers are building specialist accelerator processors based on the RISC-V architecture, taking advantage of the vector extensions that match the compute performance mostly seen on GPUs today. The availability of a familiar and well-defined programming model is an absolute requirement if these new processors are to be brought to market successfully. This presentation will dive into the details of Codeplay’s work in partnership with NSI-TEXE and Kyoto Microcomputer, describing the components needed to integrate OpenCL and SYCL onto RISC-V using multiple simulators. This project forms part of Japan’s New Energy and Industrial Technology Development Organisation (“NEDO”) project to build a powerful supercomputer. While Codeplay has previously enabled OpenCL for a variety of processor architectures, there are a number of technical challenges involved in delivering a generic integration that can be used by multiple RISC-V based systems, and the solution required a change in approach. By adding to the existing LLVM back-end for RISC-V, and creating an integration layer that plugs into OpenCL, we have built a common base architecture for a range of RISC-V processors from different companies. This presentation will explain how Codeplay’s current driver interface works, and how it has been adapted to integrate with multiple RISC-V targets, in particular the riscvOVPsim and Spike RISC-V ISA simulators. We will also talk about some of the RISC-V extensions that are available, and how these can help to expose features specific to the RISC-V architecture through OpenCL.
Citations: 4
Executing Graphs with OpenCL
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456681
Erik Tomusk
For several decades, graph and dataflow programming models have been niche topics limited to a small number of highly specialized domains. In recent years, however, the machine learning (ML) revolution and the proliferation of ML libraries have made graph programming accessible to even novice programmers. Before, a beginner programmer may have talked about writing a number-guessing game; today the programmer will describe training an off-the-shelf neural network—a type of graph—for handwriting recognition. There is growing demand from industry and individual users to run programs that are based on ML graphs. This demand is being met by hardware vendors, who are designing larger and increasingly heterogeneous accelerator devices that can efficiently execute graphs. Since its creation, OpenCL has been a key API for bridging the gap between user applications and accelerator hardware. The question, then, is whether OpenCL is an appropriate API for this new breed of graph software running on these large, heterogeneous accelerators. Does OpenCL have the expressive power required to describe an execution graph to accelerator hardware, or does OpenCL serialize graphs and execute them sequentially? This technical presentation argues that it is the former: OpenCL is sufficiently expressive to allow an ML library to describe an execution graph, and OpenCL is sufficiently powerful to execute that graph on a graph accelerator. The OpenCL API is designed around the concept of the user enqueuing commands onto the front of a command-queue. Commands include executing kernels (i.e., functions) and reading, writing, and copying data buffers. The OpenCL device driver removes commands from the back of a command-queue, sets up data transfers to and from the accelerator device, and schedules kernels to execute on the device. The command-queue abstraction can encode execution graphs in one of two ways, depending on whether the command-queue is an in-order command-queue or an out-of-order command-queue. An in-order command-queue guarantees that the effects of the enqueued commands will be as if the commands were executed in the order in which they were enqueued. However, the OpenCL device driver is allowed to reorder commands, provided that reordering does not affect the output. For example, if two kernels do not have a data dependency between them, then they can be executed in reverse order or even in parallel, if the driver and hardware support it. An out-of-order command-queue does not guarantee that commands will appear to have been executed in the order in which they were enqueued. Instead, it is the OpenCL API user’s responsibility to attach events and event wait lists to commands. When a command finishes executing, it triggers its attached event, and when all the events in a command’s event wait list have triggered, then that command is allowed to execute. Both types of command-queues are capable of describing execution graphs. For in-order command-queues, the graph is implied by the kernels’ data dependencies; for out-of-order command-queues, the graph is defined explicitly with events. By instrumenting Codeplay’s ComputeAorta [2] OpenCL implementation, it is possible to record OpenCL API calls and reconstruct the execution graph seen by the OpenCL device driver. This presentation examines the execution graphs generated by a simplified handwriting-recognition neural network implemented in TensorFlow [1] and run on top of OpenCL through SYCL. Training a neural network and using it for inference produce fundamentally different execution graphs; both graphs are considered. The graphs reveal data dependencies, opportunities to execute kernels in parallel, and opportunities to reorder kernels, all of which are visible to the driver. An OpenCL device driver can therefore schedule the work onto a hardware accelerator designed for graph execution. Notably, while OpenCL makes it possible to expose an execution graph to the device driver, OpenCL cannot guarantee that a sequence of OpenCL API calls will form a meaningful graph. For example, if a user packs many independent data arrays into a single memory buffer and enqueues kernels that all operate on that one buffer, then the information about the execution graph is hidden from OpenCL, and the opportunities for parallel execution and kernel reordering are lost. Typically, application developers do not write OpenCL code directly, but use libraries with OpenCL backends. It is therefore the library developer’s responsibility to ensure that the graph an application intends to execute is correctly represented at the OpenCL level.
Citations: 0
SYCL, DPC++, XPUs, oneAPI
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456719
J. Reinders
James will share his passion for getting to a world of heterogeneous computing where software tooling (compilers, frameworks, libraries, etc.) all have an “XPU view” of the world that spans vendors and devices. In this world, James advocates that we all be free to write our programs to use whatever XPUs we want, get full access to all XPU capabilities, and be comfortable trusting our ability to do this without extra risk to performance or stability. James will discuss how SYCL, DPC++, XPUs, and oneAPI all are important on our journey to make this vision a reality. James invites all conference attendees to join in and help guide Intel’s enthusiasm to help us all succeed together. Note: James co-authored the first (and only for now) book that teaches SYCL 2020 programming.
Citations: 1
FAST: A framework for high-performance medical image computing and visualization
Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456717
E. Smistad
Medical image processing and visualization is often computationally demanding. Ultrasound images are acquired in real-time and need to be processed at a high framerate with low latency. Computed tomography (CT) and magnetic resonance imaging (MRI) create large three-dimensional volumes with sizes up to 512 × 512 × 800 voxels. In digital pathology, whole-slide microscopy images can reach extreme sizes of up to 200,000 × 100,000 pixels, which does not even fit into the memory of most computers. Thus, there is a need for smart data storage, processing and visualization methods to handle medical image data. The development of FAST started in 2014 with the goal of creating an open-source framework that makes GPU and parallel processing of medical images easy and portable. While popular image processing libraries such as the Visualization Toolkit (VTK), the Insight Toolkit (ITK) and OpenCV already existed, their GPU processing capabilities were implemented ad hoc and often implied copying data back and forth between the GPU and CPU. It was therefore decided to use the new OpenCL API to create a cross-platform framework designed bottom-up with GPU processing at its very core. One of the design goals was to remove from the developer the burden of moving data back and forth between different processors and memory spaces. Instead, the developer requests access to the data on a given processor, and FAST copies and updates the data as needed. Now, seven years later, FAST version 3.2 has been released; it still uses OpenCL 1.2 and OpenGL 3.3 at the core of almost all of its operations. FAST can stream images in real-time from ultrasound scanners, web cameras and Intel’s RealSense depth camera, and can read many different formats from disk, including medical formats such as DICOM and MetaImage as well as huge microscopy images stored as tiled image pyramids. FAST uses a processing pipeline concept, meaning that you first define a pipeline as multiple processing and visualization steps, then initiate the processing by executing the pipeline. The advantage of this is that it is easy to change data sources and processing steps: the same pipeline used to process an ultrasound image on disk can be used to process a real-time stream of ultrasound images. Today, FAST pipelines can be created with C++, with Python 3, and even without any programming, using simple text files. The pipeline approach also opens up possibilities for load balancing and tuning based on analyzing the pipeline as a computational graph, although this has not yet been implemented. In the last five years or so, deep neural networks have become the standard for almost all image processing tasks. Many high-performance frameworks for deep neural network inference already exist, but they have very different APIs and use different formats for storing neural network models. FAST now provides a common API for neural networks with multiple backends such as NVIDIA’s TensorRT, Intel’s OpenVINO and Google’s TensorFlow. This removes the burden of the user having to learn the API of each inference library, and makes neural network inference as simple as loading a model stored on disk. This talk will present the FAST framework and how OpenCL was used to build it. The trade-off between portability/ease of use/code complexity and performance has been a constant challenge, often resulting in sacrificed performance or in having to write multiple versions of the same algorithm to handle different OpenCL implementations. The talk will also discuss some important features such as OpenGL interoperability and 2D/3D images/textures. FAST is open source, and we invite the community to contribute through GitHub at https://github.com/smistad/FAST
Citations: 2