
International Workshop on OpenCL: Latest Publications

Embedding a DSL in SYCL for Productive and Performant Tensor Computing on Heterogeneous Devices
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529988
Abenezer Wudenhe, Hongbo Rong
Citations: 0
Exploring the possibility of a hipSYCL-based implementation of oneAPI
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530005
Aksel Alpay, Bálint Soproni, Holger Wünsche, Vincent Heuveline
oneAPI is an open standard for a software platform built around SYCL 2020 and accelerated libraries such as oneMKL as well as low-level building blocks such as oneAPI Level Zero. All oneAPI implementations currently are based on the DPC++ SYCL implementation. However, being able to utilize multiple independent SYCL implementations with oneAPI code can be beneficial to both users and implementors when it comes to testing code, or e.g. noticing ambiguities in the specification. In this work, we explore the possibility of implementing oneAPI using hipSYCL as an independent SYCL implementation instead. We review hipSYCL’s design and demonstrate it running on oneAPI Level Zero with competitive performance. We also discuss hipSYCL’s support for SYCL 2020 with the examples of unified shared memory (USM), group algorithms and optional kernel lambda naming. To this end, we also contribute microbenchmarks for the SYCL 2020 group algorithms and demonstrate their performance. When testing hipSYCL with HeCBench, a large benchmark suite containing SYCL benchmarks initially developed for DPC++, we point out specification ambiguities and practices that negatively impact code portability when transitioning from DPC++ to hipSYCL. We find that we can compile 122 benchmarks with little effort with hipSYCL, and demonstrate performance for a selection of benchmarks within 20% of native models on NVIDIA and AMD GPUs. Lastly, we demonstrate oneMKL’s BLAS domain running with hipSYCL on AMD and NVIDIA GPUs, and find that it can match native cuBLAS and rocBLAS performance for BLAS level 1, level 2 and level 3 operations, while significantly outperforming oneMKL with DPC++ on NVIDIA GPUs for all but the largest problem sizes. Overall, we find that hipSYCL can support low-level building blocks like Level Zero, oneAPI libraries like oneMKL, and the SYCL 2020 programming model efficiently, and hence conclude that it is indeed possible to implement oneAPI independently from DPC++.
Citations: 9
Optimize AI pipelines with SYCL and OpenVINO
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529561
Nico Galoppo
Sensor data processing pipelines that are a ”mix” of feature-engineered and deep learning based processing have become prevalent today. For example, sensor fusion of point cloud data with RGB image streams is common in autonomous mobile robots and self-driving technology. The state-of-the-art in computer vision for extracting semantic information from RGB data is using deep learning today, and great advancements have been made recently in LiDAR odometry based on deep learning [x]. At the same time, other processing components in ”mixed” pipelines still use feature-engineered approaches that are not relying on deep neural nets. Embedded compute platforms in robotics systems are inherently heterogeneous in nature, often with a variety of CPUs, (integrated) GPUs, VPUs, and so on. This means that there is a growing need to implement ”mixed” pipelines on heterogeneous platforms that include a variety of xPUs. We want such pipeline implementations to benefit from the latest advancements in data- and thread-parallel computation, as well as state-of-the-art in optimized inference of AI DNN models. SYCL and OpenVINO are two open, industry supported APIs that allow a developer to do so. It is not only important to optimize the individual components of the processing pipeline - it is at least as important to also optimize the data flow and minimize data copies. This provides a way to benefit from the efficiencies in inference runtime and compute graph optimizations provided by OpenVINO, in combination with the extensibility that SYCL brings in implementing custom or non-DNN components. Similarly, the use of compatible synchronization primitives allows the different runtimes to schedule work more efficiently on the hardware and avoid execution hiccups. In this talk, we will demonstrate the mechanisms and primitives provided by both SYCL and OpenVINO to optimize the dataflow between, and efficient execution of the workloads implemented in the respective APIs. 
We will provide an example and show the impact on the overall throughput and latency of the end-to-end processing pipeline. The audience will learn to recognize inefficiencies in their pipelines using profiling tools, and understand how to optimize those inefficiencies using an easy-to-follow optimization recipe. Finally, we will provide guidance to developers of inference engines other than OpenVINO on how to integrate similar interoperability features into their APIs, so that they too can offer optimized SYCL-enabled AI pipelines to their users.
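The data-flow point above can be sketched in plain C++. This is only an illustration of the share-one-buffer principle, not the SYCL or OpenVINO API; the two stage functions are invented names standing in for a custom kernel and an inference step.

```cpp
#include <numeric>
#include <vector>

// Stage 1 ("custom" pre-processing kernel): writes its result in place,
// so no intermediate copy is made before the next stage.
void preprocess_inplace(std::vector<float>& buf) {
    for (auto& v : buf) v *= 0.5f;
}

// Stage 2 ("inference" step): reads the very same buffer.
float infer_sum(const std::vector<float>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0.0f);
}
```

With interoperable runtimes the same idea applies across an API boundary: both sides reference one device allocation, and compatible synchronization primitives order the accesses instead of a copy.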
Citations: 0
TAU Performance System
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529557
S. Shende
The TAU Performance System is a versatile performance evaluation tool that supports OpenCL, DPC++/SYCL, OpenMP, and other GPU runtimes. It features a performance profiling and tracing module that is widely portable and can access hardware performance counter data at the GPU and CPU level. This talk will describe the usage and new features of TAU for performance evaluation of HPC and AI/ML workloads. TAU is integrated in the Extreme-Scale Scientific Software Stack (E4S) and is available in containerized and cloud environments. The talk/tutorial will demonstrate the usage of TAU on uninstrumented applications.
Citations: 0
OpenCLML Integration with TVM
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530003
Siva Rama Krishna Reddy, Hongqiang Wang, Alex Bourd, Adarsh Golikeri, Balaji Calidas
Citations: 1
How to optimize Compute Drivers? Let’s start with writing good benchmarks!
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529569
Michał Mrozek
Writing an efficient driver stack is the goal of every driver developer, but to know whether your stack is performant you need tools that can confirm it. You can run workloads and benchmarks to see how your driver performs, but this only gives you an aggregate score made up of many pieces. Optimizing further requires extensive work: understanding the applications, finding the bottleneck, and removing it, which is a time-consuming process involving a lot of effort. This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created compute benchmarks. In this suite we test all aspects of the driver stack to check that none of them hides a bottleneck. Each test checks only one thing and does so in isolation, so it is very easy to work on optimizing it and it requires no extensive setup. The benchmarks focus on subtle aspects of every driver such as: the API overhead of every call, submission latencies, resource creation costs, transfer bandwidths, multi-threaded contention, multi-process execution, and many others. The framework supports multiple backends; we currently have OpenCL and Level Zero implementations in place, so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations. We also use this code to present good and bad coding practices; this is very useful to showcase how simple things can drastically improve performance, and users can simply run those scenarios and see how performance changes on their own setups. It is also a great tool for prototyping new extensions and proposing them as part of the OpenCL standard. We plan to open-source this project in Q2 2022; it is expected to be available already during IWOCL.
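The "each test checks one thing, in isolation" idea can be sketched as a minimal timing harness in plain C++. Here dummy_api_call is a hypothetical stand-in for a real driver entry point; this is not code from the compute-benchmarks suite.

```cpp
#include <chrono>

// The call under test. A real benchmark would invoke a single driver API
// (e.g. one submission or one resource creation) here and nothing else.
volatile int sink = 0;
void dummy_api_call() { sink = sink + 1; }

// Time the call in a tight loop and report the mean overhead per call.
double mean_call_overhead_ns(int iterations) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) dummy_api_call();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iterations;
}
```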
Citations: 0
Improved address space inference for SYCL programs
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529998
Ross Brunton, V. Lomüller
SYCL [4, 6] is a single-source C++-based programming model for heterogeneous programming. It enables the programmer to write or port code targeting heterogeneous accelerators using what appears to the programmer as standard C++. To achieve peak performance, however, it can be necessary to write the code in a form which allows the compiler to target specific hardware features. If the compiler can target these hardware features without requiring the programmer to consider them, then productivity and application performance can both be improved. One such example is accelerators with multiple address spaces: this technical talk will describe how a SYCL compiler can infer these address spaces without requiring the programmer to specify them in their application, as well as some required specification evolution to better cope with the new SYCL 2020 features. Hardware devices can have multiple memory regions with different levels of visibility and performance. Similar to OpenCL C [5], SYCL abstracts them into a global memory visible to all work-items, a local memory visible to a single work-group, or a private memory only visible to a single work-item. In OpenCL C, the programmer expresses address spaces using type qualifiers in order to statically encode the memory region addressed by pointers, thus ensuring that when a programmer does specify an address space the compiler can check whether the program is well-formed. But requiring programs to be written with explicit address spaces comes at the expense of usability, as these need to be integrated into the program design and are a barrier to integrating code not written with this in mind. Thus in OpenCL C 2.x/3 programmers can make use of the unnamed generic address space instead.
SYCL, on the other hand, does not extend the C++ language, so programmers cannot express address spaces using a type qualifier (the C++ standard does not define them). Thus in SYCL, pointers and references can be lowered to this unnamed generic address space by the device compiler. This generic address space is a virtual address space that can represent several overlapping address spaces at the same time. The memory being addressed is no longer statically known by the compiler frontend, and the SYCL implementation relies on the hardware, or software emulation, to correctly dispatch the loads and stores to the correct memory. On some hardware targets this flexibility comes with a performance cost, but this can be avoided when the compiler can infer a single address space for a given memory access. Additionally, the low-level compute APIs that are often used as backends to a SYCL 2020 implementation do not guarantee support for a generic address space; it is, for example, an optional feature in OpenCL 3.0 and non-existent in Vulkan. This means that a SYCL compiler that can infer all address spaces for a large set of programs can achieve better performance and target a wider range of backend compute APIs. Moreover, the rules introduced by SYCL 1.2.1 impose significant restrictions on user code. A striking example is the "default rule": when a pointer declaration has no initializer, the pointer is assumed to address private memory, even if it is initialized in the very next statement; consequently, a pointer declared inside a struct must default to the private address space. In practice, however, these restrictions were not a major obstacle in a 1.2.1 context: large applications such as Eigen [3] were ported to run on SYCL, and new libraries such as SYCL-BLAS [1] and SYCL-DNN [2] were built. SYCL 2020 brings major changes and added flexibility for users, including the unnamed generic address space and unified shared memory (USM) pointers. The generic address space lifts the restrictions described for 1.2.1, so programs written for SYCL 2020 and the generic address space are less likely to compile under the old inference-rule restrictions. USM encourages the use of raw pointers rather than accessor containers, which quickly means passing such pointers through structs; since USM pointers actually address a global memory region, this conflicts with the inference rules. This talk will describe an experimental compiler for ComputeCpp, Codeplay's SYCL implementation, which adopts an improved approach to address space inference that copes well with SYCL 2020 features such as the generic address space and USM pointers. The talk also covers the limitations of this approach.
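A toy C++ model of the distinction the abstract draws, a statically encoded address space versus a generic one resolved at runtime, might look as follows. The Space enum and both wrappers are invented for illustration; they are not the SYCL multi_ptr API.

```cpp
// Memory regions a pointer may address, as in the OpenCL C / SYCL model.
enum class Space { global_mem, local_mem, private_mem };

// Statically encoded region: the compiler knows where the pointer points,
// so loads and stores can be dispatched to the right memory at compile time.
template <typename T, Space S>
struct typed_ptr {
    T* p;
    static constexpr Space space = S;
    T& operator*() const { return *p; }
};

// "Generic" pointer: the region must be carried and resolved per access,
// which is where the performance cost on some hardware comes from.
template <typename T>
struct generic_ptr {
    T* p;
    Space space;  // known only at runtime
    T& operator*() const { return *p; }
};
```

Address space inference, in these terms, is the compiler proving that a generic_ptr can be replaced by a typed_ptr for a given access.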
Citations: 0
SYCL Concurrency on GPU Platforms: Empirical Measurement
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529989
T. Applencourt, Abhishek Bagusetty, Ajay Panyala, Aksel Alpay
Citations: 0
FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3530007
Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy
We explore the design and development of structured-mesh-based solvers on Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods, and (2) multi-dimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a variety of real-world applications, ranging from computational fluid dynamics to financial computing. A general, unified workflow is formulated for synthesizing these applications on Intel FPGAs, together with predictive analytic models for exploring the design space to obtain optimized performance. The performance of designs synthesized with these techniques is benchmarked for two non-trivial applications on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA provides performance comparable to or better than the V100 GPU. More importantly, however, the FPGA solutions consume 59%–76% less energy for their largest configurations. Our performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating significant utility for design space exploration. With these tools and techniques, we discuss the determinants of whether a given structured-mesh code is amenable to FPGA implementation, providing insights into the feasibility and profitability of an FPGA implementation, how to code designs using SYCL, and the resulting performance.
我们探索使用SYCL编程模型在英特尔FPGA硬件上设计和开发基于结构化网格的求解器。针对两类应用:(1)基于显式数值方法的模板应用和(2)基于隐式方法的多维三对角线求解。这两类求解器在从计算流体动力学到金融计算的各种实际应用中都作为核心模块出现。为在英特尔fpga上综合这些应用程序制定了一个通用的、统一的工作流程,并结合预测分析模型来探索设计空间以获得最佳性能。在英特尔PAC D5005 FPGA卡上,使用上述技术对两个重要应用程序的综合设计性能进行了基准测试。结果与Nvidia V100 GPU上相同应用程序的优化并行实现的性能进行了比较。观察到的运行时结果表明,FPGA提供了与V100 GPU相当或更高的性能。然而,更重要的是,FPGA解决方案在其最大配置中消耗的能量减少了59%-76%。我们的性能模型以高精度预测设计的运行时,在所有测试情况下误差小于5%,证明了设计空间探索的重要实用性。通过这些工具和技术,我们讨论了给定结构化网格代码适用于FPGA实现的决定因素,提供了对FPGA实现的可行性和盈利能力的见解,如何使用SYCL编码设计,以及由此产生的性能。
{"title":"FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL","authors":"Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy","doi":"10.1145/3529538.3530007","DOIUrl":"https://doi.org/10.1145/3529538.3530007","url":null,"abstract":"We explore the design and development of structured-mesh-based solvers on Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted : (1) stencil applications based on explicit numerical methods and (2) multi-dimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a variety of real-world applications ranging from computational fluid dynamics to financial computing. A general, unified workflow is formulated for synthesizing these applications on Intel FPGAs together with predictive analytic models to explore the design space to obtain optimized performance. Performance of synthesized designs, using the above techniques, for two non-trivial applications on an Intel PAC D5005 FPGA card is benchmarked. Results are compared to the performance of optimized parallel implementations of the same applications on a Nvidia V100 GPU. Observed runtime results indicate the FPGA providing comparable or improved performance to the V100 GPU. However, more importantly the FPGA solutions consume 59%–76% less energy for their largest configurations. Our performance model predicts the runtime of designs with high accuracy with less than 5% error for all cases tested, demonstrating significant utility for design space exploration. 
With these tools and techniques, we discuss the determinants of whether a given structured-mesh code is amenable to FPGA implementation, providing insights into the feasibility and profitability of an FPGA implementation, how to code designs using SYCL, and the resulting performance.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89540801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
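The implicit solver class targeted by the abstract above is built on tridiagonal systems. As a reference point for what such a solver computes, here is a minimal sequential Thomas-algorithm sketch in Python. This is not the paper's FPGA/SYCL code — real implementations batch many independent systems along the other mesh dimensions — and the function and variable names are illustrative:

```python
def thomas_solve(a, b, c, d):
    """Solve the tridiagonal system A x = d, where a is the sub-diagonal
    (a[0] unused), b the main diagonal, c the super-diagonal (c[-1]
    unused). Forward elimination followed by back substitution; the
    loop-carried dependence is what makes the method 'implicit'."""
    n = len(b)
    b, d = list(b), list(d)  # work on copies; keep caller's data intact
    # Forward elimination: fold the sub-diagonal into b and d.
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    return x

# Classic 1D implicit heat-equation matrix (diagonal 2, off-diagonals -1):
x = thomas_solve(a=[0.0, -1.0, -1.0], b=[2.0, 2.0, 2.0],
                 c=[-1.0, -1.0, 0.0], d=[0.0, 0.0, 4.0])
assert [round(v, 9) for v in x] == [1.0, 2.0, 3.0]
```

The two sequential sweeps explain why such solvers are harder to map to FPGAs than explicit stencils: parallelism must come from solving many lines of the mesh concurrently rather than from within one system.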
How much SYCL does a compiler need? Experiences from the implementation of SYCL as a library for nvc++ 编译器需要多少SYCL ?SYCL作为nvc++库的实现经验
Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529556
Aksel Alpay, V. Heuveline
{"title":"How much SYCL does a compiler need? Experiences from the implementation of SYCL as a library for nvc++","authors":"Aksel Alpay, V. Heuveline","doi":"10.1145/3529538.3529556","DOIUrl":"https://doi.org/10.1145/3529538.3529556","url":null,"abstract":"","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87628873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1