Compiler-aided nd-range parallel-for implementations on CPU in hipSYCL
Joachim Meyer, Aksel Alpay, H. Fröning, V. Heuveline
International Workshop on OpenCL, 10 May 2022. DOI: 10.1145/3529538.3530216
With heterogeneous programming continuously on the rise, performance portability remains in need of improvement. SYCL provides the nd-range parallel-for paradigm for writing data-parallel kernels. This model allows barriers for group-local synchronization, similar to CUDA and OpenCL kernels. GPUs provide efficient means to implement this, but on CPUs the necessary forward-progress guarantees force library-only SYCL implementations to use many (lightweight) threads, rendering the nd-range parallel-for unacceptably inefficient. By adopting two compiler-based approaches that solve this, the present work improves the performance of the nd-range parallel-for in hipSYCL on CPUs by up to multiple orders of magnitude across various CPU architectures. The two alternatives are compared with regard to their functional correctness and performance. By upstreaming one of the variants, hipSYCL becomes the first SYCL implementation to provide a well-performing nd-range parallel-for on CPU without requiring an available OpenCL runtime.
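
For illustration, the kernel shape the abstract refers to looks as follows: work-items are organized into work-groups and synchronize through a group-local barrier. This is a minimal, generic SYCL 2020 sketch, not code from the paper; the sizes, the scratch buffer, and the rotation performed after the barrier are illustrative choices, and the snippet assumes a SYCL 2020 implementation such as hipSYCL with its CPU compiler support enabled.

// Minimal nd-range parallel-for with a group barrier (illustrative sketch).
#include <sycl/sycl.hpp>   // SYCL 2020 header; older hipSYCL releases also accept <CL/sycl.hpp>

int main() {
  constexpr size_t N  = 1024;  // global range
  constexpr size_t WG = 128;   // work-group size

  sycl::queue q;               // on a CPU-only system this resolves to the CPU device
  int *data = sycl::malloc_shared<int>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = static_cast<int>(i);

  q.submit([&](sycl::handler &cgh) {
    // Group-local scratch memory, one element per work-item.
    sycl::local_accessor<int, 1> scratch{sycl::range<1>{WG}, cgh};

    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{WG}},
                     [=](sycl::nd_item<1> it) {
      const size_t lid = it.get_local_id(0);
      scratch[lid] = data[it.get_global_id(0)];

      // Every work-item of the group must reach this barrier before any of
      // them may continue; this is the forward-progress requirement that is
      // costly to emulate on CPUs without compiler support.
      sycl::group_barrier(it.get_group());

      // After the barrier it is safe to read a neighbour's scratch element.
      data[it.get_global_id(0)] = scratch[(lid + 1) % WG];
    });
  }).wait();

  sycl::free(data, q);
  return 0;
}

On a GPU, each work-group maps to hardware that can park work-items at the barrier cheaply. A library-only CPU implementation instead has to back the work-items with many (lightweight) threads to preserve the same forward-progress guarantee, which is the overhead the two compiler-based approaches evaluated in the paper avoid.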