Performance analysis of matrix-free conjugate gradient kernels using SYCL

I. Baratta, C. Richardson, G. N. Wells
DOI: 10.1145/3529538.3529993
Journal: International Workshop on OpenCL
Published: 2022-05-10
Citations: 0

Abstract

We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations on the CPU node. All considered kernels in the matrix-free implementation are memory bandwidth limited, and a simple performance model is applied to estimate the asymptotic memory bandwidth and the latency. Our experiments show that in most cases the considered SYCL implementations match the asymptotic performance of the reference implementations. However, for smaller but practically relevant problem sizes latency is observed to have a significant impact on performance. For some cases the SYCL latency is reasonably close to the reference (cuBLAS/MKL) implementation latency, but in other cases it is more than one order of magnitude greater. In particular, SYCL built-in reductions on the GPU and all operations for one of the SYCL implementations on the CPU exhibit high latency, and this latency limits performance at problem sizes that can in cases be representative of full application simulations, and can degrade strong scaling performance.
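To make the "matrix-free" idea concrete: the conjugate gradient method never needs the sparse matrix explicitly, only the action of the operator on a vector. The following is a minimal illustrative sketch (not the paper's SYCL code; the names `cg`, `apply_laplacian` are hypothetical) in which the operator is passed as a callable, so no matrix is ever assembled or stored:

```python
# Matrix-free conjugate gradient: the operator A is supplied as a
# callable computing y = A x, so the sparse matrix is never stored.
# Illustrative sketch only; not the paper's SYCL implementation.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(apply_A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = [bi - ai for bi, ai in zip(b, apply_A(x))]  # r = b - A x
    p = list(r)
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:  # converged: residual norm below tol
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Example operator: 1D Laplacian stencil applied matrix-free
# (homogeneous Dirichlet boundary conditions).
def apply_laplacian(x):
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y
```

Each iteration consists of one operator application, two dot products (reductions), and three vector updates; on a GPU these are exactly the memory-bandwidth-limited kernels whose latency the paper measures.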
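The abstract's "simple performance model" for a memory-bandwidth-limited kernel is of the familiar form t(n) = latency + bytes(n) / bandwidth. A hedged sketch of how such a model's two parameters could be estimated from timings (the function name `fit_latency_bandwidth` and the synthetic numbers are assumptions, not values from the paper):

```python
# Fit t = latency + bytes / bandwidth by ordinary least squares on
# (bytes moved, runtime) pairs. The intercept estimates the kernel
# launch latency; the inverse slope estimates asymptotic bandwidth.
# Illustrative sketch; not the paper's actual model or data.

def fit_latency_bandwidth(sizes, times, bytes_per_dof):
    n = len(sizes)
    xs = [bytes_per_dof * s for s in sizes]  # bytes moved per run
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var                  # seconds per byte
    latency = mean_t - slope * mean_x  # intercept: fixed overhead
    bandwidth = 1.0 / slope            # bytes per second
    return latency, bandwidth
```

At small problem sizes the latency term dominates t(n), which is exactly why the paper observes latency differences of an order of magnitude translating into large performance gaps at practically relevant sizes.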