Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL
T. Sabino, M. Goli
International Workshop on OpenCL, 2021-04-27
DOI: 10.1145/3456669.3456694
Introduced in 1979, BLAS remains the de facto standard for low-level linear algebra routines. It provides essential routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted on a broad range of hardware, from HPC systems to embedded systems and specialized AI accelerators. BLAS routines were originally implemented for CPUs; with the emergence of GPGPU, they had to be rewritten to exploit the extensive computational power that GPUs provide. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. With a wide range of hardware available, each differing in memory hierarchy, cache line size, register count, type of memory connection, and the memory access patterns required for performance, achieving performance portability of BLAS routines across platforms without rewriting existing code is a major challenge in heterogeneous programming. Written in SYCL, SYCL-BLAS is an open-source BLAS library that provides performance portability across SYCL-enabled platforms. This paper presents the implementation of a parametric tile-based TRSM routine for SYCL-BLAS, using a formulation that leverages the highly optimized GEMM routine already provided in SYCL-BLAS. Our results show that, by tuning the tile size per device without reimplementing the kernel, we achieve speedups of up to 2.6x on an Intel GPU, 7x on an AMD GPU, and up to 3.4x on an ARM GPU compared with the highly optimized clBLAST and clBLAS libraries.
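The idea of a tile-parameterized TRSM whose bulk work has the shape of a GEMM can be illustrated with a minimal host-side sketch in plain C++. This is not the SYCL-BLAS kernel and not necessarily the formulation used in the paper; it is a generic blocked forward substitution for L * X = B with L lower triangular, written under the assumption of row-major storage. The names `Matrix`, `blocked_trsm_lower`, `max_abs_diff`, and the template parameter `Tile` are hypothetical and introduced only for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix helper, for illustration only.
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Solves L * X = B in place (B is overwritten with X).
// L is n x n lower triangular with a non-zero diagonal; B is n x m.
// Tile is the compile-time tile size, the knob one would tune per device.
template <std::size_t Tile>
void blocked_trsm_lower(const Matrix& L, Matrix& B) {
  const std::size_t n = L.rows;
  const std::size_t m = B.cols;
  for (std::size_t k0 = 0; k0 < n; k0 += Tile) {
    const std::size_t k1 = std::min(k0 + Tile, n);

    // Step 1: small triangular solve on the diagonal tile L[k0:k1, k0:k1].
    for (std::size_t i = k0; i < k1; ++i) {
      for (std::size_t j = 0; j < m; ++j) {
        double s = B(i, j);
        for (std::size_t p = k0; p < i; ++p) s -= L(i, p) * B(p, j);
        B(i, j) = s / L(i, i);
      }
    }

    // Step 2: GEMM-shaped trailing update,
    // B[k1:n, :] -= L[k1:n, k0:k1] * X[k0:k1, :].
    // In a tuned library this step would be delegated to an optimized GEMM.
    for (std::size_t i = k1; i < n; ++i) {
      for (std::size_t p = k0; p < k1; ++p) {
        const double l = L(i, p);
        for (std::size_t j = 0; j < m; ++j) B(i, j) -= l * B(p, j);
      }
    }
  }
}

// Largest absolute element-wise difference between two same-sized matrices.
double max_abs_diff(const Matrix& A, const Matrix& B) {
  double d = 0.0;
  for (std::size_t i = 0; i < A.data.size(); ++i)
    d = std::max(d, std::fabs(A.data[i] - B.data[i]));
  return d;
}
```

Calling `blocked_trsm_lower<2>(L, B)` and `blocked_trsm_lower<4>(L, B)` on the same inputs yields the same solution up to rounding; only the split between the small triangular solves and the GEMM-shaped updates changes. That split is what makes the tile size a tuning parameter rather than a reason to rewrite the kernel.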