Anatomy of the BLIS Family of Algorithms for Matrix Multiplication
Adrián Castelló, E. S. Quintana-Ortí, Francisco D. Igual
2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), March 2022
DOI: 10.1109/pdp55904.2022.00023
Citations: 3
Abstract
The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly tuned libraries, carefully adapted to the underlying processor architecture and delivering excellent (near-peak) performance. These optimization efforts, however, commonly focus on obtaining the best possible performance when the operands are large, “squarish” matrices. New computationally intensive applications (e.g., in deep learning) increasingly demand high-performance BLAS (Basic Linear Algebra Subprograms) routines also for operands that are small in one or more dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within BLIS, a specific high-performance implementation of BLAS. We propose a complete family of algorithmic variants that implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to the architecture-specific codes to instantiate a complete GEMM implementation. Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimensions of the matrix operands.
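To make the cache-aware structure concrete, below is a minimal C sketch of the classic five-loop blocked GEMM on which BLIS builds; the paper's family of variants arises from reordering these loops (and the associated packing) so that different operands reside in the different cache levels. The block sizes NC, KC, MC, the micro-tile sizes MR, NR, and the names gemm_blocked and micro_kernel are illustrative assumptions, not BLIS identifiers or the paper's tuned values; a production implementation also packs the A and B blocks into contiguous buffers and replaces the scalar micro-kernel with a vectorized, architecture-specific one.

/* Minimal sketch of a BLIS-style blocked GEMM: C += A * B, column-major.
 * Block sizes (NC, KC, MC) and micro-tile sizes (MR, NR) are illustrative
 * placeholders, not tuned values for any particular architecture. */
#include <stddef.h>
#include <stdio.h>

enum { NC = 256, KC = 128, MC = 64, MR = 4, NR = 4 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference "micro-kernel": accumulates an mr x nr tile of C over kc steps.
 * A real BLIS micro-kernel is vectorized and operates on packed buffers. */
static void micro_kernel(size_t mr, size_t nr, size_t kc,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double *C, size_t ldc)
{
    for (size_t j = 0; j < nr; ++j)
        for (size_t i = 0; i < mr; ++i) {
            double c = 0.0;
            for (size_t p = 0; p < kc; ++p)
                c += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] += c;
        }
}

/* Blocked GEMM: C (m x n) += A (m x k) * B (k x n), column-major. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t lda,
                  const double *B, size_t ldb,
                  double *C, size_t ldc)
{
    for (size_t jc = 0; jc < n; jc += NC) {          /* 5th loop: NC-wide panels of B and C */
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {      /* 4th loop: KC panels; B block cached   */
            size_t kc = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {  /* 3rd loop: MC panels; A block cached   */
                size_t mc = min_sz(MC, m - ic);
                for (size_t jr = 0; jr < nc; jr += NR)      /* 2nd loop: NR micro-panels */
                    for (size_t ir = 0; ir < mc; ir += MR)  /* 1st loop: MR micro-panels */
                        micro_kernel(min_sz(MR, mc - ir), min_sz(NR, nc - jr), kc,
                                     &A[(ic + ir) + pc * lda], lda,
                                     &B[pc + (jc + jr) * ldb], ldb,
                                     &C[(ic + ir) + (jc + jr) * ldc], ldc);
            }
        }
    }
}

int main(void)
{
    /* Small smoke test with column-major 2x2 matrices. */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};
    gemm_blocked(2, 2, 2, A, 2, B, 2, C, 2);
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]); /* expected: 23 34 31 46 */
    return 0;
}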