Anatomy of the BLIS Family of Algorithms for Matrix Multiplication

Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. Igual
{"title":"矩阵乘法的BLIS算法族剖析","authors":"Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. Igual","doi":"10.1109/pdp55904.2022.00023","DOIUrl":null,"url":null,"abstract":"The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.","PeriodicalId":210759,"journal":{"name":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Anatomy of the BLIS Family of Algorithms for Matrix Multiplication\",\"authors\":\"Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. 
Igual\",\"doi\":\"10.1109/pdp55904.2022.00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. 
Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.\",\"PeriodicalId\":210759,\"journal\":{\"name\":\"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/pdp55904.2022.00023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/pdp55904.2022.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.
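The family of GEMM variants the abstract describes shares a common skeleton: a few outer cache-blocking loops (with block sizes tuned to the L3/L2/L1 cache levels) wrapped around a small inner kernel, and the variants differ in how those loops are ordered and which cache each operand block targets. The following is a minimal sketch of that skeleton, not the authors' BLIS code: the block-size names `NC`, `KC`, `MC` follow common BLIS convention, their values here are purely illustrative, and the macro-/micro-kernel is collapsed into plain scalar loops.

```python
NC, KC, MC = 4, 4, 4  # illustrative block sizes; real values are tuned per CPU cache level


def gemm_blocked(A, B, C, m, n, k):
    """C += A @ B for row-major lists of lists, via cache blocking.

    The three outer loops partition the operands into blocks sized to fit
    a level of the cache hierarchy; the innermost loops stand in for the
    architecture-specific macro-/micro-kernel.
    """
    for jc in range(0, n, NC):            # partition B and C into column panels
        for pc in range(0, k, KC):        # partition A and B along the k dimension
            for ic in range(0, m, MC):    # partition A and C into row panels
                # "Macro-kernel": multiply the MC x KC block of A by the
                # KC x NC block of B, accumulating into the MC x NC block of C.
                for i in range(ic, min(ic + MC, m)):
                    for j in range(jc, min(jc + NC, n)):
                        acc = 0.0
                        for p in range(pc, min(pc + KC, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

Reordering these three blocking loops, or changing which operand block is packed and kept resident in which cache, is what produces the distinct members of the algorithm family, and it is why performance diverges once the matrix shapes stop being large and squarish.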