Anatomy of the BLIS Family of Algorithms for Matrix Multiplication

Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. Igual
{"title":"矩阵乘法的BLIS算法族剖析","authors":"Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. Igual","doi":"10.1109/pdp55904.2022.00023","DOIUrl":null,"url":null,"abstract":"The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.","PeriodicalId":210759,"journal":{"name":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Anatomy of the BLIS Family of Algorithms for Matrix Multiplication\",\"authors\":\"Adrián Castelló, E. S. Quintana‐Ortí, Francisco D. 
Igual\",\"doi\":\"10.1109/pdp55904.2022.00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. 
Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.\",\"PeriodicalId\":210759,\"journal\":{\"name\":\"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/pdp55904.2022.00023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/pdp55904.2022.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

The efforts of the scientific community and hardware vendors to develop and optimize linear algebra codes have historically led to highly-tuned libraries, carefully adapted to the underlying processor architecture, with excellent (near-peak) performance. These optimization efforts, however, are commonly focused on obtaining the best performance possible when the involved operands are large and “squarish” matrices. New computationally-intensive applications (e.g., in deep learning) are increasingly demanding high-performance BLAS (Basic Linear Algebra Subprograms) also for small operands in any of their dimensions. In this paper, we tackle this problem by refactoring the general matrix-matrix multiplication (GEMM) algorithm within a specific high-performance implementation of BLAS, named BLIS, proposing a complete family of algorithmic variants to implement GEMM with different strategies to exploit the target cache hierarchy, together with the changes to be applied to architecture-specific codes to instantiate a complete GEMM implementation. Experimental results on an ARM processor (NVIDIA Carmel) reveal significant performance differences between the members of the GEMM family, depending on the shape and dimension of the matrix operands.
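The family of GEMM variants the abstract describes shares a common skeleton: a few outer cache-blocking loops (with block sizes tuned to the L3/L2/L1 cache levels) wrapped around a small inner kernel, and the variants differ in how those loops are ordered and which cache each operand block targets. The following is a minimal sketch of that skeleton, not the authors' BLIS code: the block-size names `NC`, `KC`, `MC` follow common BLIS convention, their values here are purely illustrative, and the macro-/micro-kernel is collapsed into plain scalar loops.

```python
NC, KC, MC = 4, 4, 4  # illustrative block sizes; real values are tuned per CPU cache level


def gemm_blocked(A, B, C, m, n, k):
    """C += A @ B for row-major lists of lists, via cache blocking.

    The three outer loops partition the operands into blocks sized to fit
    a level of the cache hierarchy; the innermost loops stand in for the
    architecture-specific macro-/micro-kernel.
    """
    for jc in range(0, n, NC):            # partition B and C into column panels
        for pc in range(0, k, KC):        # partition A and B along the k dimension
            for ic in range(0, m, MC):    # partition A and C into row panels
                # "Macro-kernel": multiply the MC x KC block of A by the
                # KC x NC block of B, accumulating into the MC x NC block of C.
                for i in range(ic, min(ic + MC, m)):
                    for j in range(jc, min(jc + NC, n)):
                        acc = 0.0
                        for p in range(pc, min(pc + KC, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

Reordering these three blocking loops, or changing which operand block is packed and kept resident in which cache, is what produces the distinct members of the algorithm family, and it is why performance diverges once the matrix shapes stop being large and squarish.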