Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs

A. Abdelfattah, S. Tomov, J. Dongarra
{"title":"Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs","authors":"A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2019.00022","DOIUrl":null,"url":null,"abstract":"Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) using graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of controls to the developer, despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 28

Abstract

Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM calls. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has become the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that, despite the aforementioned restrictions, gives the developer a high degree of control. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that, on a Tesla V100 GPU, the proposed design outperforms the highly optimized vendor routine for matrix sizes up to 100 by factors between 1.2x and 10x. For extremely small matrices, the observed speedups range between 1.8x and 26x.
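To make the Tensor Core path concrete, the following is a minimal sketch of a batched FP16 GEMM (HGEMM) written against CUDA's public WMMA interface (nvcuda::wmma), one family of low-level vendor APIs for programming Tensor Cores. This is not the paper's optimized kernel: it assumes every matrix in the batch is exactly 16x16 (the smallest Tensor Core tile shape), assigns one warp per problem, and the kernel and parameter names (hgemm_batched_16, d_A, d_B, d_C, batch_count) are illustrative assumptions. Compiling it requires a Volta-class or newer GPU (e.g., -arch=sm_70).

// Illustrative sketch only: one warp (32 threads) per 16x16x16 problem,
// all matrices padded to the Tensor Core tile, row-major, leading dimension 16.
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void hgemm_batched_16(const half* const* d_A,
                                 const half* const* d_B,
                                 half* const* d_C,
                                 int batch_count)
{
    int batch_id = blockIdx.x;            // one thread block (one warp) per matrix
    if (batch_id >= batch_count) return;

    // Declare the 16x16x16 FP16 fragments consumed by the Tensor Cores.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));        // C = 0

    // Warp-wide cooperative loads of the A and B tiles.
    wmma::load_matrix_sync(a_frag, d_A[batch_id], 16);
    wmma::load_matrix_sync(b_frag, d_B[batch_id], 16);

    // One Tensor Core multiply-accumulate: C = A * B + C.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Warp-wide store of the 16x16 result tile.
    wmma::store_matrix_sync(d_C[batch_id], c_frag, 16, wmma::mem_row_major);
}

// Hypothetical launch: 32 threads (one warp) per block, one block per batch entry.
// hgemm_batched_16<<<batch_count, 32>>>(d_A, d_B, d_C, batch_count);

A kernel like this illustrates the constraint the paper works around: the WMMA interface only exposes a few discrete tile shapes (e.g., 16x16x16), so matrices smaller than a tile leave the Tensor Core units underutilized unless problems are packed or batched more aggressively, which is where the paper's design effort goes.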