Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs

A. Abdelfattah, S. Tomov, J. Dongarra
{"title":"Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs","authors":"A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2019.00022","DOIUrl":null,"url":null,"abstract":"Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. In particular for the latter, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) using graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that provides a lot of controls to the developer, despite the aforementioned restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 28

Abstract

Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM calls. With the rise of batch linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has become the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that, despite the aforementioned restrictions, gives the developer a high degree of control. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that, on a Tesla V100 GPU, the proposed design outperforms the highly optimized vendor routine for matrix sizes up to 100 by factors between 1.2x and 10x. For extremely small matrices, the observed speedups range between 1.8x and 26x.
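To make the Tensor Core path concrete, the following is a minimal sketch of a batched FP16 GEMM (HGEMM) written against CUDA's public WMMA interface (nvcuda::wmma), one family of low-level vendor APIs for programming Tensor Cores. This is not the paper's optimized kernel: it assumes every matrix in the batch is exactly 16x16 (the smallest Tensor Core tile shape), assigns one warp per problem, and the kernel and parameter names (hgemm_batched_16, d_A, d_B, d_C, batch_count) are illustrative assumptions. Compiling it requires a Volta-class or newer GPU (e.g., -arch=sm_70).

// Illustrative sketch only: one warp (32 threads) per 16x16x16 problem,
// all matrices padded to the Tensor Core tile, row-major, leading dimension 16.
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void hgemm_batched_16(const half* const* d_A,
                                 const half* const* d_B,
                                 half* const* d_C,
                                 int batch_count)
{
    int batch_id = blockIdx.x;            // one thread block (one warp) per matrix
    if (batch_id >= batch_count) return;

    // Declare the 16x16x16 FP16 fragments consumed by the Tensor Cores.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));        // C = 0

    // Warp-wide cooperative loads of the A and B tiles.
    wmma::load_matrix_sync(a_frag, d_A[batch_id], 16);
    wmma::load_matrix_sync(b_frag, d_B[batch_id], 16);

    // One Tensor Core multiply-accumulate: C = A * B + C.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Warp-wide store of the 16x16 result tile.
    wmma::store_matrix_sync(d_C[batch_id], c_frag, 16, wmma::mem_row_major);
}

// Hypothetical launch: 32 threads (one warp) per block, one block per batch entry.
// hgemm_batched_16<<<batch_count, 32>>>(d_A, d_B, d_C, batch_count);

A kernel like this illustrates the constraint the paper works around: the WMMA interface only exposes a few discrete tile shapes (e.g., 16x16x16), so matrices smaller than a tile leave the Tensor Core units underutilized unless problems are packed or batched more aggressively, which is where the paper's design effort goes.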