Strassen’s Algorithm Reloaded on GPUs

ACM Transactions on Mathematical Software (TOMS) Pub Date : 2020-03-20 DOI:10.1145/3372419

Jianyu Huang, Chenhan D. Yu, R. Geijn


Citations: 12

Abstract

Conventional Graphics Processing Unit (GPU) implementations of Strassen’s algorithm (Strassen) rely on the existing high-performance matrix multiplication (gemm), trading space for time. As a result, such approaches achieve practical speedup only for relatively large, “squarish” matrices because of the extra memory overhead, and their use is limited by the considerable workspace they require. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from gemm, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance of gemm and Strassen. Overall, our 1-level Strassen can achieve up to 1.11× speedup with a crossover point as small as 1,536 compared to cublasSgemm on an NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19× speedup with a crossover point at 7,680.
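
For readers unfamiliar with the term, the "1-level Strassen" mentioned in the abstract can be illustrated with a minimal plain-Python sketch of the textbook one-level recursion on a 2×2 block partition. This is not the authors' GPU implementation: the function names (`matmul`, `add`, `quadrants`, `strassen_1level`) are hypothetical, the matrices are assumed square with even order, and the sketch allocates temporaries freely, which is exactly the workspace cost the paper's fused primitives are described as avoiding.

```python
def matmul(a, b):
    """Reference triple-loop product of two dense matrices (lists of rows)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(x, y, sign=1):
    """Elementwise x + sign * y for two equally sized matrices."""
    return [[xv + sign * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def quadrants(m):
    """Split a square matrix of even order into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_1level(a, b):
    """One level of Strassen: 7 block products instead of the usual 8."""
    a00, a01, a10, a11 = quadrants(a)
    b00, b01, b10, b11 = quadrants(b)

    # The seven block products of the textbook formulation.
    m0 = matmul(add(a00, a11), add(b00, b11))
    m1 = matmul(add(a10, a11), b00)
    m2 = matmul(a00, add(b01, b11, -1))
    m3 = matmul(a11, add(b10, b00, -1))
    m4 = matmul(add(a00, a01), b11)
    m5 = matmul(add(a10, a00, -1), add(b00, b01))
    m6 = matmul(add(a01, a11, -1), add(b10, b11))

    # Combine the products into the four quadrants of C.
    c00 = add(add(add(m0, m3), m4, -1), m6)
    c01 = add(m2, m4)
    c10 = add(m1, m3)
    c11 = add(add(add(m0, m1, -1), m2), m5)

    # Stitch the quadrants back into one matrix.
    top = [r0 + r1 for r0, r1 in zip(c00, c01)]
    bottom = [r0 + r1 for r0, r1 in zip(c10, c11)]
    return top + bottom
```

Because one level replaces 8 block multiplications with 7, the best possible speedup from multiplication count alone is 8/7 ≈ 1.14 for one level and (8/7)² ≈ 1.31 for two levels. The reported 1.11× and 1.19× sit below these bounds, which is what one would expect once the extra additions and memory traffic are accounted for.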

