SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM

Proceedings of the 49th Annual International Symposium on Computer Architecture. Pub Date: 2022-05-03. DOI: 10.1145/3470496.3527411

Yunan Zhang, Po-An Tsai, Hung-Wei Tseng


Citations: 6

Abstract

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup and more than 6.94× on average over optimized CUDA programs, with only 5% of full-chip area overhead.
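To make the "same structure, different core operation" observation concrete, the following is a minimal Python sketch, not the paper's instruction set or emulation framework. It assumes a generalized operation of the form D = C ⊕ (A ⊗ B), where the pair of scalar operators is swappable. All names here (`semiring_matmul`, `multiply_add`, `min_plus`, `all_pairs_shortest_paths`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Matrix = List[List[float]]
INF = float("inf")


def semiring_matmul(
    a: Matrix,
    b: Matrix,
    c: Matrix,
    combine: Callable[[float, float], float],
    reduce: Callable[[float, float], float],
) -> Matrix:
    """Return D with D[i][j] = reduce(C[i][j], combine(A[i][k], B[k][j]) over all k).

    The loop nest is identical to ordinary matrix multiplication; only the two
    scalar operators differ (hypothetical helper, not an API from the paper).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from the accumulator matrix C
    for i in range(rows):
        for j in range(cols):
            acc = d[i][j]
            for k in range(inner):
                acc = reduce(acc, combine(a[i][k], b[k][j]))
            d[i][j] = acc
    return d


def multiply_add(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    # Ordinary GEMM-style update: D = C + A * B.
    return semiring_matmul(a, b, c, lambda x, y: x * y, lambda x, y: x + y)


def min_plus(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    # Add-minimum update: D[i][j] = min(C[i][j], min_k(A[i][k] + B[k][j])).
    return semiring_matmul(a, b, c, lambda x, y: x + y, min)


def all_pairs_shortest_paths(weights: Matrix) -> Matrix:
    """Shortest path lengths by repeated min-plus squaring of the weight matrix.

    Assumes weights[i][i] == 0 and INF marks a missing edge.
    """
    dist = [row[:] for row in weights]
    hops = 1  # dist currently covers paths of at most `hops` edges
    while hops < len(weights) - 1:
        dist = min_plus(dist, dist, dist)
        hops *= 2
    return dist
```

In this sketch, all-pairs shortest paths needs nothing beyond the matrix-multiplication loop nest with the operator pair swapped, which is the property the abstract points to when it says such algorithms could run on a generalized matrix unit. The pure-Python loops only illustrate the semantics and say nothing about the hardware design or the reported speedups.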

