DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity

IF 1.4; Zone 3 (Computer Science); JCR Q4 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE); IEEE Computer Architecture Letters; Pub Date: 2024-01-17; DOI: 10.1109/LCA.2024.3355178

Christodoulos Peltekis; Vasileios Titopoulos; Chrysostomos Nicopoulos; Giorgos Dimitrakopoulos

Published in IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 17-20, 2024-01-17. Journal article; not open access. IEEE Xplore: https://ieeexplore.ieee.org/document/10402073/

Citations: 0



DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity

Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of $N$:128, or $N$:256, for small values of $N$, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and $N$ read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
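The row-wise product order described in the abstract can be illustrated in software. The sketch below is not the authors' hardware design; it is a minimal pure-Python model under stated assumptions: the sparse operand is called `a`, the dense operand `b`, each aligned block of `m` values in a row of `a` holds up to `n` non-zeros, and each non-zero's column index is used to fetch one row of `b` (standing in for one read port). The function names `compress_row` and `row_wise_spmm` are hypothetical.

```python
from typing import List, Tuple

Row = List[float]


def compress_row(row: Row, n: int, m: int) -> List[Tuple[int, float]]:
    """Pack a row into (column index, value) pairs for its non-zeros.

    Assumption: N:M sparsity means each aligned block of m consecutive
    values contains at most n non-zeros; a denser block is rejected.
    """
    packed: List[Tuple[int, float]] = []
    for start in range(0, len(row), m):
        block = [(start + j, v)
                 for j, v in enumerate(row[start:start + m]) if v != 0]
        if len(block) > n:
            raise ValueError(f"block at column {start} exceeds {n}:{m} sparsity")
        packed.extend(block)
    return packed


def row_wise_spmm(a: List[Row], b: List[Row], n: int, m: int) -> List[Row]:
    """Compute C = A @ B one output row at a time (row-wise product order)."""
    cols = len(b[0])
    c: List[Row] = []
    for a_row in a:
        out = [0.0] * cols
        # Each (k, a_ik) pair models one read port: the column index k of a
        # non-zero in A selects row k of B, which is scaled and accumulated.
        for k, a_ik in compress_row(a_row, n, m):
            b_row = b[k]
            for j in range(cols):
                out[j] += a_ik * b_row[j]
        c.append(out)
    return c


# Usage: a 2:4-sparse A (2 x 4) times a dense B (4 x 2).
a = [[0, 2, 0, 0],
     [1, 0, 0, 3]]
b = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]
print(row_wise_spmm(a, b, n=2, m=4))  # [[6.0, 8.0], [22.0, 26.0]]
```

In this model the inner work per output row scales with the number of non-zeros in the row of `a`, not with its full width, which is the property that makes wide blocks such as $N$:128 attractive when $N$ is small.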


Source journal

IEEE Computer Architecture Letters (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore: 4.60

Self-citation rate: 4.30%

Journal description: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.
