Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors

IF 3.8 · Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · IEEE Transactions on Computers · Pub Date: 2025-01-24 · DOI: 10.1109/TC.2025.3533083
Vasileios Titopoulos;Kosmas Alexandridis;Christodoulos Peltekis;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos
IEEE Transactions on Computers, volume 74, issue 4, pages 1446-1460. Journal Article. https://ieeexplore.ieee.org/document/10852517/
Citations: 0

Abstract

Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training or inference, relies heavily on matrix multiplications that can be executed efficiently on vector processors or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling, are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would reap even higher performance. The newly proposed instruction is called vindexmac, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of a custom instruction improves runtime by 25% and 33%, when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.
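To make the index-then-multiply-accumulate idea in the abstract concrete, here is a minimal scalar sketch in Python. It is not the authors' kernel: it assumes N:M structured sparsity (at most n nonzeros in every block of m consecutive entries of a row of the sparse matrix) and a row-wise product in which each stored nonzero selects one row of the dense matrix by its column index. The function names `compress_n_m` and `structured_sparse_matmul` are hypothetical, chosen only for this illustration.

```python
def compress_n_m(dense_row, n, m):
    """Compress one row to N:M form: n (value, column) pairs per block of m.

    Blocks with fewer than n nonzeros are padded with explicit zeros so
    that every block contributes exactly n pairs (a fixed-size layout).
    """
    values, cols = [], []
    for start in range(0, len(dense_row), m):
        block = [(v, start + k)
                 for k, v in enumerate(dense_row[start:start + m]) if v != 0]
        if len(block) > n:
            raise ValueError("row violates the assumed N:M sparsity pattern")
        while len(block) < n:
            block.append((0, start))  # zero padding, harmless in the product
        for v, c in block:
            values.append(v)
            cols.append(c)
    return values, cols


def structured_sparse_matmul(a_dense, b, n, m):
    """Compute C = A x B where A is N:M structured-sparse and B is dense."""
    width = len(b[0])
    c = []
    for a_row in a_dense:
        values, cols = compress_n_m(a_row, n, m)
        acc = [0] * width
        for v, col in zip(values, cols):
            b_row = b[col]  # indirect read: the stored index picks a row of B
            acc = [x + v * y for x, y in zip(acc, b_row)]  # multiply-accumulate
        c.append(acc)
    return c


if __name__ == "__main__":
    a = [[1, 0, 0, 2],
         [0, 3, 0, 0]]          # 1:2 sparsity: one nonzero per pair of entries
    b = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(structured_sparse_matmul(a, b, n=1, m=2))
```

In this sketch the inner loop body is one index lookup followed by one multiply-accumulate over a whole row of B. That is the pair of steps the abstract describes vindexmac as combining, with the indexed rows read from the vector register file rather than from memory; how the real instruction encodes its operands is not stated in the abstract and is not modeled here.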
Source journal
IEEE Transactions on Computers (Engineering & Technology: Electrical & Electronic Engineering)
CiteScore: 6.60
Self-citation rate: 5.40%
Articles published: 199
Review time: 6.0 months
Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.
Latest articles from this journal
Optimized NTT Architecture Based on the Plantard Algorithm for ML-KEM and ML-DSA
Secure and Efficient Read-Write Synchronization in Re-Sharding Via Lightweight Global State Tree
FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration With Dual-Engine Overlay Architecture
Securing IoT Authentication Against Modeling Attacks by PUF-Protocol Co-Design
APACHE: A Processing-Near-Memory Architecture for Multi-Scheme Fully Homomorphic Encryption