{"title":"Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors","authors":"Vasileios Titopoulos;Kosmas Alexandridis;Christodoulos Peltekis;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/TC.2025.3533083","DOIUrl":null,"url":null,"abstract":"Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training, or inference, heavily relies on matrix multiplications that can be efficiently executed on vector processors, or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would reap even higher performance. The newly proposed instruction is called <monospace>vindexmac</monospace>, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and it reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of a custom instruction improves runtime by 25% and 33%, when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 4","pages":"1446-1460"},"PeriodicalIF":3.6000,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10852517/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training, or inference, heavily relies on matrix multiplications that can be efficiently executed on vector processors, or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would reap even higher performance. The newly proposed instruction is called vindexmac, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and it reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of a custom instruction improves runtime by 25% and 33%, when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.
期刊介绍:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.