SPEED: A Scalable RISC-V Vector Processor Enabling Efficient Multiprecision DNN Inference
Chuanning Wang;Chao Fang;Xiao Wu;Zhongfeng Wang;Jun Lin
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 1, pp. 207-220, published 2024-10-04. DOI: 10.1109/TVLSI.2024.3466224
Deploying deep neural networks (DNNs) on resource-constrained edge platforms is hindered by their substantial computation and storage demands. Quantized multiprecision DNNs (MP-DNNs) offer a promising solution to these limitations but pose challenges for existing RISC-V processors due to complex instructions, suboptimal parallel processing, and inefficient dataflow mapping. To tackle these challenges, SPEED, a scalable RISC-V vector (RVV) processor, is proposed to enable efficient MP-DNN inference, incorporating innovations in customized instructions, hardware architecture, and dataflow mapping. First, dedicated custom RISC-V instructions are introduced based on the RVV extension to reduce instruction complexity, allowing SPEED to support processing precisions ranging from 4 to 16 bit with minimal hardware overhead. Second, a parameterized multiprecision tensor unit (MPTU) is developed and integrated within the scalable module to enhance parallel processing capability by providing reconfigurable parallelism that matches the computation patterns of diverse MP-DNNs. Finally, a flexible mixed dataflow method is adopted to improve computational and energy efficiency according to the computing patterns of different DNN operators. SPEED is synthesized in TSMC 28-nm technology. Experimental results show that SPEED achieves a peak throughput of 737.9 GOPS and an energy efficiency of 1383.4 GOPS/W for 4-bit operators. Furthermore, SPEED exhibits superior area efficiency compared with prior RVV processors, with improvements of $5.9\sim 26.9\times$ and $8.2\sim 18.5\times$ for 8-bit operators and best integer performance, respectively, which highlights SPEED's significant potential for efficient MP-DNN inference.
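A quick arithmetic check on the reported figures: dividing the 4-bit peak throughput by the 4-bit energy efficiency, 737.9 GOPS / 1383.4 GOPS/W ≈ 0.53 W, gives the implied power draw at that operating point (assuming both numbers refer to the same configuration). To make "multiprecision" concrete, the sketch below shows the kind of integer dot products a quantized MP-DNN layer issues at 4-, 8-, and 16-bit precision; packing two 4-bit operands per byte is a generic illustration of why lower precision raises effective parallelism on a fixed-width datapath. It is not a description of SPEED's custom instructions or MPTU microarchitecture, and all function names here are hypothetical.

```c
/* Hypothetical illustration of multiprecision integer dot products as issued by
 * a quantized MP-DNN layer. This is NOT SPEED's instruction set or MPTU design;
 * it only pins down the 4-/8-/16-bit arithmetic the abstract refers to. */
#include <stdint.h>
#include <stdio.h>

/* Portable sign extension of a 4-bit two's-complement value. */
static int32_t sext4(uint8_t nibble) {
    int32_t v = nibble & 0x0F;
    return (v & 0x08) ? v - 16 : v;
}

/* 4-bit dot product: two operands are packed per byte, so a fixed-width
 * datapath moves and multiplies twice as many elements per access as at 8-bit. */
static int32_t dot_int4(const uint8_t *a, const uint8_t *b, int packed_len) {
    int32_t acc = 0;
    for (int i = 0; i < packed_len; ++i) {
        acc += sext4(a[i] & 0x0F) * sext4(b[i] & 0x0F);  /* low nibbles  */
        acc += sext4(a[i] >> 4)   * sext4(b[i] >> 4);    /* high nibbles */
    }
    return acc;
}

/* 8- and 16-bit variants accumulate into the same 32-bit register width. */
static int32_t dot_int8(const int8_t *a, const int8_t *b, int len) {
    int32_t acc = 0;
    for (int i = 0; i < len; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

static int32_t dot_int16(const int16_t *a, const int16_t *b, int len) {
    int32_t acc = 0;
    for (int i = 0; i < len; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void) {
    uint8_t a4[2] = {0x21, 0xF3};   /* packed 4-bit values 1, 2, 3, -1 */
    uint8_t b4[2] = {0x11, 0x12};   /* packed 4-bit values 1, 1, 2, 1  */
    int8_t  a8[4]  = {1, 2, 3, -1},  b8[4]  = {1, 1, 2, 1};
    int16_t a16[4] = {1, 2, 3, -1},  b16[4] = {1, 1, 2, 1};

    /* All three precisions compute the same dot product (8) here. */
    printf("int4: %d  int8: %d  int16: %d\n",
           dot_int4(a4, b4, 2), dot_int8(a8, b8, 4), dot_int16(a16, b16, 4));
    return 0;
}
```

On real RVV hardware these loops would be vectorized and, in SPEED's case, dispatched through the customized instructions and MPTU described in the paper; the scalar C above is only meant to make the arithmetic explicit.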
Journal introduction:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems has been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.