SPEED: A Scalable RISC-V Vector Processor Enabling Efficient Multiprecision DNN Inference
Chuanning Wang;Chao Fang;Xiao Wu;Zhongfeng Wang;Jun Lin
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 1, pp. 207-220, published 2024-10-04. DOI: 10.1109/TVLSI.2024.3466224
Deploying deep neural networks (DNNs) on resource-constrained edge platforms is hindered by their substantial computation and storage demands. Quantized multiprecision DNNs (MP-DNNs) offer a promising solution to these limitations but pose challenges for existing RISC-V processors due to complex instructions, suboptimal parallel processing, and inefficient dataflow mapping. To tackle these challenges, SPEED, a scalable RISC-V vector (RVV) processor, is proposed to enable efficient MP-DNN inference, incorporating innovations in customized instructions, hardware architecture, and dataflow mapping. First, dedicated custom RISC-V instructions are introduced based on the RVV extension to reduce instruction complexity, allowing SPEED to support processing precisions ranging from 4 to 16 bit with minimal hardware overhead. Second, a parameterized multiprecision tensor unit (MPTU) is developed and integrated within the scalable module to enhance parallel processing capability by providing reconfigurable parallelism that matches the computation patterns of diverse MP-DNNs. Finally, a flexible mixed dataflow method is adopted to improve computational and energy efficiency according to the computing patterns of different DNN operators. SPEED is synthesized in TSMC 28-nm technology. Experimental results show that SPEED achieves a peak throughput of 737.9 GOPS and an energy efficiency of 1383.4 GOPS/W for 4-bit operators. Furthermore, SPEED exhibits superior area efficiency compared with prior RVV processors, with improvements of $5.9\sim 26.9\times$ and $8.2\sim 18.5\times$ for 8-bit operators and best integer performance, respectively, which highlights SPEED's significant potential for efficient MP-DNN inference.
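A quick arithmetic check on the reported figures: dividing the 4-bit peak throughput by the 4-bit energy efficiency, 737.9 GOPS / 1383.4 GOPS/W ≈ 0.53 W, gives the implied power draw at that operating point (assuming both numbers refer to the same configuration). To make "multiprecision" concrete, the sketch below shows the kind of integer dot products a quantized MP-DNN layer issues at 4-, 8-, and 16-bit precision; packing two 4-bit operands per byte is a generic illustration of why lower precision raises effective parallelism on a fixed-width datapath. It is not a description of SPEED's custom instructions or MPTU microarchitecture, and all function names here are hypothetical.

```c
/* Hypothetical illustration of multiprecision integer dot products as issued by
 * a quantized MP-DNN layer. This is NOT SPEED's instruction set or MPTU design;
 * it only pins down the 4-/8-/16-bit arithmetic the abstract refers to. */
#include <stdint.h>
#include <stdio.h>

/* Portable sign extension of a 4-bit two's-complement value. */
static int32_t sext4(uint8_t nibble) {
    int32_t v = nibble & 0x0F;
    return (v & 0x08) ? v - 16 : v;
}

/* 4-bit dot product: two operands are packed per byte, so a fixed-width
 * datapath moves and multiplies twice as many elements per access as at 8-bit. */
static int32_t dot_int4(const uint8_t *a, const uint8_t *b, int packed_len) {
    int32_t acc = 0;
    for (int i = 0; i < packed_len; ++i) {
        acc += sext4(a[i] & 0x0F) * sext4(b[i] & 0x0F);  /* low nibbles  */
        acc += sext4(a[i] >> 4)   * sext4(b[i] >> 4);    /* high nibbles */
    }
    return acc;
}

/* 8- and 16-bit variants accumulate into the same 32-bit register width. */
static int32_t dot_int8(const int8_t *a, const int8_t *b, int len) {
    int32_t acc = 0;
    for (int i = 0; i < len; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

static int32_t dot_int16(const int16_t *a, const int16_t *b, int len) {
    int32_t acc = 0;
    for (int i = 0; i < len; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void) {
    uint8_t a4[2] = {0x21, 0xF3};   /* packed 4-bit values 1, 2, 3, -1 */
    uint8_t b4[2] = {0x11, 0x12};   /* packed 4-bit values 1, 1, 2, 1  */
    int8_t  a8[4]  = {1, 2, 3, -1},  b8[4]  = {1, 1, 2, 1};
    int16_t a16[4] = {1, 2, 3, -1},  b16[4] = {1, 1, 2, 1};

    /* All three precisions compute the same dot product (8) here. */
    printf("int4: %d  int8: %d  int16: %d\n",
           dot_int4(a4, b4, 2), dot_int8(a8, b8, 4), dot_int16(a16, b16, 4));
    return 0;
}
```

On real RVV hardware these loops would be vectorized and, in SPEED's case, dispatched through the customized instructions and MPTU described in the paper; the scalar C above is only meant to make the arithmetic explicit.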
Journal introduction:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems has been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.