A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks

Alessandro Nadalini, Georg Rutishauser, A. Burrello, Nazareno Bruschi, Angelo Garofalo, L. Benini, Francesco Conti, D. Rossi

2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), June 20, 2023. DOI: 10.1109/ISVLSI59464.2023.10238679
The emerging trend of deploying complex algorithms, such as Deep Neural Networks (DNNs), poses increasingly strict memory and energy-efficiency requirements on Internet-of-Things (IoT) end-nodes. Mixed-precision quantization has been proposed as a technique to minimize a DNN's memory footprint and maximize its execution efficiency, with negligible end-to-end precision degradation. In this work, we present a novel hardware and software stack for energy-efficient inference of mixed-precision Quantized Neural Networks (QNNs). We introduce Flex-V, a processor based on the RISC-V Instruction Set Architecture (ISA) that features fused Mac&Load mixed-precision dot-product instructions; to avoid the exponential growth of the encoding space caused by the mixed-precision variants, the operand formats are encoded in Control and Status Registers (CSRs). The Flex-V core is integrated into a tightly-coupled cluster of eight processors; in addition, we provide a full framework for the end-to-end deployment of DNNs, including a compiler, optimized libraries, and a memory-aware deployment flow. Implemented in a commercial 22 nm FDX technology, the cluster achieves up to 91.5 MAC/cycle and 3.26 TOPS/W, with up to 8.5× speed-up and an area overhead of only 5.6% with respect to the baseline. To demonstrate the capabilities of the architecture, we benchmark it on end-to-end real-life QNNs, improving performance by 2× to 2.5× with respect to existing solutions based on fully flexible programmable processors.
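To make the core mechanism concrete, below is a minimal, illustrative C sketch of the arithmetic that a mixed-precision dot-product instruction of this kind performs on sub-byte operands packed into 32-bit registers, with the operand widths selected by a CSR-like format descriptor. All names here (mp_fmt_t, mp_sdotp) and the lane-packing convention are hypothetical illustrations, not the actual Flex-V ISA: the real instruction encodings, CSR layout, and the fused load half of Mac&Load are defined in the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical format descriptor standing in for the CSR fields that,
 * per the abstract, select the mixed-precision variant (2, 4, or 8 bits). */
typedef struct { unsigned a_bits, w_bits; } mp_fmt_t;

/* Sign-extend the low `bits` bits of `v`. */
static int32_t sext(uint32_t v, unsigned bits) {
    uint32_t m = 1u << (bits - 1);
    return (int32_t)((v ^ m) - m);
}

/* One dot-product-accumulate step over two packed 32-bit registers:
 * the "MAC" half of a fused Mac&Load step. The lane count is set by the
 * wider operand; lanes of the narrower operand are assumed to be packed
 * in the low bits of its register. */
static int32_t mp_sdotp(uint32_t a_reg, uint32_t w_reg, int32_t acc,
                        mp_fmt_t fmt) {
    unsigned wide = fmt.a_bits > fmt.w_bits ? fmt.a_bits : fmt.w_bits;
    unsigned n = 32u / wide;                       /* elements per register */
    for (unsigned i = 0; i < n; i++) {
        int32_t a = sext((a_reg >> (i * fmt.a_bits)) & ((1u << fmt.a_bits) - 1),
                         fmt.a_bits);
        int32_t w = sext((w_reg >> (i * fmt.w_bits)) & ((1u << fmt.w_bits) - 1),
                         fmt.w_bits);
        acc += a * w;                              /* multiply-accumulate  */
    }
    return acc;
}

int main(void) {
    mp_fmt_t fmt = { .a_bits = 8, .w_bits = 4 };
    uint32_t a = 0x0403FE01u;   /* 8-bit activations {1, -2, 3, 4} */
    uint32_t w = 0x00003F12u;   /* 4-bit weights     {2, 1, -1, 3} */
    printf("%d\n", mp_sdotp(a, w, 0, fmt));  /* 1*2 - 2*1 - 3*1 + 4*3 = 9 */
    return 0;
}
```

In hardware, executing all packed lanes in a single fused instruction, rather than as a loop of extract, sign-extend, and multiply steps as in this software model, is the kind of per-lane parallelism behind the MAC/cycle figures reported above.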