DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Angelo Garofalo, Yvan Tortorella, Matteo Perotti, Luca Valente, Alessandro Nadalini, Luca Benini, Davide Rossi, Francesco Conti
IEEE Open Journal of the Solid-State Circuits Society, vol. 2, pp. 231-243, published 2022-09-27.
DOI: 10.1109/OJSSCS.2022.3210082
PDF: https://ieeexplore.ieee.org/iel7/8782712/9733783/09903915.pdf
Citations: 7

Abstract

On-chip deep neural network (DNN) inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost the performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal-overhead datamover to marshal 1–32-b data on the fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency—enough to enable on-chip floating-point training at competitive speed coupled with ultralow-power quantized inference.
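The efficiency of the cluster's mixed-precision datapath comes from packing many narrow operands into each 32-bit word, so one SIMD MAC instruction processes several 2-b elements at once. The C sketch below is a functional illustration of that idea, not the DARKSIDE ISA or its actual intrinsics: it unpacks sixteen sign-extended 2-b weights from a single 32-bit word and accumulates them against 8-b activations, which is the operation a hardware 2-b SIMD MAC would perform in one sweep. All function names here are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 2-bit field: raw values 0..3 map to 0, 1, -2, -1. */
static int32_t sext2(uint32_t v)
{
    return (int32_t)(v << 30) >> 30;
}

/* Dot product of 16 signed 8-b activations with 16 weights packed
 * as 2-b fields (LSB-first) in one 32-bit word. A scalar model of
 * what a packed-SIMD MAC unit computes per instruction. */
int32_t dotp_w2_a8(uint32_t packed_w, const int8_t a[16])
{
    int32_t acc = 0;
    for (int i = 0; i < 16; i++) {
        int32_t w = sext2((packed_w >> (2 * i)) & 0x3u);
        acc += w * (int32_t)a[i];
    }
    return acc;
}
```

Because the sixteen multiply-accumulates above collapse into a single hardware instruction, a datapath of this style can sustain far more MAC/cycle at 2-b precision than at 32-b, which is the scaling behind the 2-b peak-efficiency figure quoted in the abstract.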