Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers

IF 2.7 3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Pub Date : 2024-11-06 DOI:10.1109/TCAD.2024.3443718

Moritz Scherer;Luka Macan;Victor J. B. Jung;Philip Wiese;Luca Bompani;Alessio Burrello;Francesco Conti;Luca Benini

{"title":"Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers","authors":"Moritz Scherer;Luka Macan;Victor J. B. Jung;Philip Wiese;Luca Bompani;Alessio Burrello;Francesco Conti;Luca Benini","doi":"10.1109/TCAD.2024.3443718","DOIUrl":null,"url":null,"abstract":"With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for the edge applications has become a very active field of research. However, achieving the end-to-end deployment of SLMs on the microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory versus computation tradeoffs involved in the aggressive SLM deployment on the heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates the end-to-end code for executing SLMs, fully exploiting the RV32 cores’ instruction extensions and the NPU. We achieve leading-edge energy and throughput of \n<inline-formula> <tex-math>$490 \\; \\mu $ </tex-math></inline-formula>\nJ per token, at 340 token per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without the external memory.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4009-4020"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10745806/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for the edge applications has become a very active field of research. However, achieving the end-to-end deployment of SLMs on the microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory versus computation tradeoffs involved in the aggressive SLM deployment on the heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates the end-to-end code for executing SLMs, fully exploiting the RV32 cores’ instruction extensions and the NPU. We achieve leading-edge energy and throughput of

$490 \; \mu $

J per token, at 340 token per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without the external memory.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Deeploy：在异构微控制器上实现小型语言模型的高能效部署

随着嵌入式基础模型（EFM），特别是小语言模型（SLM）的兴起，为边缘应用改造变形器已成为一个非常活跃的研究领域。然而，在微控制器（MCU）级芯片上实现 SLM 的端到端部署而无需高带宽的片外主内存访问仍然是一个公开的挑战。在本文中，我们展示了在多核 RISC-V (RV32) MCU 上高效部署端到端 SLM 的情况，该 MCU 增强了 ML 指令扩展和硬件神经处理单元 (NPU)。为了自动探索在异构（多核+NPU）资源上积极部署 SLM 所涉及的受限多维内存与计算之间的权衡，我们引入了 Deeploy，这是一种新型深度神经网络（DNN）编译器，可生成高度优化的 C 代码，只需最少的运行时支持。我们演示了 Deeploy 生成执行 SLM 的端到端代码，充分利用了 RV32 内核的指令扩展和 NPU。我们在TinyStories数据集上训练的SLM首次在没有外部存储器的MCU级设备上运行时，以每秒340个令牌的速度，实现了每令牌490美元的领先能耗和吞吐量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 工程技术-工程：电子与电气

CiteScore

5.60

自引率

13.80%

发文量

500

审稿时长

7 months

期刊介绍： The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.

期刊最新文献

Table of Contents IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems society information IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems publication information Table of Contents IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems publication information