面向边缘设备的低功耗流式语音增强加速器

IF 2.4 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC IEEE open journal of circuits and systems Pub Date : 2024-04-11 DOI:10.1109/OJCAS.2024.3387849

Ci-Hao Wu;Tian-Sheuan Chang

{"title":"面向边缘设备的低功耗流式语音增强加速器","authors":"Ci-Hao Wu;Tian-Sheuan Chang","doi":"10.1109/OJCAS.2024.3387849","DOIUrl":null,"url":null,"abstract":"Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high performance model is optimized for hardware execution with the co-design of model compression and target application, which reduces 93.9% of model size by the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.","PeriodicalId":93442,"journal":{"name":"IEEE open journal of circuits and systems","volume":"5 ","pages":"128-140"},"PeriodicalIF":2.4000,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10496994","citationCount":"0","resultStr":"{\"title\":\"A Low-Power Streaming Speech Enhancement Accelerator for Edge Devices\",\"authors\":\"Ci-Hao Wu;Tian-Sheuan Chang\",\"doi\":\"10.1109/OJCAS.2024.3387849\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high performance model is optimized for hardware execution with the co-design of model compression and target application, which reduces 93.9% of model size by the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.\",\"PeriodicalId\":93442,\"journal\":{\"name\":\"IEEE open journal of circuits and systems\",\"volume\":\"5 \",\"pages\":\"128-140\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10496994\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE open journal of circuits and systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10496994/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of circuits and systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10496994/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

基于变压器的语音增强模型产生了令人印象深刻的结果。然而，它们的异构和复杂结构限制了模型压缩的潜力，导致复杂性增加和硬件效率降低。此外，这些模型并不适合流媒体和低功耗应用。针对这些挑战，本文通过模型和硬件优化，提出了一种低功耗流式语音增强加速器。通过模型压缩和目标应用的协同设计，提出的高性能模型针对硬件执行进行了优化，通过提出的领域感知和流感知剪枝技术，减少了 93.9% 的模型大小。基于批量归一化的转换器进一步降低了所需的延迟。此外，我们还采用了无软最大关注（softmax-free attention），并辅以额外的批量归一化，从而简化了硬件设计。量身定制的硬件可将这些不同的计算模式分解为元素乘法和累加（MAC）。这是通过一维处理阵列，利用可配置的 SRAM 寻址来实现的，从而最大限度地降低了硬件复杂性，简化了跳零过程。利用台积电 40nm CMOS 工艺，最终实现仅需 207.8K 门和 53.75KB SRAM。在 62.5MHz 频率下进行实时推理时，功耗仅为 8.08 mW。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Low-Power Streaming Speech Enhancement Accelerator for Edge Devices

Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high performance model is optimized for hardware execution with the co-design of model compression and target application, which reduces 93.9% of model size by the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE open journal of circuits and systems

自引率

0.00%

发文量

审稿时长

19 weeks