A Low-Power Streaming Speech Enhancement Accelerator for Edge Devices

IEEE Open Journal of Circuits and Systems (IF 2.4, JCR Q2: Engineering, Electrical & Electronic) · Pub date: 2024-04-11 · DOI: 10.1109/OJCAS.2024.3387849
Ci-Hao Wu and Tian-Sheuan Chang
{"title":"A Low-Power Streaming Speech Enhancement Accelerator for Edge Devices","authors":"Ci-Hao Wu;Tian-Sheuan Chang","doi":"10.1109/OJCAS.2024.3387849","DOIUrl":null,"url":null,"abstract":"Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high performance model is optimized for hardware execution with the co-design of model compression and target application, which reduces 93.9% of model size by the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40nm CMOS process, the final implementation requires merely 207.8K gates and 53.75KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5MHz frequency.","PeriodicalId":93442,"journal":{"name":"IEEE open journal of circuits and systems","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10496994","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of circuits and systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10496994/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model-compression potential, resulting in greater complexity and reduced hardware efficiency. Moreover, these models are not tailored for streaming, low-power applications. To address these challenges, this paper proposes a low-power streaming speech enhancement accelerator through joint model and hardware optimization. The proposed high-performance model is optimized for hardware execution by co-designing model compression with the target application, reducing model size by 93.9% through the proposed domain-aware and streaming-aware pruning techniques. Latency is further reduced with batch-normalization-based Transformers. Additionally, we employ softmax-free attention, complemented by an extra batch normalization, which enables a simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiply-accumulate (MAC) operations, executed on a 1-D processing array with configurable SRAM addressing, thereby minimizing hardware complexity and simplifying zero skipping. Implemented in the TSMC 40 nm CMOS process, the design requires only 207.8K gates and 53.75 KB of SRAM, and consumes just 8.08 mW for real-time inference at 62.5 MHz.
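The pruning step can be pictured with a generic magnitude-pruning sketch in PyTorch. The paper's domain-aware and streaming-aware criteria are specific to its speech-enhancement model and are not reproduced here; the function name magnitude_prune and the uniform 93.9% sparsity target are illustrative assumptions, not the authors' method.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.939) -> torch.Tensor:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero.

    A minimal stand-in for the paper's domain-aware / streaming-aware
    pruning: it shows only the mechanics of creating the zeros that the
    accelerator's zero-skipping logic can later exploit.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))
```

Applied layer by layer (typically with fine-tuning between pruning rounds), this kind of thresholding is what turns a dense Transformer into the sparse workload the hardware is built around.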
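The softmax-free attention with an extra batch normalization can be sketched as follows. This is a hedged reconstruction from the abstract alone: the module name, the single-head layout, and the placement of the BatchNorm are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SoftmaxFreeAttention(nn.Module):
    """Single-head attention that drops softmax and instead relies on
    batch normalization to keep activations in range (a sketch of the
    idea described in the abstract)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # The "extra" batch normalization standing in for softmax's
        # normalizing effect (placement here is an assumption).
        self.bn = nn.BatchNorm1d(dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # raw scores, no softmax
        out = attn @ v                                   # (batch, time, dim)
        # BatchNorm1d expects (batch, channels, length).
        return self.bn(out.transpose(1, 2)).transpose(1, 2)
```

Removing the softmax eliminates exponentials and the row-wise normalization dependency, which is what lets attention map onto plain MAC hardware; the batch normalization keeps the un-normalized scores from drifting out of a fixed-point range.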
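Finally, the element-wise MAC with zero skipping performed by the 1-D processing array can be modeled in a few lines. This is a behavioral sketch only; the real datapath is fixed-point and driven by configurable SRAM addressing rather than a Python loop.

```python
def mac_1d(weights, activations):
    """Behavioral model of one pass through a 1-D MAC array with
    zero skipping: a multiply-accumulate is issued only when both
    operands are nonzero, so pruned weights cost no work."""
    acc = 0
    issued = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:   # zero skipping: no MAC issued
            continue
        acc += w * a
        issued += 1
    return acc, issued

# With 93.9% of the weights pruned to zero, only a small fraction
# of the nominal MACs are ever issued.
```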