A Background-Noise and Process-Variation-Tolerant 109nW Acoustic Feature Extractor Based on Spike-Domain Divisive-Energy Normalization for an Always-On Keyword Spotting Device

2021 IEEE International Solid-State Circuits Conference (ISSCC) | Pub Date: 2021-02-13 | DOI: 10.1109/ISSCC42613.2021.9365969

Dewei Wang, S. Kim, Minhao Yang, A. Lazar, Mingoo Seok


Citations: 22

Abstract

In mobile and edge devices, always-on keyword spotting (KWS) is an essential function for detecting wake-up words. Recent works have achieved extremely low power dissipation, down to $\sim 500$ nW [1]. However, most of them adopt noise-dependent training, i.e., training for a specific signal-to-noise ratio (SNR) and noise type [1], so their accuracy degrades at SNR levels and noise types not targeted in training (Fig. 9.9.1, top left). To improve robustness, so-called noise-independent training can be considered, which uses training data that includes all possible SNR levels and noise types [2]. However, this approach is challenging for an ultra-low-power device because it demands a large neural network to learn all the possible features. A neural network of a fixed size has a memory capacity limit, and its accuracy plateaus if it has to learn more than that limit (Fig. 9.9.1, top right). Biological acoustic systems, on the other hand, are known to employ a simpler process, called divisive energy normalization (DN), to maintain accuracy even in varying noise conditions [3]. In this work, therefore, we adopt DN and prototype a normalized acoustic feature extractor (NAFE) chip in 65 nm. The NAFE takes an acoustic signal from a microphone and produces spike-rate-coded features. We pair the NAFE with a spiking neural network (SNN) classifier chip [4] to create an end-to-end KWS system. The proposed system achieves 89 to 94% accuracy across -5 to 20 dB SNRs and four different noise types on HeySnips [5], while the baseline without DN achieves a much lower accuracy of 71 to 87%. The NAFE consumes up to 109 nW and the KWS system 570 nW.
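The abstract does not give the DN formula used on the chip, so the following is only a minimal sketch of the generic textbook form of divisive normalization, in which each channel's energy is divided by the pooled energy of all channels plus a small constant. The function name `divisive_normalize`, the constant `sigma`, and the sample band energies are all assumptions made for illustration; the NAFE performs its normalization in the spike domain in hardware, which this software sketch does not model.

```python
def divisive_normalize(energies, sigma=1e-6):
    """Generic divisive normalization (illustrative, not the chip's circuit).

    Each channel energy is divided by the summed energy of all channels
    plus a small constant sigma that avoids division by zero.
    """
    pooled = sum(energies) + sigma
    return [e / pooled for e in energies]


# Hypothetical per-channel band energies for one audio frame.
frame = [0.2, 1.0, 0.4, 0.1]

# The same frame with every channel energy scaled by 4x,
# e.g. a louder recording of the same sound.
louder = [4.0 * e for e in frame]

norm_frame = divisive_normalize(frame)
norm_louder = divisive_normalize(louder)

# The normalized features are nearly identical despite the 4x scaling.
max_diff = max(abs(a - b) for a, b in zip(norm_frame, norm_louder))
print(norm_frame)
print(norm_louder)
print(max_diff)
```

In this sketch the normalized features depend on the relative distribution of energy across channels rather than on the overall level, which is one intuition for why a normalization of this kind can help a small classifier cope with varying input conditions.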
