Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li
DOI: 10.1109/icassp48485.2024.10446945 (https://doi.org/10.1109/icassp48485.2024.10446945)
Source: ArXiv, published 2024-03-09
sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks
Speech applications are expected to be low-power and robust under noisy conditions. An effective Voice Activity Detection (VAD) front-end reduces the computational load. Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient. However, SNN-based VADs have yet to achieve noise robustness and often require large models to reach high performance. This paper introduces a novel SNN-based VAD model, referred to as sVAD, which features an auditory encoder with an SNN-based attention mechanism. In particular, the encoder provides an effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms. The classifier uses Spiking Recurrent Neural Networks (sRNN) to exploit temporal speech information. Experimental results demonstrate that our sVAD achieves remarkable noise robustness while maintaining low power consumption and a small footprint, making it a promising solution for real-world VAD applications.
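To make two of the building blocks named in the abstract more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It shows (1) a SincNet-style band-pass kernel, built as the difference of two windowed-sinc low-pass filters, and (2) one time step of a recurrent layer of leaky integrate-and-fire (LIF) neurons, a common spiking neuron model. The function names, the Hamming window, the cutoff frequencies, the decay factor, the threshold, and the hard reset are all assumptions chosen for illustration; the abstract does not specify any of them, and the attention mechanism is not covered here.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size, sample_rate):
    """Band-pass FIR kernel: difference of two windowed-sinc low-pass filters.

    f_low and f_high are cutoffs in Hz (hypothetical values in the usage note);
    kernel_size should be odd so the kernel has a single centre tap.
    """
    half = kernel_size // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half
        t = n / sample_rate

        def lowpass(fc):
            # Ideal low-pass impulse response with cutoff fc: 2*fc*sinc(2*fc*t)
            x = 2.0 * math.pi * fc * t
            return 2.0 * fc * (1.0 if n == 0 else math.sin(x) / x)

        # Hamming window (an assumption) to reduce ripple from truncation
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (kernel_size - 1))
        kernel.append((lowpass(f_high) - lowpass(f_low)) * window / sample_rate)
    return kernel


def gain_at(kernel, freq, sample_rate):
    """Magnitude of the kernel's frequency response at freq (Hz)."""
    half = len(kernel) // 2
    re = sum(h * math.cos(2.0 * math.pi * freq * (i - half) / sample_rate)
             for i, h in enumerate(kernel))
    im = sum(h * math.sin(2.0 * math.pi * freq * (i - half) / sample_rate)
             for i, h in enumerate(kernel))
    return math.hypot(re, im)


def lif_recurrent_step(inputs, prev_spikes, potentials, w_in, w_rec,
                       decay=0.9, threshold=1.0):
    """One time step of a recurrent layer of leaky integrate-and-fire neurons.

    Each neuron leaks its membrane potential by `decay`, adds weighted input
    and weighted spikes from the previous step, and emits a spike (1) when the
    potential reaches `threshold`, after which it is reset to zero.
    """
    spikes, new_potentials = [], []
    for j in range(len(potentials)):
        current = sum(w_in[j][i] * inputs[i] for i in range(len(inputs)))
        current += sum(w_rec[j][k] * prev_spikes[k]
                       for k in range(len(prev_spikes)))
        v = decay * potentials[j] + current
        if v >= threshold:
            spikes.append(1)
            new_potentials.append(0.0)  # hard reset (an assumption)
        else:
            spikes.append(0)
            new_potentials.append(v)
    return spikes, new_potentials
```

As a usage example under the same assumptions, `sinc_bandpass(300.0, 3000.0, 101, 16000.0)` yields a 101-tap kernel whose response can be inspected with `gain_at`, and calling `lif_recurrent_step` repeatedly while feeding back the returned spikes and potentials steps the layer through time. In a SincNet-style encoder only the two cutoffs per filter would be learned, which is one reason such a front-end can stay small.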