Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2023-12-14 DOI:10.1016/j.specom.2023.103024
Nan Li , Longbiao Wang , Meng Ge , Masashi Unoki , Sheng Li , Jianwu Dang
{"title":"Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network","authors":"Nan Li ,&nbsp;Longbiao Wang ,&nbsp;Meng Ge ,&nbsp;Masashi Unoki ,&nbsp;Sheng Li ,&nbsp;Jianwu Dang","doi":"10.1016/j.specom.2023.103024","DOIUrl":null,"url":null,"abstract":"<div><p><span><span>Deep learning<span> has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency </span></span>cepstral coefficients, to deep </span>neural networks<span><span> often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called auditory-inspired masked modulation encoder based convolutional </span>attention network<span> (AMME-CANet) that integrates our AMME with CANet. Firstly, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the process of sound-signal transmission to inner ear hair cells and subsequent modulation filtering by neural cells. Secondly, building upon the observed masking effects in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Thirdly, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrated the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.</span></span></p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001589","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet) that integrates our AMME with CANet. Firstly, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the process of sound-signal transmission to inner ear hair cells and subsequent modulation filtering by neural cells. Secondly, building upon the observed masking effects in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Thirdly, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrated the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
使用基于卷积注意力网络的听觉启发式掩蔽调制编码器进行鲁棒语音活动检测
深度学习为语音活动检测(VAD)带来了革命性的变化,提供了前景广阔的解决方案。然而,将原始波形和梅尔频率共振频率系数等传统特征直接应用于深度神经网络,往往会因噪声干扰而导致 VAD 性能下降。相比之下,人类拥有在复杂和嘈杂环境中辨别语音的非凡能力,这促使我们从人类听觉系统中汲取灵感。我们提出了一种稳健的 VAD 算法,称为基于听觉启发的掩蔽调制编码器卷积注意网络(AMME-CANet),它将我们的 AMME 与 CANet 集成在一起。首先,我们研究了作为深度学习编码器(AME)的听觉启发调制特征的设计,有效地模拟了声音信号传输到内耳毛细胞以及神经细胞随后进行调制过滤的过程。其次,基于在人类听觉系统中观察到的掩蔽效应,我们通过在 AMME 中加入掩蔽机制来增强我们的听觉启发调制编码器。AMME 可放大较纯净的语音频率,同时抑制噪声成分。第三,受人类听觉机制的启发并利用上下文信息,我们利用注意力机制进行 VAD。这种方法利用注意力机制,为包含更丰富、更翔实线索的上下文信息分配更高的权重。通过广泛的实验和评估,我们证明了 AMME-CANet 在具有挑战性的噪声条件下增强 VAD 的卓越性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments Forms, factors and functions of phonetic convergence: Editorial Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study Zero-shot voice conversion based on feature disentanglement Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1