Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network

IF 3.0; Region 3 (Computer Science); JCR Q2 (Acoustics); Speech Communication; Pub Date: 2024-02-01; Epub Date: 2023-12-14; DOI: 10.1016/j.specom.2023.103024

Nan Li , Longbiao Wang , Meng Ge , Masashi Unoki , Sheng Li , Jianwu Dang

Speech Communication, vol. 157, Article 103024. https://www.sciencedirect.com/science/article/pii/S0167639323001589

Citations: 0

Abstract

Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet) that integrates our AMME with CANet. First, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the process of sound-signal transmission to inner ear hair cells and subsequent modulation filtering by neural cells. Second, building upon the observed masking effects in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism, resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Third, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD, assigning higher weights to contextual information that contains more informative cues. Through extensive experimentation and evaluation, we demonstrated the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.
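The abstract names two mechanisms without giving their equations: a mask that suppresses noisy components, and attention that weights context frames. The pure-Python sketch below illustrates both ideas in their most generic form. It is not the authors' AMME-CANet implementation: the sigmoid gate, the scaled dot-product attention, and every name in it (`apply_mask`, `attend`, `mask_logits`, the toy feature values) are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_mask(features, mask_logits):
    """Gate every feature bin by a value in (0, 1).

    Assumption: the mask is an element-wise sigmoid gate; the abstract
    does not specify how the AMME mask is actually computed.
    """
    return [
        [f * sigmoid(m) for f, m in zip(feat_row, mask_row)]
        for feat_row, mask_row in zip(features, mask_logits)
    ]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames, center):
    """Scaled dot-product attention of frame `center` over all frames.

    Returns the attention weights and the weighted sum of the frames.
    Assumption: generic attention, standing in for the paper's CANet.
    """
    dim = len(frames[center])
    query = frames[center]
    scores = [
        sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(dim)
    ]
    return weights, pooled


# Toy input: 3 frames x 2 bins (hypothetical values, not real audio).
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8]]
# Hypothetical mask logits: keep bin 0, suppress bin 1 in every frame.
mask_logits = [[4.0, -4.0], [4.0, -4.0], [4.0, -4.0]]

masked = apply_mask(features, mask_logits)
weights, pooled = attend(masked, center=0)
```

In this toy case the suppressed bin contributes almost nothing to the dot products, so frames that resemble the center frame in the retained bin receive the larger attention weights.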




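Source journal

Speech Communication (Engineering & Technology; Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Articles published: (not listed)

Review time: 19.2 weeks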

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.