EarSE

IF 4.5 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Pub Date : 2024-01-12 DOI:10.1145/3631447

Di Duan, Yongliang Chen, Weitao Xu, Tianxing Li

{"title":"EarSE","authors":"Di Duan, Yongliang Chen, Weitao Xu, Tianxing Li","doi":"10.1145/3631447","DOIUrl":null,"url":null,"abstract":"Speech enhancement is regarded as the key to the quality of digital communication and is gaining increasing attention in the research field of audio processing. In this paper, we present EarSE, the first robust, hands-free, multi-modal speech enhancement solution using commercial off-the-shelf headphones. The key idea of EarSE is a novel hardware setting---leveraging the form factor of headphones equipped with a boom microphone to establish a stable acoustic sensing field across the user's face. Furthermore, we designed a sensing methodology based on Frequency-Modulated Continuous-Wave, which is an ultrasonic modality sensitive to capture subtle facial articulatory gestures of users when speaking. Moreover, we design a fully attention-based deep neural network to self-adaptively solve the user diversity problem by introducing the Vision Transformer network. We enhance the collaboration between the speech and ultrasonic modalities using a multi-head attention mechanism and a Factorized Bilinear Pooling gate. Extensive experiments demonstrate that EarSE achieves remarkable performance as increasing SiSDR by 14.61 dB and reducing the word error rate of user speech recognition by 22.45--66.41% in real-world application. EarSE not only outperforms seven baselines by 38.0% in SiSNR, 12.4% in STOI, and 20.5% in PESQ on average but also maintains practicality.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"10 9","pages":"1 - 33"},"PeriodicalIF":4.5000,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3631447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Speech enhancement is regarded as the key to the quality of digital communication and is gaining increasing attention in the research field of audio processing. In this paper, we present EarSE, the first robust, hands-free, multi-modal speech enhancement solution using commercial off-the-shelf headphones. The key idea of EarSE is a novel hardware setting---leveraging the form factor of headphones equipped with a boom microphone to establish a stable acoustic sensing field across the user's face. Furthermore, we designed a sensing methodology based on Frequency-Modulated Continuous-Wave, which is an ultrasonic modality sensitive to capture subtle facial articulatory gestures of users when speaking. Moreover, we design a fully attention-based deep neural network to self-adaptively solve the user diversity problem by introducing the Vision Transformer network. We enhance the collaboration between the speech and ultrasonic modalities using a multi-head attention mechanism and a Factorized Bilinear Pooling gate. Extensive experiments demonstrate that EarSE achieves remarkable performance as increasing SiSDR by 14.61 dB and reducing the word error rate of user speech recognition by 22.45--66.41% in real-world application. EarSE not only outperforms seven baselines by 38.0% in SiSNR, 12.4% in STOI, and 20.5% in PESQ on average but also maintains practicality.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

耳塞

语音增强被视为数字通信质量的关键，在音频处理研究领域日益受到关注。在本文中，我们介绍了 EarSE，这是首个使用商用现成耳机的稳健、免提、多模态语音增强解决方案。EarSE 的核心理念是一种新颖的硬件设置--充分利用耳机配备吊杆麦克风的外形优势，在用户面部建立一个稳定的声学感应场。此外，我们还设计了一种基于频率调制连续波的传感方法，这是一种敏感的超声波模式，可以捕捉用户说话时细微的面部发音手势。此外，我们还设计了一种完全基于注意力的深度神经网络，通过引入视觉转换器网络来自适应地解决用户多样性问题。我们利用多头注意力机制和因子化双线性池化门增强了语音和超声波模式之间的协作。广泛的实验证明，EarSE在实际应用中取得了显著的性能，将SiSDR提高了14.61 dB，并将用户语音识别的词错误率降低了22.45%-66.41%。EarSE 不仅在 SiSNR、STOI 和 PESQ 方面平均分别比七种基线方法高出 38.0%、12.4% 和 20.5%，而且保持了实用性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊