Title: Wavoice: A Noise-resistant Multi-modal Speech Recognition System Fusing mmWave and Audio Signals
Authors: Tiantian Liu, Ming Gao, Feng Lin, Chao Wang, Zhongjie Ba, Jinsong Han, Wenyao Xu, K. Ren
Venue: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems
Published: 2021-11-15
DOI: 10.1145/3485730.3485945
Citations: 32
Abstract
With the advance of automatic speech recognition, voice user interfaces (VUIs) have gained popularity recently. Since the COVID-19 pandemic, VUIs have been increasingly preferred in online communication because they are contact-free. However, ambient noise impedes the wide deployment of VUIs, because audio-only speech recognition methods require a high signal-to-noise ratio. In this paper, we present Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice sensing modalities: millimeter-wave (mmWave) signals and audio signals from a microphone. One key contribution is that we model the inherent correlation between mmWave and audio signals. Based on this correlation, Wavoice performs real-time noise-resistant voice activity detection and targets a specific user among multiple speakers. Furthermore, we incorporate two novel modules into the neural attention mechanism for multi-modal signal fusion, resulting in accurate speech recognition. Extensive experiments verify Wavoice's effectiveness under various conditions, with a character error rate below 1% at ranges of up to 7 meters. Wavoice outperforms existing audio-only speech recognition methods, achieving lower character and word error rates. Evaluation in complex scenes further validates the robustness of Wavoice.
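The abstract's two core ideas can be illustrated with short sketches. First, correlation-based voice activity detection: the paper does not give its algorithm here, but one plausible reading is that a microphone frame is attributed to the target user only when its energy envelope correlates with the throat-vibration envelope recovered from the mmWave radar. The following is a minimal sketch under that assumption; the function name, envelope inputs, and threshold are all illustrative, not the authors' implementation.

```python
# Hedged sketch (an assumption, not the paper's algorithm): decide whether a
# frame contains the target user's speech by checking the Pearson correlation
# between the audio energy envelope and the mmWave vibration envelope.
import numpy as np


def voice_activity(audio_env: np.ndarray, mmwave_env: np.ndarray,
                   threshold: float = 0.5) -> bool:
    """Return True when the two per-frame envelopes are strongly correlated.

    A high correlation suggests the sound picked up by the microphone
    originates from the throat vibration the radar is pointed at, rather
    than from ambient noise or another speaker.
    """
    a = audio_env - audio_env.mean()
    m = mmwave_env - mmwave_env.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(m)
    if denom == 0.0:
        return False
    return float(a @ m) / denom > threshold
```

Second, attention-based multi-modal fusion. The paper's two fusion modules are not specified in the abstract, so the sketch below shows only a generic cross-attention pattern between the two modalities, assuming both have already been encoded into fixed-dimension frame embeddings. All module names and dimensions are hypothetical.

```python
# Hedged sketch of cross-modal attention fusion (not the authors' modules):
# noisy audio frames query the mmWave stream, which is largely immune to
# airborne acoustic noise, and borrow evidence from it.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, mmwave: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim) encoded microphone frames
        # mmwave: (batch, T_m, dim) encoded mmWave vibration frames
        attended, _ = self.cross_attn(query=audio, key=mmwave, value=mmwave)
        # Residual connection keeps the audio stream intact when the
        # mmWave channel carries little extra information.
        return self.norm(audio + attended)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    audio = torch.randn(2, 100, 256)    # e.g. 100 audio frames
    mmwave = torch.randn(2, 80, 256)    # e.g. 80 mmWave frames
    print(fusion(audio, mmwave).shape)  # torch.Size([2, 100, 256])
```

Using the audio stream as the query keeps the fused output aligned to the audio frame rate, which is convenient for feeding a downstream speech recognizer; whether Wavoice makes the same choice is not stated in the abstract.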