mSilent: Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar

Shangcui Zeng, Hao Wan, Shuyu Shi, Wei Wang
{"title":"mSilent: Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar","authors":"Shangcui Zeng, Hao Wan, Shuyu Shi, Wei Wang","doi":"10.1145/3580838","DOIUrl":null,"url":null,"abstract":"Silent speech recognition (SSR) allows users to speak to the device without making a sound, avoiding being overheard or disturbing others. Compared to the video-based approach, wireless signal-based SSR can work when the user is wearing a mask and has fewer privacy concerns. However, previous wireless-based systems are still far from well-studied, e.g. they are only evaluated in corpus with highly limited size, making them only feasible for interaction with dozens of deterministic commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that can work in the general corpus containing thousands of daily conversation sentences. With the strong recognition capability, mSilent not only supports the more complex interaction with assistants, but also enables more general applications in daily life such as communication and input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network that consists of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general corpus dataset of 1,000 daily conversation sentences that contains 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5m, which is comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios of text entry and in-car assistant, and the less than 6% average WER demonstrates the potential of mSilent in general daily applications. CCS Concepts: • Human-centered computing → Ubiquitous and mobile computing systems and tools ;","PeriodicalId":20463,"journal":{"name":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.","volume":"3 1","pages":"39:1-39:28"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3580838","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Silent speech recognition (SSR) allows users to speak to a device without making a sound, avoiding being overheard or disturbing others. Compared to video-based approaches, wireless-signal-based SSR works even when the user is wearing a mask and raises fewer privacy concerns. However, previous wireless-based systems remain far from well-studied; for example, they have only been evaluated on corpora of highly limited size, making them feasible only for interaction with dozens of predetermined commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that works on a general corpus containing thousands of daily conversation sentences. With this strong recognition capability, mSilent not only supports more complex interactions with assistants, but also enables more general daily-life applications such as communication and text input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network consisting of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general corpus dataset of 1,000 daily conversation sentences comprising 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5 m, comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios, text entry and in-car assistance, where the sub-6% average WER demonstrates mSilent's potential in general daily applications.

CCS Concepts: • Human-centered computing → Ubiquitous and mobile computing systems and tools.
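The abstract names the multi-scale detrended spectrogram (MSDS) but does not define it; the sketch below is only one plausible reading, assuming the pipeline detrends the complex radar baseband from one range bin and stacks STFT magnitudes computed at several window scales. The function name, window lengths, and detrending choice are all hypothetical, not the authors' implementation.

```python
import numpy as np
from scipy.signal import detrend, stft

def multi_scale_detrended_spectrogram(iq, fs, window_lengths=(64, 128, 256)):
    """Hypothetical MSDS sketch (not the paper's implementation).

    iq: 1-D complex baseband samples from one radar range bin.
    fs: sample rate in Hz.
    Returns one magnitude spectrogram per time-frequency scale.
    """
    # Detrend real and imaginary parts separately to suppress slow
    # drift and static-clutter leakage before the STFT.
    sig = detrend(iq.real, type="linear") + 1j * detrend(iq.imag, type="linear")
    spectrograms = []
    for win in window_lengths:
        # Complex input => keep the two-sided (positive/negative Doppler) spectrum.
        _, _, Z = stft(sig, fs=fs, nperseg=win, noverlap=win // 2,
                       return_onesided=False)
        spectrograms.append(np.abs(Z))
    return spectrograms
```

Shorter windows trade frequency resolution for time resolution, so stacking several scales lets the downstream network see both fast lip/tongue transitions and slower jaw motion.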
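The network details live in the paper body rather than the abstract; the PyTorch sketch below only mirrors the stated shape: parallel convolutional branches, one per MSDS scale, feeding a Transformer encoder-decoder that emits token logits. Every layer size, the vocabulary, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MSilentSketch(nn.Module):
    """Illustrative stand-in for the described architecture, not the
    authors' model: one conv branch per spectrogram scale, then a
    Transformer encoder-decoder emitting token logits."""

    def __init__(self, n_scales=3, n_freq=128, d_model=256, vocab_size=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                # Pool frequency to a common size; keep the time axis.
                nn.AdaptiveAvgPool2d((n_freq // 4, None)),
            )
            for _ in range(n_scales)
        ])
        self.proj = nn.Linear(n_scales * 32 * (n_freq // 4), d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, scales, tgt_tokens):
        # scales: list of (batch, 1, n_freq, n_frames) tensors, assumed
        # time-aligned across scales for simplicity.
        feats = [branch(x) for branch, x in zip(self.branches, scales)]
        x = torch.cat(feats, dim=1)                     # merge branch channels
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, feat)
        src = self.proj(x)
        tgt = self.embed(tgt_tokens)                    # teacher forcing
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt.device)          # causal decoder mask
        dec = self.seq2seq(src, tgt, tgt_mask=mask)
        return self.out(dec)                            # per-token logits
```

At inference the decoder would run autoregressively (greedy or beam search) rather than with teacher forcing.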
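The reported metric, word error rate, is standard and independent of this paper: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A self-contained implementation:

```python
def word_error_rate(ref, hyp):
    """Standard WER: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

For example, reference "turn on the light" against hypothesis "turn the lights" incurs one deletion and one substitution over four reference words, a WER of 0.5.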