M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

arXiv - CS - Sound. Pub Date: 2024-09-14. DOI: arxiv-2409.09284

Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li


Abstract

With the goal of more natural, human-like interaction with virtual voice assistants, recent research has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. This requires the voice assistant to classify utterances as device-oriented or non-device-oriented in scenes with complex sound sources. The dual-encoder structure, which jointly models text and speech, has become the paradigm for device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs because of the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view into the network alongside the multi-modal view. Experimental results show that M$^{3}$V significantly outperforms models trained using only single or multi-modality and surpasses human judgment performance on ASR error data for the first time.
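To make the four views named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the encoders, the losses, or how the views are combined. The linear scoring heads, the cosine-based alignment score, the fusion rule, and every name below (`view_scores`, `fuse`, `alignment_score`, the weight vectors) are hypothetical and chosen only for illustration. The sketch assumes text and audio embeddings of equal dimension, and it shows one plausible use of an alignment view: trusting the text-dependent multi-modal score less when the transcript and the audio disagree, as they might after an ASR error.

```python
import math

# Illustrative sketch only. All heads, weights, and the fusion rule are
# assumptions; none of them are taken from the M^3V paper.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def alignment_score(text_emb, audio_emb):
    """Map cosine similarity of the two embeddings to [0, 1] (assumed alignment view)."""
    norm = math.sqrt(dot(text_emb, text_emb)) * math.sqrt(dot(audio_emb, audio_emb))
    if norm == 0.0:
        return 0.0
    cos = dot(text_emb, audio_emb) / norm
    return min(1.0, max(0.0, (cos + 1.0) / 2.0))

def view_scores(text_emb, audio_emb, w_text, w_audio, w_multi):
    """Return one score per view: two unimodal, one multi-modal, one alignment."""
    return {
        "text": sigmoid(dot(w_text, text_emb)),
        "audio": sigmoid(dot(w_audio, audio_emb)),
        "multimodal": sigmoid(dot(w_multi, list(text_emb) + list(audio_emb))),
        "alignment": alignment_score(text_emb, audio_emb),
    }

def fuse(scores):
    """Hypothetical fusion: lean on the audio-only view when alignment is low."""
    a = scores["alignment"]
    return a * scores["multimodal"] + (1.0 - a) * scores["audio"]

# Toy weights and embeddings, made up for the example.
w_text, w_audio, w_multi = [0.5, -0.2], [0.3, 0.8], [0.4, 0.1, -0.3, 0.6]
aligned = view_scores([1.0, 2.0], [1.0, 2.0], w_text, w_audio, w_multi)
mismatched = view_scores([1.0, 2.0], [-1.0, -2.0], w_text, w_audio, w_multi)
print(fuse(aligned), fuse(mismatched))
```

In this toy setup the text-only score is computed but not used in `fuse`; in a real multi-view model each view would more likely contribute its own training objective, which the abstract leaves unspecified.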


