M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li
arXiv - CS - Sound, published 2024-09-14, DOI: arxiv-2409.09284 (https://doi.org/arxiv-2409.09284)
Abstract
To enable more natural, human-like interaction with virtual voice
assistants, recent research has focused on a full-duplex interaction mode
that does not rely on repeated wake-up words. In scenes with complex sound
sources, this requires the voice assistant to classify each utterance as
device-directed or non-device-directed. The dual-encoder structure, which
jointly models text and speech, has become the standard paradigm for
device-directed speech detection. In practice, however, these models often
produce incorrect predictions for unaligned input pairs because of
unavoidable automatic speech recognition (ASR) errors. To address this
challenge, we propose M$^{3}$V, a multi-modal multi-view approach for
device-directed speech detection. We frame the problem as a multi-view
learning task that introduces unimodal views and a text-audio alignment view
into the network alongside the multi-modal view. Experimental results show
that M$^{3}$V significantly outperforms models trained on a single modality
or on multi-modal input alone, and surpasses human judgment performance on
ASR error data for the first time.
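
The abstract names four views (text-only, audio-only, multi-modal, and text-audio alignment) but does not say how they are trained together. As a rough illustration only, the sketch below assumes each view emits a single logit and that the per-view binary cross-entropy losses are combined as a weighted sum. The function names, the view keys, the `aligned` label, and the weighting scheme are all hypothetical and are not taken from the paper.

```python
import math


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_view_loss(view_logits, directed, aligned, weights=None):
    """Weighted sum of per-view losses (an assumed combination, not the paper's).

    view_logits: dict with hypothetical keys "text", "audio", "multimodal",
                 and "alignment", each mapping to one logit.
    directed:    1 if the utterance is device-directed, else 0.
    aligned:     1 if the ASR text matches the audio, else 0 (assumed label).
    weights:     optional per-view weights; defaults to 1.0 for every view.
    """
    weights = weights or {}
    total = 0.0
    for view, logit in view_logits.items():
        # The alignment view predicts text-audio agreement; every other view
        # predicts whether the utterance is device-directed.
        target = aligned if view == "alignment" else directed
        total += weights.get(view, 1.0) * bce_with_logit(logit, target)
    return total


# Example: one utterance whose ASR transcript does not match the audio.
logits = {"text": -0.4, "audio": 2.1, "multimodal": 1.3, "alignment": -1.7}
loss = multi_view_loss(logits, directed=1, aligned=0)
```

Under this assumption, the unimodal views give the model a training signal that does not depend on the other modality, which is one plausible way to stay robust when the ASR text is wrong.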