VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11183

Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, J. Dang

Citations: 8

Abstract

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
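The two-stage data flow described in the abstract can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: the class name `TwoStageExtractor`, the type aliases, and the `toy_stage1` / `toy_stage2` placeholders are all hypothetical, and the sketch assumes the second stage receives only the pre-extracted speech and the self-enrolled contextual cue, which is all the abstract states. Real stages would be trained neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data types, chosen only for illustration.
Waveform = List[float]         # time-domain samples
VisualCue = List[List[float]]  # one feature vector per video frame
PhoneSeq = List[int]           # estimated phonetic sequence as integer ids


@dataclass
class TwoStageExtractor:
    """Wires two stages together in the order the abstract describes."""
    # Stage 1: (mixture, visual cue) -> (pre-extracted speech, phonetic estimate)
    stage1: Callable[[Waveform, VisualCue], Tuple[Waveform, PhoneSeq]]
    # Stage 2: (pre-extracted speech, contextual cue) -> refined speech
    stage2: Callable[[Waveform, PhoneSeq], Waveform]

    def extract(self, mixture: Waveform, visual: VisualCue):
        # Stage 1: pre-extract the target and estimate its phonetic sequence.
        pre_extracted, phones = self.stage1(mixture, visual)
        # Stage 2: refine using the contextual cue estimated in stage 1
        # ("self-enrolled": no separately recorded reference is needed).
        refined = self.stage2(pre_extracted, phones)
        return pre_extracted, phones, refined


def toy_stage1(mixture: Waveform, visual: VisualCue) -> Tuple[Waveform, PhoneSeq]:
    # Placeholder: halve the mixture and take the arg-max of each visual frame.
    pre = [0.5 * s for s in mixture]
    phones = [max(range(len(f)), key=f.__getitem__) for f in visual]
    return pre, phones


def toy_stage2(pre: Waveform, phones: PhoneSeq) -> Waveform:
    # Placeholder: apply a gain that depends on the contextual cue.
    gain = 1.0 + 0.1 * len(set(phones))
    return [gain * s for s in pre]
```

With the toy placeholders, `TwoStageExtractor(toy_stage1, toy_stage2).extract(mixture, visual)` returns the intermediate and refined outputs together, which makes the dependency of the second stage on the first stage's phonetic estimate explicit.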



