Environmentally robust audio-visual speaker identification
Lea Schonherr, Dennis Orth, M. Heckmann, D. Kolossa
2016 IEEE Spoken Language Technology Workshop (SLT), December 2016
DOI: 10.1109/SLT.2016.7846282
Citations: 9
Abstract
To improve the accuracy of audio-visual speaker identification, we propose a new approach that combines the two modalities optimally at the score level. We use the i-vector method for acoustic speaker recognition and local binary patterns (LBP) for visual speaker recognition. For the input data of both modalities, multiple confidence measures are used to compute an optimal weight for the fusion. To this end, oracle weights are chosen so as to maximize the difference between the score of the genuine speaker and the best competing score. Based on these oracle weights, a mapping function for weight estimation is learned. To test the approach, various combinations of noise levels in the acoustic and visual data are considered. We show that the weighted multimodal identification is far less affected by noise or distortions in the acoustic or visual observations than an unweighted combination.
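The oracle-weight idea described above can be sketched as follows. This is a minimal illustration with made-up scores and a simple grid search: it assumes per-speaker score vectors from the two modalities and a linear score-level fusion `w * acoustic + (1 - w) * visual`; the paper's confidence measures and the learned mapping function for weight estimation are not reproduced here.

```python
import numpy as np

def oracle_weight(acoustic_scores, visual_scores, genuine_idx, grid_size=101):
    """Pick the fusion weight that maximizes the margin between the
    genuine speaker's fused score and the best competing fused score.

    Each array holds one score per enrolled speaker (hypothetical toy
    setup; in the paper these would come from i-vector and LBP scoring)."""
    a = np.asarray(acoustic_scores, dtype=float)
    v = np.asarray(visual_scores, dtype=float)
    best_w, best_margin = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, grid_size):
        fused = w * a + (1.0 - w) * v                 # score-level fusion
        competitors = np.delete(fused, genuine_idx)   # all other speakers
        margin = fused[genuine_idx] - competitors.max()
        if margin > best_margin:
            best_w, best_margin = w, margin
    return best_w, best_margin

# Toy example: speaker 0 is genuine. The acoustic scores are barely
# separable (as if recorded in noise), while the visual scores separate
# clearly, so the oracle weight should lean heavily on the visual modality.
a_scores = [0.40, 0.38, 0.35]   # noisy acoustic scores
v_scores = [0.90, 0.20, 0.10]   # clean visual scores
w, margin = oracle_weight(a_scores, v_scores, genuine_idx=0)
```

In this example the grid search settles on `w = 0.0` (pure visual fusion), since the visual margin dominates; with clean audio and a degraded video stream the weight would move toward 1.0 instead, which is the behavior the learned weight-estimation function is meant to approximate from the confidence measures.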