A conditional random field approach for audio-visual people diarization

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2014-05-04. DOI: 10.1109/ICASSP.2014.6853569

P. Gay, E. Khoury, S. Meignier, J. Odobez, P. Deléglise

Citations: 16

Abstract

We investigate the problem of audio-visual (AV) person diarization in broadcast data, that is, automatically associating the faces and voices of people and determining when they appear or speak in the video. The contributions are twofold. First, we formulate the problem within a novel conditional random field (CRF) framework that simultaneously performs the AV association of voice and face clusters to build AV person models and the joint segmentation of the audio and visual streams, using a set of AV cues and their association strength. Second, for this AV association strength we use a score that relies not only on lip activity but also on contextual visual information (face size, position, number of detected faces, ...), which leads to more reliable association measures. Experiments on 6 hours of broadcast data show that our framework improves AV person diarization, especially for speaker segments erroneously labeled in the mono-modal case.
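
To make the idea of an AV association strength more concrete, the following is a minimal sketch and not the authors' CRF. It assumes a hypothetical logistic score over lip activity and contextual cues, followed by an exhaustive one-to-one matching of voice clusters to face clusters. The cue names, weights, bias, and the one-to-one constraint are all illustrative assumptions that do not come from the paper.

```python
import math
from itertools import permutations

# Hypothetical weights and bias (assumptions, not values from the paper).
# num_faces gets a negative weight: more faces on screen makes the
# association between the current voice and any one face less certain.
WEIGHTS = {"lip_activity": 2.0, "face_size": 1.0, "centrality": 0.5, "num_faces": -0.7}
BIAS = -1.0


def association_strength(cues):
    """Map the cues of one (speech segment, face track) co-occurrence
    to a score in (0, 1) with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * cues[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def best_association(strength):
    """Exhaustively search the one-to-one mapping of voice clusters to face
    clusters that maximizes the summed association strength.

    strength[v][f] is the accumulated score between voice cluster v and
    face cluster f. Assumes at least as many face clusters as voice
    clusters; otherwise no mapping exists and (None, -inf) is returned.
    """
    voices = list(strength)
    faces = sorted({f for row in strength.values() for f in row})
    best, best_score = None, float("-inf")
    for perm in permutations(faces, len(voices)):
        score = sum(strength[v].get(f, 0.0) for v, f in zip(voices, perm))
        if score > best_score:
            best, best_score = dict(zip(voices, perm)), score
    return best, best_score


if __name__ == "__main__":
    talking = {"lip_activity": 1.0, "face_size": 0.8, "centrality": 0.9, "num_faces": 1}
    silent = {"lip_activity": 0.1, "face_size": 0.2, "centrality": 0.3, "num_faces": 3}
    print(association_strength(talking), association_strength(silent))

    table = {
        "spk1": {"faceA": 0.9, "faceB": 0.2},
        "spk2": {"faceA": 0.4, "faceB": 0.7},
    }
    print(best_association(table))
```

Exhaustive matching is only practical for a handful of clusters. The framework described in the abstract instead performs association and segmentation jointly, which this sketch does not attempt.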


