Temporal Attention and Consistency Measuring for Video Question Answering

Proceedings of the 2020 International Conference on Multimodal Interaction. Pub Date: 2020-10-21. DOI: 10.1145/3382507.3418886

Lingyu Zhang, R. Radke

Citations: 3

Abstract

Social signal processing algorithms have steadily improved at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state-of-the-art accuracy on the Social-IQ dataset, which contains hundreds of videos and question/answer pairs.
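The pipeline described in the abstract (attend over words, score consistency, answer from a handful of scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention form, the similarity measure, or the final stage, so the sketch assumes dot-product attention against a fixed query vector, cosine similarity as the consistency measure, and a weighted sum in place of a learned question-answering stage. It also omits the RNN encoders entirely and treats token and video embeddings as given. All function names and parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Pool token embeddings into one vector using attention weights.

    Assumption: weights are softmax(dot(token, query)); the paper's
    temporal attention module may be defined differently.
    """
    scores = [sum(t * q for t, q in zip(tok, query)) for tok in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, token_vectors))
              for d in range(dim)]
    return pooled, weights


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(video_vec, question_vec, answer_vec):
    """Three pairwise similarities: video-question, video-answer, question-answer."""
    return [
        cosine(video_vec, question_vec),
        cosine(video_vec, answer_vec),
        cosine(question_vec, answer_vec),
    ]


def pick_answer(video_vec, question_tokens, candidate_answers, query,
                stage_weights=(1.0, 1.0, 1.0)):
    """Return (index of best candidate, per-candidate totals).

    Assumption: a fixed weighted sum stands in for the learned final stage.
    """
    q_vec, _ = attention_pool(question_tokens, query)
    totals = []
    for answer_tokens in candidate_answers:
        a_vec, _ = attention_pool(answer_tokens, query)
        scores = consistency_scores(video_vec, q_vec, a_vec)
        totals.append(sum(w * s for w, s in zip(stage_weights, scores)))
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Toy 2-dimensional embeddings, invented purely for illustration.
video = [1.0, 0.0]
query = [1.0, 0.0]
question = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[[1.0, 0.0]], [[0.0, 1.0]]]
best_index, totals = pick_answer(video, question, candidates, query)
```

Because the final stage sees only three scores per candidate rather than the full embeddings, it stays small regardless of the embedding size, which is the sense in which the abstract calls the model lightweight.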
