HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering

Fei Liu, Jing Liu, Weining Wang, Hanqing Lu
{"title":"HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering","authors":"Fei Liu, Jing Liu, Weining Wang, Hanqing Lu","doi":"10.1109/ICCV48922.2021.00172","DOIUrl":null,"url":null,"abstract":"Relational reasoning is at the heart of video question answering. However, existing approaches suffer from several common limitations: (1) they only focus on either object-level or frame-level relational reasoning, and fail to integrate the both; and (2) they neglect to leverage semantic knowledge for relational reasoning. In this work, we propose a Hierarchical VisuAl-Semantic RelatIonal Reasoning (HAIR) framework to address these limitations. Specifically, we present a novel graph memory mechanism to perform relational reasoning, and further develop two types of graph memory: a) visual graph memory that leverages visual information of video for relational reasoning; b) semantic graph memory that is specifically designed to explicitly leverage semantic knowledge contained in the classes and attributes of video objects, and perform relational reasoning in the semantic space. Taking advantage of both graph memory mechanisms, we build a hierarchical framework to enable visual-semantic relational reasoning from object level to frame level. Experiments on four challenging benchmark datasets show that the proposed framework leads to state-of-the-art performance, with fewer parameters and faster inference speed. Besides, our approach also shows superior performance on other video+language task.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"7 1","pages":"1678-1687"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV48922.2021.00172","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

Abstract

Relational reasoning is at the heart of video question answering. However, existing approaches suffer from several common limitations: (1) they only focus on either object-level or frame-level relational reasoning, and fail to integrate the both; and (2) they neglect to leverage semantic knowledge for relational reasoning. In this work, we propose a Hierarchical VisuAl-Semantic RelatIonal Reasoning (HAIR) framework to address these limitations. Specifically, we present a novel graph memory mechanism to perform relational reasoning, and further develop two types of graph memory: a) visual graph memory that leverages visual information of video for relational reasoning; b) semantic graph memory that is specifically designed to explicitly leverage semantic knowledge contained in the classes and attributes of video objects, and perform relational reasoning in the semantic space. Taking advantage of both graph memory mechanisms, we build a hierarchical framework to enable visual-semantic relational reasoning from object level to frame level. Experiments on four challenging benchmark datasets show that the proposed framework leads to state-of-the-art performance, with fewer parameters and faster inference speed. Besides, our approach also shows superior performance on other video+language task.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
毛发:视频问答的层次视觉语义关系推理
关系推理是视频问答的核心。然而,现有的方法存在几个常见的局限性:(1)它们只关注对象级或框架级的关系推理,而不能将两者集成;(2)他们忽略了利用语义知识进行关系推理。在这项工作中,我们提出了一个层次视觉语义关系推理(HAIR)框架来解决这些限制。具体来说,我们提出了一种新的图记忆机制来执行关系推理,并进一步发展了两种类型的图记忆:a)利用视频的视觉信息进行关系推理的视觉图记忆;B)语义图内存,专门设计用于显式地利用包含在视频对象的类和属性中的语义知识,并在语义空间中执行关系推理。利用这两种图形记忆机制,我们构建了一个分层框架,以实现从对象级到框架级的视觉语义关系推理。在四个具有挑战性的基准数据集上的实验表明,所提出的框架具有最先进的性能,参数更少,推理速度更快。此外,我们的方法在其他视频+语言任务上也表现出优异的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Naturalistic Physical Adversarial Patch for Object Detectors Polarimetric Helmholtz Stereopsis Deep Transport Network for Unsupervised Video Object Segmentation Real-time Vanishing Point Detector Integrating Under-parameterized RANSAC and Hough Transform Adaptive Label Noise Cleaning with Meta-Supervision for Deep Face Recognition
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1