Few-Shot Referring Relationships in Videos

Yogesh Kumar, Anand Mishra
{"title":"Few-Shot Referring Relationships in Videos","authors":"Yogesh Kumar, Anand Mishra","doi":"10.1109/CVPR52729.2023.00227","DOIUrl":null,"url":null,"abstract":"Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization or subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52729.2023.00227","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as ⟨subject, predicate, object⟩ and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos, for the first time. To this end, we pose the problem as the minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embeddings as input and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using belief propagation-based message passing on the random field to obtain the spatiotemporal localization of subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.
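
The following is a minimal, illustrative sketch of the inference step described in the abstract, not the authors' implementation. It runs min-sum belief propagation (equivalent to a Viterbi-style dynamic program) under the simplifying assumption that the T-partite field is chain-structured, i.e., only consecutive frames are connected. The cost tables frame_cost and pair_cost stand in for the frame-level and visual relationship similarity potentials, which in the paper are produced by a meta-trained relation network that is not reproduced here; the function name and toy inputs below are hypothetical.

    import numpy as np

    def min_sum_chain_bp(frame_cost, pair_cost):
        """Minimize sum_t frame_cost[t][i_t] + sum_t pair_cost[t][i_t][i_{t+1}]
        over one candidate index per frame (one subject-object box pair per frame).

        frame_cost: list of T 1-D arrays; frame_cost[t][i] is the unary cost of candidate i in frame t
        pair_cost:  list of T-1 2-D arrays; pair_cost[t][i][j] is the pairwise cost between
                    candidate i in frame t and candidate j in frame t+1
        Returns the chosen candidate index per frame, i.e., the trajectory.
        """
        T = len(frame_cost)
        # Forward pass: messages[t][j] = best cost over frames 0..t of any assignment ending in candidate j
        messages = [np.asarray(frame_cost[0], dtype=float)]
        backptr = []
        for t in range(1, T):
            # total[i, j] = best cost ending in i at frame t-1, plus transition i->j, plus unary cost of j
            total = (messages[-1][:, None]
                     + np.asarray(pair_cost[t - 1], dtype=float)
                     + np.asarray(frame_cost[t], dtype=float)[None, :])
            backptr.append(total.argmin(axis=0))
            messages.append(total.min(axis=0))
        # Backward pass: trace back the minimizing trajectory
        path = [int(messages[-1].argmin())]
        for t in range(T - 2, -1, -1):
            path.append(int(backptr[t][path[-1]]))
        return path[::-1]

    # Toy usage: T = 4 frames, 5 candidate subject-object box pairs per frame, random costs
    rng = np.random.default_rng(0)
    frame_cost = [rng.random(5) for _ in range(4)]
    pair_cost = [rng.random((5, 5)) for _ in range(3)]
    print(min_sum_chain_bp(frame_cost, pair_cost))

On a chain, min-sum message passing is exact, which is why the sketch reduces to a dynamic program; on a denser T-partite topology the paper's belief propagation would instead iterate messages until convergence.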