Few-Shot Referring Relationships in Videos

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pub Date : 2023-06-01 DOI:10.1109/CVPR52729.2023.00227

Yogesh Kumar, Anand Mishra

{"title":"Few-Shot Referring Relationships in Videos","authors":"Yogesh Kumar, Anand Mishra","doi":"10.1109/CVPR52729.2023.00227","DOIUrl":null,"url":null,"abstract":"Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization or subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52729.2023.00227","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization or subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

视频中的少量引用关系

解释视觉关系是全面理解视频的一个核心方面。给定一个查询视觉关系和一个测试视频，我们的目标是定位通过谓词连接的主题和对象。考虑到现代视觉语言理解能力，只要有大规模的带注释的训练样例可用，解决这个问题是可以实现的。然而，对主语、宾语和谓语的每一个组合进行注释既麻烦又昂贵，而且可能不可行。因此，需要一种模型，它可以学习在空间和时间上定位通过看不见的谓词连接的主题和对象，只使用少数共享公共谓词的支持集视频。我们首次解决了这个具有挑战性的问题，即视频中的少镜头引用关系。为了达到这个目的，我们把这个问题看作是在t部随机场上定义的目标函数的最小化。这里，随机场的顶点对应于主体和客体的候选边界框，T表示测试视频中的帧数。该目标函数由帧级和视觉关系相似势组成。为了学习这些潜力，我们使用了一个关系网络，该网络将查询条件的翻译关系嵌入作为输入，并以情景方式使用支持集视频进行元训练。进一步，利用基于信念传播的随机场信息最小化目标函数，获得主客体轨迹的时空定位。我们使用两个公共基准(即ImageNet-VidVRD和VidOR)进行了广泛的实验，并将所提出的方法与竞争性基准进行比较，以评估其有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

自引率

0.00%

发文量

期刊最新文献

L-CoIns: Language-based Colorization With Instance Awareness Neural Texture Synthesis with Guided Correspondence LOGO: A Long-Form Video Dataset for Group Action Quality Assessment ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer Target-referenced Reactive Grasping for Dynamic Objects