Entity Dependency Learning Network With Relation Prediction for Video Visual Relation Detection
Guoguang Zhang; Yepeng Tang; Chunjie Zhang; Xiaolong Zheng; Yao Zhao
DOI: 10.1109/TCSVT.2024.3437437
IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 12425-12436
Published: 2024-08-02 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10621651/
Citations: 0
Abstract
Video Visual Relation Detection (VidVRD) is a pivotal task in video analysis. It involves detecting object trajectories in videos, predicting the potential dynamic relations between these trajectories, and ultimately representing these relationships as <subject, predicate, object> triplets. Correct relation prediction is vital for VidVRD. Existing methods mostly adopt a simple fusion of the visual and language features of entity trajectories as the feature representation for relation predicates. However, these methods do not take into account the dependency information between the relation predicate and the subject and object within the triplet. To address this issue, we propose the entity dependency learning network (EDLN), which captures the dependency information between relation predicates and subjects, objects, and subject-object pairs, and adaptively integrates this dependency information into the feature representation of relation predicates. Additionally, to effectively model the relations between various pairs of object entities, we introduce a fully convolutional encoding approach in the context encoding phase for relation predicate features, as a substitute for the self-attention mechanism in the Transformer. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed EDLN.
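The abstract names two mechanisms without giving their form: adaptively integrating subject/object/pair dependency features into the predicate representation, and replacing Transformer self-attention with fully convolutional context encoding. The NumPy sketch below is a minimal illustration of what such components *could* look like, not the authors' implementation: `adaptive_fuse`, `conv_context_encode`, the sigmoid gating, the elementwise pair feature, and the smoothing kernel are all hypothetical choices made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_fuse(pred, subj, obj, W_gate):
    """Gate-weighted fusion of a predicate feature with entity-dependency
    features (a hypothetical form of 'adaptive integration')."""
    pair = subj * obj                      # toy subject-object pair feature
    deps = np.stack([subj, obj, pair])     # (3, d): the three dependency sources
    gates = 1 / (1 + np.exp(-(deps @ W_gate)))  # per-dependency sigmoid gates, (3, d)
    return pred + (gates * deps).sum(axis=0)    # residual integration into pred

def conv_context_encode(X, kernel):
    """1-D convolution over a sequence of predicate features, standing in
    for self-attention during context encoding."""
    k = kernel.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))   # zero-pad along the sequence axis
    out = np.zeros_like(X)
    for t in range(X.shape[0]):
        window = Xp[t:t + k]               # (k, d) local context window
        out[t] = (window * kernel[:, None]).sum(axis=0)
    return out

d = 8
subj, obj, pred = rng.normal(size=(3, d))
W_gate = rng.normal(size=(d, d)) * 0.1
fused = adaptive_fuse(pred, subj, obj, W_gate)

seq = rng.normal(size=(5, d))              # 5 predicate feature vectors
kernel = np.array([0.25, 0.5, 0.25])       # size-3 smoothing kernel (sums to 1)
ctx = conv_context_encode(seq, kernel)
print(fused.shape, ctx.shape)              # (8,) (5, 8)
```

The convolutional encoder trades the global receptive field of self-attention for a fixed local window; stacking several such layers (as a fully convolutional encoder would) grows the effective context range.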
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.