Fully exploring object relation interaction and hidden state attention for video captioning

IF 9.1 · Zone 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pattern Recognition · Pub Date: 2025-03-01 · Epub Date: 2024-10-28 · DOI: 10.1016/j.patcog.2024.111138

Feiniu Yuan, Sipei Gu, Xiangfen Zhang, Zhijun Fang

Pattern Recognition, vol. 159, Article 111138. https://www.sciencedirect.com/science/article/pii/S0031320324008896

Citations: 0

Abstract

Video Captioning (VC) is the challenging task of automatically generating natural language sentences that describe video content. Because a video often contains multiple objects, it is crucial to identify them and model the relationships between them. Previous models usually adopt Graph Convolutional Networks (GCN) to infer relational information via object nodes, but their relational reasoning suffers from uncertainty and over-smoothing. To tackle these issues, we propose a Knowledge Graph based Video Captioning Network (KG-VCN) that fully explores object relation interaction, hidden states, and attention enhancement. In the encoding stage, we present a Graph and Convolution Hybrid Encoder (GCHE), which uses an object detector to find visual objects with bounding boxes for both the Knowledge Graph (KG) and the Convolutional Neural Network (CNN). To model intrinsic relations between detected objects, we propose a knowledge graph based Object Relation Graph Interaction (ORGI) module. In ORGI, we design triplets (head, relation, tail) to efficiently mine object relations, and create a global node that enables adequate information flow among all graph nodes so that relations are less likely to be missed. To produce accurate and rich captions, we propose a hidden State and Attention Enhanced Decoder (SAED) that integrates hidden states with dynamically updated attention features. SAED accepts both relational and visual features, adopts Long Short-Term Memory (LSTM) to produce hidden states, and dynamically updates attention features. Unlike existing methods, we concatenate state and attention features to predict the next word sequentially. To demonstrate the effectiveness of our model, we conduct experiments on three well-known datasets (MSVD, MSR-VTT, VaTeX), where it significantly outperforms existing state-of-the-art models.
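The role of the global node described in the abstract can be illustrated with a small sketch. This is not the authors' ORGI implementation, which presumably operates on learned features. It is a minimal, assumption-laden illustration in which the object names, relation labels, and the helper functions `build_relation_graph` and `hop_distance` are all hypothetical. It only shows why linking one global node to every object node lets information reach objects that no triplet connects.

```python
from collections import deque


def build_relation_graph(objects, triplets, global_node="<global>"):
    """Build an undirected adjacency map from (head, relation, tail) triplets.

    If global_node is not None, one extra node is linked to every object node.
    All names here are hypothetical and used for illustration only.
    """
    adjacency = {name: set() for name in objects}
    for head, _relation, tail in triplets:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    if global_node is not None:
        adjacency[global_node] = set()
        for name in objects:
            adjacency[name].add(global_node)
            adjacency[global_node].add(name)
    return adjacency


def hop_distance(adjacency, source, target):
    """Breadth-first search: edges on the shortest path, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy scene: "tree" is detected but appears in no triplet.
objects = ["person", "dog", "ball", "tree"]
triplets = [("person", "throws", "ball"), ("dog", "chases", "ball")]

graph = build_relation_graph(objects, triplets)
graph_without_global = build_relation_graph(objects, triplets, global_node=None)

print(hop_distance(graph_without_global, "person", "tree"))  # unreachable
print(hop_distance(graph, "person", "tree"))                 # reachable via the global node
```

In this toy graph, "tree" is isolated when only triplet edges exist, whereas with the global node every pair of objects is at most two hops apart. That is one plausible reading of "adequate information flow among all graph nodes".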




Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.