From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer

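Expert Systems with Applications, Volume 273, Article 126850. IF 7.5; Zone 1 (Computer Science); JCR Q1 (Computer Science, Artificial Intelligence). Pub Date: 2025-05-10. Epub Date: 2025-02-19. DOI: 10.1016/j.eswa.2025.126850. Article page: https://www.sciencedirect.com/science/article/pii/S0957417425004725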

Wei Zhou, Weitao Jiang, Zhijie Zheng, Jianchao Li, Tao Su, Haifeng Hu

Citations: 0

From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer

Image captioning aims to automatically generate a natural-language description of a given image. Existing methods typically encode visual information with grid-level or region-level features. However, extracting region features with an object detector is computationally expensive and inflexible, and region features are criticized for lacking fine-grained details and background information. Moreover, current Transformer-based captioning models focus only on the pairwise similarity of token features, which makes it difficult to fully capture the complex scene relationships in images. To address these issues, we introduce a novel Dual Relation Transformer (DRTran) model that can be trained end-to-end. Concretely, in the encoding phase, we first adopt a clustering algorithm to generate pseudo-region features, which avoids the expensive additional annotations needed to train an object detector. Then, to combine the advantages of grid and pseudo-region features, we design a new dual relation enhancement (DRE) encoder that captures the correlations between objects across the two types of visual features. Furthermore, we devise a novel dynamic memory (DM) module that learns prior knowledge with external dynamic memory vectors. By adding this prior knowledge to visual relationship modeling, the model learns complex scene representations that improve caption accuracy. In the decoding stage, we design a new cross-modal attention fusion (CAF) module in the language decoder that adaptively decides the attention weights of the enhanced grid and pseudo-region features at each time step. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our DRTran model outperforms current image captioning methods.
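The abstract names three ingredients without giving their details: clustering grid features into pseudo-regions, attention enriched with external memory vectors, and an adaptive fusion of the two feature streams at each decoding step. The NumPy sketch below is only an illustration of how such pieces could fit together, not the authors' DRTran implementation. Everything in it is an assumption: plain k-means as the clustering algorithm, memory slots appended to the attention keys and values as a stand-in for the DM module, a sigmoid gate as a stand-in for the CAF module, and all function names, shapes, and random weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pseudo_regions(grid, k, iters=10, seed=0):
    """Cluster N grid vectors (N, d) into k pseudo-region vectors (k, d).

    Assumption: plain k-means; the abstract only says "a clustering algorithm".
    """
    rng = np.random.default_rng(seed)
    centers = grid[rng.choice(len(grid), size=k, replace=False)]  # copy of k rows
    for _ in range(iters):
        # Squared Euclidean distance of every grid vector to every center: (N, k).
        dist = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = grid[assign == j]
            if len(members):  # keep the old center if a cluster becomes empty
                centers[j] = members.mean(axis=0)
    return centers


def attend(q, k, v):
    # Scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def memory_attention(q, k, v, mem_k, mem_v):
    """Attention with memory slots appended to keys and values.

    Assumption: a generic memory-slot mechanism, not the paper's DM module.
    """
    return attend(q, np.concatenate([k, mem_k]), np.concatenate([v, mem_v]))


def gated_fusion(word, grid_ctx, region_ctx, w_gate):
    """Mix two attended contexts with a sigmoid gate driven by the current word.

    Assumption: a simple gate standing in for the CAF module.
    """
    gate_in = np.concatenate([word, grid_ctx, region_ctx], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))
    return g * grid_ctx + (1.0 - g) * region_ctx


rng = np.random.default_rng(0)
d = 8
grid = rng.normal(size=(49, d))        # hypothetical 7x7 grid of d-dim features
regions = pseudo_regions(grid, k=5)    # (5, d) pseudo-region features

# Hypothetical memory slots; in a trained model these would be learned.
mem_k = rng.normal(size=(4, d))
mem_v = rng.normal(size=(4, d))

# Each stream attends to the other one, plus the shared memory.
grid_enh = memory_attention(grid, regions, regions, mem_k, mem_v)    # (49, d)
region_enh = memory_attention(regions, grid, grid, mem_k, mem_v)     # (5, d)

# One decoding step: the current word state queries both enhanced streams.
word = rng.normal(size=(1, d))
grid_ctx = attend(word, grid_enh, grid_enh)
region_ctx = attend(word, region_enh, region_enh)
w_gate = rng.normal(size=(3 * d, d)) * 0.1
fused = gated_fusion(word, grid_ctx, region_ctx, w_gate)             # (1, d)
print(regions.shape, grid_enh.shape, fused.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the grid and pseudo-region contexts, so neither stream is ever discarded outright in this sketch.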

Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.