From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer

Expert Systems with Applications; IF 7.5; Zone 1 (Computer Science); Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2025-05-10; Epub Date: 2025-02-19; DOI: 10.1016/j.eswa.2025.126850
Wei Zhou, Weitao Jiang, Zhijie Zheng, Jianchao Li, Tao Su, Haifeng Hu
Volume 273, Article 126850. Full text: https://www.sciencedirect.com/science/article/pii/S0957417425004725
Citations: 0

Abstract

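Image captioning aims to automatically generate a natural-language description of a given image. Existing methods typically use grid-level or region-level features to encode visual information. However, extracting region features with an object detector is computationally expensive and inflexible, and region features are also criticized for lacking fine-grained details and background information. In addition, current Transformer-based captioning models focus only on the pairwise similarity of token features, which makes it difficult to fully understand the complex scene relationships in an image. To address these issues, we introduce a novel Dual Relation Transformer (DRTran) model that can be trained end-to-end. Concretely, in the encoding phase, we first adopt a clustering algorithm to generate pseudo-region features, which avoids the expensive additional annotations needed to train an object detector. Then, to combine the advantages of grid and pseudo-region features, we design a new dual relation enhancement (DRE) encoder that captures the correlation between objects across the two kinds of visual features. Furthermore, we devise a novel dynamic memory (DM) module that learns prior knowledge with external dynamic memory vectors. By adding this prior knowledge to visual relationship modeling, the model learns complex scene representations that improve caption accuracy. In the decoding stage, we design a new cross-modal attention fusion (CAF) module in the language decoder that adaptively decides the attention weights of the enhanced grid and pseudo-region features at each time step. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our DRTran model outperforms current image captioning methods.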
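The abstract does not say which clustering algorithm produces the pseudo-region features, so the following is only a minimal sketch of the general idea, assuming plain k-means over flattened grid features. The function name `pseudo_regions`, the parameters `k` and `iters`, and the feature sizes are hypothetical and are not taken from the paper.

```python
import numpy as np


def pseudo_regions(grid_feats, k, iters=10, seed=0):
    """Group N grid feature vectors into k pseudo-region features.

    Assumption: plain k-means. The paper's actual clustering
    algorithm is not named in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = grid_feats.shape[0]
    # Initialise centroids from k distinct grid cells.
    centroids = grid_feats[rng.choice(n, size=k, replace=False)].copy()

    def nearest(c):
        # Squared Euclidean distance from every cell to every centroid: (n, k).
        d = ((grid_feats[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    for _ in range(iters):
        assign = nearest(centroids)
        for j in range(k):
            members = grid_feats[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)

    # Recompute the assignment so it matches the returned centroids.
    return centroids, nearest(centroids)


# Hypothetical input: a 7x7 grid of 8-dimensional features, flattened to 49 rows.
grid = np.random.default_rng(1).normal(size=(49, 8))
regions, assign = pseudo_regions(grid, k=5)
print(regions.shape, assign.shape)  # (5, 8) (49,)
```

Each returned centroid is the mean of the grid cells assigned to it, so it acts as a detector-free stand-in for a region feature, while `assign` records which pseudo-region each grid cell belongs to.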
Source journal
Expert Systems with Applications (Engineering & Technology; Engineering: Electrical & Electronic)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.