Wei Zhou, Weitao Jiang, Zhijie Zheng, Jianchao Li, Tao Su, Haifeng Hu
Journal: Expert Systems with Applications, Volume 273, Article 126850
Publication date: 2025-05-10 (Epub 2025-02-19)
DOI: 10.1016/j.eswa.2025.126850
URL: https://www.sciencedirect.com/science/article/pii/S0957417425004725
Impact factor: 7.5; JCR: Q1 (Computer Science, Artificial Intelligence)
From grids to pseudo-regions: Dynamic memory augmented image captioning with dual relation transformer
Image captioning aims to automatically generate a natural-language description of a given image. Existing methods typically exploit grid-level or region-level features to encode visual information. However, extracting region features with an object detector is computationally expensive and inflexible, and region features are criticized for lacking fine-grained details and background information. In addition, current Transformer-based captioning models focus only on the pairwise similarity of token features, which makes it difficult to fully capture the complex scene relationships in images. To address these issues, we introduce a novel Dual Relation Transformer (DRTran) model that can be trained end-to-end. Concretely, in the encoding phase, we first adopt a clustering algorithm to generate pseudo-region features, which avoids the expensive additional annotations needed to train an object detector. Then, to combine the advantages of grid and pseudo-region features, we design a new dual relation enhancement (DRE) encoder that captures the correlations between objects across the two kinds of visual features. Furthermore, we devise a novel dynamic memory (DM) module that learns prior knowledge through external dynamic memory vectors. By adding this prior knowledge to visual relationship modeling, the model learns complex scene representations that improve caption accuracy. In the decoding stage, we design a new cross-modal attention fusion (CAF) module in the language decoder that adaptively decides the attention weights of the enhanced grid and pseudo-region features at each time step. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our DRTran model performs better than current image captioning methods.
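To make the pseudo-region step more concrete, here is a minimal sketch of how grid features might be clustered into pseudo-region features. The abstract does not name the clustering algorithm, so this sketch assumes plain k-means with mean pooling, purely for illustration. The names `pseudo_regions`, `k`, `iters`, and `seed` are hypothetical and do not come from the paper, and the grid features are stand-in lists of floats rather than real backbone outputs.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pseudo_regions(grid: List[Vector], k: int, iters: int = 10, seed: int = 0) -> List[Vector]:
    """Group grid features with k-means (an assumed choice) and return
    one mean-pooled vector per cluster as a pseudo-region feature."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct grid positions.
    centroids = [list(v) for v in rng.sample(grid, k)]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        # Assignment step: each grid feature joins its nearest centroid.
        for v in grid:
            nearest = min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            clusters[nearest].append(v)
        # Update step: mean-pool each cluster; keep the old centroid if a cluster is empty.
        centroids = [
            mean_vector(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids


if __name__ == "__main__":
    # Four toy grid features forming two tight groups.
    toy_grid = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(sorted(pseudo_regions(toy_grid, k=2)))
```

No object detector or box annotation is involved here: the pseudo-regions are derived entirely from the grid features themselves, which is the property the abstract highlights. How DRTran actually pools, orders, or refines its clusters is not specified in the abstract.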
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.