End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration

Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao
{"title":"End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration","authors":"Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao","doi":"10.21655/ijsi.1673-7288.00316","DOIUrl":null,"url":null,"abstract":"PDF HTML XML Export Cite reminder End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration DOI: 10.21655/ijsi.1673-7288.00316 Author: Affiliation: Clc Number: Fund Project: Article | Figures | Metrics | Reference | Related | Cited by | Materials | Comments Abstract:To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of the object content by the model. Thus, the applicability of using pre-trained models for image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, can significantly improve the quality of image captioning, and achieves state-of-the-art performance. Reference Related Cited by","PeriodicalId":479632,"journal":{"name":"International Journal of Software and Informatics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Software and Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21655/ijsi.1673-7288.00316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

To date, Transformer-based pre-trained models have demonstrated powerful modality-representation capabilities, driving a shift toward a fully end-to-end paradigm for multimodal downstream tasks such as image captioning and enabling better performance and faster inference. However, the grid features extracted by a pre-trained model lack regional visual information, which leads the model to describe object content inaccurately; the applicability of pre-trained models to image captioning therefore remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, a visual region aggregation module aggregates grid features with similar semantics into a compact visual region representation. A dual-level collaboration module then uses cross-attention to learn more representative semantic information from the two visual feature levels, which in turn yields more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed VRADC significantly improves the quality of image captions and achieves state-of-the-art performance.
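
The abstract describes two mechanisms: aggregating semantically similar grid features into a compact set of region tokens, and a dual-level collaboration in which caption tokens cross-attend to both grid-level and region-level features. The PyTorch sketch below illustrates one plausible realization of these ideas; the module names (VisualRegionAggregation, DualLevelCollaboration), the soft-clustering aggregation, the gated fusion, and parameters such as num_regions are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract (assumed design).
import torch
import torch.nn as nn


class VisualRegionAggregation(nn.Module):
    """Aggregates grid features with similar semantics into a compact set of
    region tokens via soft assignment to learnable region centers (assumption)."""

    def __init__(self, dim: int, num_regions: int = 16):
        super().__init__()
        # Learnable region "centers" that grid cells are softly assigned to.
        self.region_queries = nn.Parameter(torch.randn(num_regions, dim))
        self.scale = dim ** -0.5

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (B, N, D) grid features from a pre-trained visual backbone.
        attn = torch.einsum("bnd,kd->bnk", grid, self.region_queries) * self.scale
        attn = attn.softmax(dim=-1)                       # (B, N, K) soft assignment
        # Weighted average of grid features per region -> compact region tokens.
        return torch.einsum("bnk,bnd->bkd", attn, grid)   # (B, K, D)


class DualLevelCollaboration(nn.Module):
    """Caption tokens cross-attend to grid- and region-level features; the two
    views are combined with a learned gate (one plausible fusion scheme)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.grid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text, grid, regions):
        # text: (B, T, D) decoder hidden states for the caption tokens.
        g, _ = self.grid_attn(text, grid, grid)            # attend to grid features
        r, _ = self.region_attn(text, regions, regions)    # attend to region features
        gate = torch.sigmoid(self.gate(torch.cat([g, r], dim=-1)))
        return gate * g + (1.0 - gate) * r                 # gated fusion of both views


if __name__ == "__main__":
    B, N, T, D = 2, 49, 12, 512
    grid = torch.randn(B, N, D)        # e.g. 7x7 grid of backbone features
    text = torch.randn(B, T, D)        # caption token states
    regions = VisualRegionAggregation(D)(grid)
    fused = DualLevelCollaboration(D)(text, grid, regions)
    print(regions.shape, fused.shape)  # (2, 16, 512) (2, 12, 512)
```

The soft assignment keeps the aggregation differentiable, consistent with the abstract's fully end-to-end training claim; the actual VRADC architecture may aggregate regions and fuse the two feature levels differently.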