ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor

IF 3.7 · CAS Zone 2 (Engineering & Technology) · JCR Q1 (Computer Science, Hardware & Architecture) · Displays · Pub Date: 2024-07-24 · DOI: 10.1016/j.displa.2024.102798
Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Mohammad Alamgir Hossain
{"title":"ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor","authors":"Md. Bipul Hossen,&nbsp;Zhongfu Ye,&nbsp;Amr Abdussalam,&nbsp;Mohammad Alamgir Hossain","doi":"10.1016/j.displa.2024.102798","DOIUrl":null,"url":null,"abstract":"<div><p>Fine-grained image captioning is a focal point in the vision-to-language task and has attracted considerable attention for generating accurate and contextually relevant image captions. Effective attribute prediction and their utilization play a crucial role in enhancing image captioning performance. Despite progress in prior attribute-related methods, they either focus on predicting attributes related to the input image or concentrate on predicting linguistic context-related attributes at each time step in the language model. However, these approaches often overlook the importance of balancing visual and linguistic contexts, leading to ineffective exploitation of semantic information and a subsequent decline in performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. Following this, an Enhanced Attribute Predictor (EAP) is proposed, initially predicting linguistic context-related attributes and then using prior probabilities from the IAP module to rebalance image and linguistic context-related attributes, thereby generating more robust and enhanced attribute probabilities. These refined attributes are then integrated into the language LSTM layer to ensure accurate word prediction at each time step. The integration of the IAP and EAP modules in our proposed image captioning with the enhanced attribute predictor (ICEAP) model effectively incorporates high-level semantic details, enhancing overall model performance. The ICEAP outperforms contemporary models, yielding significant average improvements of 10.62% in CIDEr-D scores for MS-COCO, 9.63% for Flickr30K and 7.74% for Flickr8K datasets using cross-entropy optimization, with qualitative analysis confirming its ability to generate fine-grained captions.</p></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"84 ","pages":"Article 102798"},"PeriodicalIF":3.7000,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938224001628","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Fine-grained image captioning is a focal point in vision-to-language research and has attracted considerable attention for generating accurate and contextually relevant image captions. Effective prediction and utilization of attributes play a crucial role in enhancing image captioning performance. Despite their progress, prior attribute-related methods either focus on predicting attributes related to the input image or on predicting linguistic-context-related attributes at each time step of the language model. These approaches often overlook the importance of balancing visual and linguistic contexts, leading to ineffective exploitation of semantic information and a subsequent decline in performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. An Enhanced Attribute Predictor (EAP) is then proposed, which first predicts linguistic-context-related attributes and then uses prior probabilities from the IAP module to rebalance image-related and linguistic-context-related attributes, generating more robust and enhanced attribute probabilities. These refined attributes are integrated into the language LSTM layer to ensure accurate word prediction at each time step. The integration of the IAP and EAP modules in the proposed Image Captioning with Enhanced Attribute Predictor (ICEAP) model effectively incorporates high-level semantic details, enhancing overall model performance. ICEAP outperforms contemporary models, yielding significant average CIDEr-D improvements of 10.62% on MS-COCO, 9.63% on Flickr30K, and 7.74% on Flickr8K under cross-entropy optimization, with qualitative analysis confirming its ability to generate fine-grained captions.
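The rebalancing step at the heart of the EAP can be pictured as mixing two attribute distributions: an image-level prior produced once by the IAP, and a per-time-step linguistic estimate derived from the language model's hidden state. The sketch below illustrates that idea in PyTorch; the module names, tensor shapes, and the learnable convex-combination gate are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch of the attribute-rebalancing idea described in the abstract.
# All names, shapes, and the mixing rule are illustrative assumptions; the
# paper's exact formulation may differ.
import torch
import torch.nn as nn


class EnhancedAttributePredictor(nn.Module):
    """Predicts linguistic-context attributes, then rebalances them with
    image-level attribute priors from an independent predictor (IAP)."""

    def __init__(self, hidden_dim: int, num_attributes: int):
        super().__init__()
        # Maps the language model's hidden state to attribute logits.
        self.context_head = nn.Linear(hidden_dim, num_attributes)
        # Learnable scalar gate controlling the visual/linguistic balance
        # (hypothetical; a fixed or input-dependent gate would also work).
        self.gate = nn.Parameter(torch.tensor(0.5))

    def forward(self, lstm_hidden: torch.Tensor, iap_prior: torch.Tensor) -> torch.Tensor:
        # Linguistic-context attribute probabilities at this time step.
        context_probs = torch.sigmoid(self.context_head(lstm_hidden))
        # Rebalance with the image-level prior from the IAP module,
        # keeping the mixing weight in (0, 1).
        g = torch.sigmoid(self.gate)
        return g * iap_prior + (1.0 - g) * context_probs


# Usage sketch: the refined attribute vector would then be fed, alongside the
# word embedding, into the language LSTM at each decoding step.
eap = EnhancedAttributePredictor(hidden_dim=512, num_attributes=1000)
hidden = torch.randn(8, 512)   # batch of language-LSTM hidden states
prior = torch.rand(8, 1000)    # IAP attribute probabilities for the images
refined = eap(hidden, prior)   # shape: (8, 1000)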

Source Journal

Displays (Engineering & Technology: Electrical & Electronic Engineering)

CiteScore: 4.60
Self-citation rate: 25.60%
Annual articles: 138
Review time: 92 days
Journal Introduction: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.
Latest Articles in This Journal

Mambav3d: A mamba-based virtual 3D module stringing semantic information between layers of medical image slices
Luminance decomposition and Transformer based no-reference tone-mapped image quality assessment
GLDBF: Global and local dual-branch fusion network for no-reference point cloud quality assessment
Virtual reality in medical education: Effectiveness of Immersive Virtual Anatomy Laboratory (IVAL) compared to traditional learning approaches
Weighted ensemble deep learning approach for classification of gastrointestinal diseases in colonoscopy images aided by explainable AI