ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor

IF 3.7; Zone 2 (Engineering & Technology); JCR Q1, Computer Science, Hardware & Architecture. Journal: Displays. Pub Date: 2024-07-24. DOI: 10.1016/j.displa.2024.102798

Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Mohammad Alamgir Hossain

Displays, volume 84, Article 102798. https://www.sciencedirect.com/science/article/pii/S0141938224001628

Citations: 0

Abstract

Fine-grained image captioning is a focal point of vision-to-language research and has attracted considerable attention for its goal of generating accurate, contextually relevant captions. Effective prediction and use of attributes play a crucial role in captioning performance. Prior attribute-based methods either predict attributes related to the input image or predict linguistic context-related attributes at each time step of the language model. These approaches often overlook the need to balance visual and linguistic contexts, which leads to ineffective exploitation of semantic information and, in turn, lower performance. To address these issues, an Independent Attribute Predictor (IAP) is introduced to precisely predict attributes related to the input image by leveraging relationships between visual objects and attribute embeddings. An Enhanced Attribute Predictor (EAP) is then proposed: it first predicts linguistic context-related attributes and then uses the prior probabilities from the IAP module to rebalance image-related and linguistic context-related attributes, producing more robust attribute probabilities. These refined attributes are fed into the language LSTM layer to support accurate word prediction at each time step. By integrating the IAP and EAP modules, the proposed image captioning with enhanced attribute predictor (ICEAP) model incorporates high-level semantic details and improves overall performance. With cross-entropy optimization, ICEAP outperforms contemporary models, yielding average CIDEr-D improvements of 10.62% on MS-COCO, 9.63% on Flickr30K, and 7.74% on Flickr8K, and qualitative analysis confirms its ability to generate fine-grained captions.
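The rebalancing idea in the abstract (image-level prior probabilities from the IAP adjusting the per-step, context-related attribute predictions of the EAP) can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption for illustration only: the function name `rebalance`, the mixing weight `alpha`, and the choice of a simple convex combination are hypothetical and are not the authors' method.

```python
import math


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rebalance(prior, context_logits, alpha=0.5):
    """Blend image-level attribute priors with per-step context scores.

    prior          -- attribute probabilities from an image-level predictor
                      (stands in for the IAP output; hypothetical interface)
    context_logits -- raw attribute scores from the language context at one
                      time step (stands in for the EAP's first stage)
    alpha          -- assumed mixing weight; 1.0 trusts only the image prior,
                      0.0 trusts only the linguistic context

    A convex combination is an assumption made for this sketch; the paper's
    actual rebalancing rule is not stated in the abstract.
    """
    if len(prior) != len(context_logits):
        raise ValueError("prior and context_logits must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    context = [sigmoid(z) for z in context_logits]
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prior, context)]


if __name__ == "__main__":
    # Two attributes: the image strongly supports the first one,
    # while the linguistic context is neutral on both (logit 0).
    print(rebalance([0.9, 0.1], [0.0, 0.0], alpha=0.5))
```

In this toy setting, a neutral context (probability 0.5) pulls a confident image prior of 0.9 to about 0.7 when both sources are weighted equally, which is the kind of balance between visual and linguistic evidence the abstract describes.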


Source journal

Displays (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

4.60

Self-citation rate

25.60%

Articles published

138

Review time

92 days

Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.