Eye-movement-prompted large image captioning model

IF 7.5; Zone 1, Computer Science; Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pattern Recognition; Pub Date: 2024-11-01; DOI: 10.1016/j.patcog.2024.111097

Zheng Yang , Bing Han , Xinbo Gao , Zhi-Hui Zhan

Pattern Recognition, volume 159, Article 111097. https://www.sciencedirect.com/science/article/pii/S0031320324008483


Abstract

Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, i.e., 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.
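The abstract does not specify how EMS or EMA are implemented, so the following is only a toy sketch of the general idea it describes: turning a scan path of eye fixations into a directed graph and passing features along its edges. The function names `scanpath_to_graph` and `propagate`, the grid discretization, and the mean-aggregation update are all illustrative assumptions, not the authors' method.

```python
import numpy as np


def scanpath_to_graph(fixations, grid=4):
    """Map an ordered fixation sequence to a directed graph over grid cells.

    fixations: list of (x, y) in normalized [0, 1] image coordinates,
               in viewing order (a hypothetical input format).
    Returns (nodes, adj): nodes lists the visited cell ids in first-visit
    order, and adj[i, j] counts gaze transitions from nodes[i] to nodes[j].
    """
    cells = []
    for x, y in fixations:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        cells.append(row * grid + col)

    nodes = list(dict.fromkeys(cells))  # unique cells, order preserved
    index = {cell: i for i, cell in enumerate(nodes)}

    adj = np.zeros((len(nodes), len(nodes)))
    for src, dst in zip(cells, cells[1:]):
        if src != dst:  # ignore consecutive fixations in the same cell
            adj[index[src], index[dst]] += 1.0
    return nodes, adj


def propagate(features, adj):
    """One mean-aggregation message-passing step along incoming edges."""
    incoming = adj.T  # row i collects the edges that point to node i
    degree = incoming.sum(axis=1, keepdims=True)
    aggregated = incoming @ features / np.maximum(degree, 1.0)
    return 0.5 * (features + aggregated)


# Usage: four fixations that visit three cells and return to the start.
nodes, adj = scanpath_to_graph([(0.1, 0.1), (0.6, 0.2), (0.6, 0.7), (0.1, 0.1)])
updated = propagate(np.eye(len(nodes)), adj)
```

In this sketch the edge direction follows gaze order, so each node's updated feature mixes in the features of the regions that were looked at immediately before it. A real model would presumably use learned node features and learned update weights rather than the identity features and fixed averaging used here.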




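Source journal

Pattern Recognition (Engineering and Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months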

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.