Eye-movement-prompted large image captioning model

IF 7.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pattern Recognition | Pub Date: 2024-11-01 | DOI: 10.1016/j.patcog.2024.111097
Zheng Yang, Bing Han, Xinbo Gao, Zhi-Hui Zhan
{"title":"眼动提示大图像字幕模型","authors":"Zheng Yang ,&nbsp;Bing Han ,&nbsp;Xinbo Gao ,&nbsp;Zhi-Hui Zhan","doi":"10.1016/j.patcog.2024.111097","DOIUrl":null,"url":null,"abstract":"<div><div>Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, <em>i.e.</em>, 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111097"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Eye-movement-prompted large image captioning model\",\"authors\":\"Zheng Yang ,&nbsp;Bing Han ,&nbsp;Xinbo Gao ,&nbsp;Zhi-Hui Zhan\",\"doi\":\"10.1016/j.patcog.2024.111097\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. 
One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, <em>i.e.</em>, 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"159 \",\"pages\":\"Article 111097\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320324008483\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008483","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, i.e., 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.
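The abstract describes the two modules only at a high level. Purely as an illustration of the general idea, the following minimal PyTorch sketch shows how a temporally ordered set of eye fixations might be abstracted into a directed scan-path graph and refined by one round of message passing before feeding a caption decoder. All names (build_fixation_graph, EyeMovementGNN), tensor shapes, and design choices are hypothetical assumptions and are not taken from the paper or its released code.

```python
# Minimal sketch (assumptions only): a directed graph over fixations in scan-path
# order, followed by one GNN-style message-passing step over fixation features.
import torch
import torch.nn as nn


def build_fixation_graph(fixations: torch.Tensor) -> torch.Tensor:
    """Build a directed adjacency matrix that follows the scan path.

    fixations: (N, 2) tensor of fixation coordinates in temporal order.
    Returns an (N, N) adjacency matrix with an edge from fixation i to i+1.
    """
    n = fixations.size(0)
    adj = torch.zeros(n, n)
    adj[torch.arange(n - 1), torch.arange(1, n)] = 1.0  # edges follow gaze order
    return adj


class EyeMovementGNN(nn.Module):
    """One round of message passing over the directed fixation graph (hypothetical)."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.msg = nn.Linear(in_dim, hidden_dim)      # transform neighbour features
        self.update = nn.GRUCell(hidden_dim, in_dim)  # fuse messages into node states

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Aggregate messages from each node's predecessors along the scan path.
        messages = adj.t() @ self.msg(node_feats)
        return self.update(messages, node_feats)


if __name__ == "__main__":
    # Five fixations on an image, each paired with a pooled local image feature.
    fixations = torch.rand(5, 2)        # (x, y) fixation positions, hypothetical
    region_feats = torch.rand(5, 256)   # image features around each fixation

    adj = build_fixation_graph(fixations)
    gnn = EyeMovementGNN(in_dim=256, hidden_dim=256)
    refined = gnn(region_feats, adj)    # graph-aware features for a caption decoder
    print(refined.shape)                # torch.Size([5, 256])
```

The paper's actual EMA module defines its own graph construction, node features, and GNN architecture; the sketch only conveys the directed scan-path idea in runnable form.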
Source Journal
Pattern Recognition
Category: Engineering & Technology; Engineering, Electrical & Electronic
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
Journal introduction: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
Latest articles in this journal:
Learning accurate and enriched features for stereo image super-resolution
Semi-supervised multi-view feature selection with adaptive similarity fusion and learning
DyConfidMatch: Dynamic thresholding and re-sampling for 3D semi-supervised learning
CAST: An innovative framework for Cross-dimensional Attention Structure in Transformers
Embedded feature selection for robust probability learning machines