Geometry-Sensitive Semantic Modeling in Visual and Visual-Language Domains for Image Captioning

IF 8.0 · CAS Tier 2 (Computer Science) · Q1 (Automation & Control Systems) · Engineering Applications of Artificial Intelligence · Pub Date: 2025-05-01 · Epub Date: 2025-02-25 · DOI: 10.1016/j.engappai.2025.110330
Wencai Zhu, Zetao Jiang, Yuting He
Engineering Applications of Artificial Intelligence, Volume 147, Article 110330.
Citations: 0

Abstract

Transformer-based models with grid features as visual representations perform well in image captioning. However, the division and flattening operations increase the difficulty of capturing objects and their relationships via pure semantic modeling. Furthermore, the natural language generated by the current Transformer model still suffers from semantic overconcentration. In this paper, we aim to improve the attention modules in two ways to solve the above issues. We first propose a Geometry-Sensitive Self-Attention (GSSA) module, subdivide geometric signals in the visual domain into relative position and distance, and assist the semantic modeling process according to their unique characteristics. It compensates for the lack of objects and their relationships in the grid features. Then, we propose a Geometry-Sensitive Cross-Attention (GSCA) module, which perceives the source neighboring relationships between images and text in the visual-language domain from a geometric perspective and uses these relationships to adjust the semantic correspondences between the two dynamically. It spreads overly focused attention to surrounding grids to improve understanding of full image content during captioning. To prove our designs, we apply GSSA and GSCA to a standard Transformer to construct a novel Geometry-Sensitive Transformer Network (GSTNet), which conducts geometry-sensitive semantic modeling in visual and visual-language domains. Extensive experiments are conducted to verify the effectiveness of our proposal. The results show that our GSTNet achieves superior performance compared to many state-of-the-art image captioning models on the Microsoft Common Objects in Context (MSCOCO) dataset. Besides, the generalization of GSTNet is also verified on the Flickr30k dataset.
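The abstract does not give the exact formulation of GSSA or GSCA. As a rough illustration of the underlying idea, the sketch below adds a distance-based additive bias to plain dot-product self-attention over flattened grid features, so that geometrically nearby grid cells receive higher attention weight. All function names and the linear distance penalty are illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grid_distance_bias(h, w, scale=1.0):
    # Pairwise Euclidean distances between cells of an h x w grid,
    # turned into an additive attention bias: farther cells are
    # penalized more (a hypothetical stand-in for GSSA's distance signal).
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return -scale * d

def geometry_sensitive_attention(x, h, w, scale=0.5):
    # x: (h*w, dim) flattened grid features. Standard scaled
    # dot-product self-attention, with the geometric bias added
    # to the logits before the softmax.
    dim = x.shape[-1]
    logits = x @ x.T / np.sqrt(dim) + grid_distance_bias(h, w, scale)
    return softmax(logits) @ x
```

A larger `scale` concentrates each cell's attention on its spatial neighbors; `scale=0` recovers vanilla self-attention. The same additive-bias pattern could in principle be applied to cross-attention logits to spread overly focused text-to-image attention onto neighboring grids, which is the role the abstract describes for GSCA.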
Source journal: Engineering Applications of Artificial Intelligence (Engineering & Technology: Electrical & Electronic Engineering)
CiteScore: 9.60
Self-citation rate: 10.00%
Articles per year: 505
Review time: 68 days
Journal description: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.