{"title":"图像标注的视觉和视觉语言领域的几何敏感语义建模","authors":"Wencai Zhu, Zetao Jiang, Yuting He","doi":"10.1016/j.engappai.2025.110330","DOIUrl":null,"url":null,"abstract":"<div><div>Transformer-based models with grid features as visual representations perform well in image captioning. However, the division and flattening operations increase the difficulty of capturing objects and their relationships via pure semantic modeling. Furthermore, the natural language generated by the current Transformer model still suffers from semantic overconcentration. In this paper, we aim to improve the attention modules in two ways to solve the above issues. We first propose a Geometry-Sensitive Self-Attention (GSSA) module, subdivide geometric signals in the visual domain into relative position and distance, and assist the semantic modeling process according to their unique characteristics. It compensates for the lack of objects and their relationships in the grid features. Then, we propose a Geometry-Sensitive Cross-Attention (GSCA) module, which perceives the source neighboring relationships between images and text in the visual-language domain from a geometric perspective and uses these relationships to adjust the semantic correspondences between the two dynamically. It spreads overly focused attention to surrounding grids to improve understanding of full image content during captioning. To prove our designs, we apply GSSA and GSCA to a standard Transformer to construct a novel Geometry-Sensitive Transformer Network (GSTNet), which conducts geometry-sensitive semantic modeling in visual and visual-language domains. Extensive experiments are conducted to verify the effectiveness of our proposal. The results show that our GSTNet achieves superior performance compared to many state-of-the-art image captioning models on the Microsoft Common Objects in Context (MSCOCO) dataset. 
Besides, the generalization of GSTNet is also verified on the Flickr30k dataset.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"147 ","pages":"Article 110330"},"PeriodicalIF":8.0000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Geometry-sensitive semantic modeling in visual and visual-language domains for image captioning\",\"authors\":\"Wencai Zhu, Zetao Jiang, Yuting He\",\"doi\":\"10.1016/j.engappai.2025.110330\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Transformer-based models with grid features as visual representations perform well in image captioning. However, the division and flattening operations increase the difficulty of capturing objects and their relationships via pure semantic modeling. Furthermore, the natural language generated by the current Transformer model still suffers from semantic overconcentration. In this paper, we aim to improve the attention modules in two ways to solve the above issues. We first propose a Geometry-Sensitive Self-Attention (GSSA) module, subdivide geometric signals in the visual domain into relative position and distance, and assist the semantic modeling process according to their unique characteristics. It compensates for the lack of objects and their relationships in the grid features. Then, we propose a Geometry-Sensitive Cross-Attention (GSCA) module, which perceives the source neighboring relationships between images and text in the visual-language domain from a geometric perspective and uses these relationships to adjust the semantic correspondences between the two dynamically. It spreads overly focused attention to surrounding grids to improve understanding of full image content during captioning. 
To prove our designs, we apply GSSA and GSCA to a standard Transformer to construct a novel Geometry-Sensitive Transformer Network (GSTNet), which conducts geometry-sensitive semantic modeling in visual and visual-language domains. Extensive experiments are conducted to verify the effectiveness of our proposal. The results show that our GSTNet achieves superior performance compared to many state-of-the-art image captioning models on the Microsoft Common Objects in Context (MSCOCO) dataset. Besides, the generalization of GSTNet is also verified on the Flickr30k dataset.</div></div>\",\"PeriodicalId\":50523,\"journal\":{\"name\":\"Engineering Applications of Artificial Intelligence\",\"volume\":\"147 \",\"pages\":\"Article 110330\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2025-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering Applications of Artificial Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0952197625003306\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/25 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625003306","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/25 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Geometry-sensitive semantic modeling in visual and visual-language domains for image captioning
Transformer-based models that use grid features as visual representations perform well in image captioning. However, the division and flattening operations make it harder to capture objects and their relationships through pure semantic modeling. Furthermore, the natural language generated by current Transformer models still suffers from semantic overconcentration. In this paper, we improve the attention modules in two ways to address these issues. We first propose a Geometry-Sensitive Self-Attention (GSSA) module, which subdivides geometric signals in the visual domain into relative position and distance and assists the semantic modeling process according to their distinct characteristics. This compensates for the missing object and relationship cues in grid features. We then propose a Geometry-Sensitive Cross-Attention (GSCA) module, which perceives the neighboring relationships between image and text in the visual-language domain from a geometric perspective and uses these relationships to dynamically adjust the semantic correspondences between the two. This spreads overly concentrated attention to surrounding grids, improving understanding of the full image content during captioning. To validate our designs, we apply GSSA and GSCA to a standard Transformer to construct a novel Geometry-Sensitive Transformer Network (GSTNet), which conducts geometry-sensitive semantic modeling in both the visual and visual-language domains. Extensive experiments verify the effectiveness of our proposal. The results show that GSTNet outperforms many state-of-the-art image captioning models on the Microsoft Common Objects in Context (MSCOCO) dataset. The generalization ability of GSTNet is further verified on the Flickr30k dataset.
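The abstract does not give the exact formulation of GSSA, but the general idea of injecting geometric signals from grid features into self-attention can be sketched as an additive bias on the attention scores. The sketch below is a minimal illustration under assumptions, not the paper's method: it assumes one attention head, an inverse-distance bias between grid-cell centers with a hypothetical weight `lam`, and omits relative-position encodings and all learned bias parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_biased_attention(q, k, v, centers, lam=1.0):
    """Illustrative geometry-biased self-attention over grid features.

    q, k, v : (n, d) query/key/value vectors for n grid cells.
    centers : (n, 2) spatial (x, y) centers of the grid cells.
    lam     : hypothetical weight of the distance bias (not from the paper).
    """
    d = q.shape[-1]
    # Semantic affinity: standard scaled dot-product scores.
    scores = q @ k.T / np.sqrt(d)
    # Geometric signal: pairwise Euclidean distance between grid centers.
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Penalize distant cells so attention favors spatial neighbors.
    attn = softmax(scores - lam * dist, axis=-1)
    return attn @ v, attn

# Example: a 2x2 grid of feature vectors.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out, attn = geometry_biased_attention(q, k, v, centers)
```

Setting `lam=0` recovers plain semantic attention, which makes the role of the geometric term easy to inspect in isolation.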
Journal overview:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.