Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

IF 2.6 | Zone 3 (Engineering & Technology) | Q2 Computer Science, Information Systems | Electronics | Pub Date: 2024-09-11 | DOI: 10.3390/electronics13183605

An Zhao, Wenzhong Yang, Danny Chen, Fuyuan Wei

Citations: 0

Abstract

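Remote-sensing image captioning (RSIC) aims to generate descriptive sentences for remote-sensing images by capturing both local and global semantic information. The task is challenging because object types are diverse and scenes vary widely across such images. To address these challenges, we propose a positional-channel semantic fusion transformer (PCSFTr). The PCSFTr model first uses scene classification to extract visual features and learn semantic information. A novel positional-channel multi-headed self-attention (PCMSA) block then captures spatial and channel dependencies simultaneously, enriching the semantic information. A feature fusion (FF) module further strengthens the model's understanding of semantic relationships. Experimental results show that PCSFTr significantly outperforms existing methods: the BLEU-4 score reached 78.42% on UCM-caption, 54.42% on RSICD, and 69.01% on NWPU-captions. This research offers new insights into RSIC by providing a more comprehensive understanding of the semantic information and relationships within images and by improving the performance of image captioning models.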
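For readers unfamiliar with the BLEU-4 metric quoted in the abstract, the following is a minimal sketch of how such a score can be computed for one caption. It is a simplified, unsmoothed, sentence-level version written for illustration; the abstract does not say which evaluation toolkit the authors used, and the function names `ngrams` and `bleu4` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Simplified sentence-level BLEU-4 without smoothing (illustrative only).

    candidate: list of tokens; references: non-empty list of token lists.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the four modified precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    cand = "many planes are parked at the airport".split()
    refs = ["many planes are parked near the airport".split()]
    print(f"BLEU-4 = {bleu4(cand, refs):.4f}")
```

The function returns a value between 0 and 1. Percentages such as those in the abstract are usually this value multiplied by 100 and aggregated over a whole test set rather than computed per sentence, so this sketch will not reproduce the reported numbers.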

Source journal

Electronics (Computer Science: Computer Networks and Communications)

CiteScore: 1.10

Self-citation rate: 10.30%

Articles published: 3515

Review time: 16.71 days

Journal description: Electronics (ISSN 2079-9292; CODEN: ELECGJ) is an international, open access journal on the science of electronics and its applications published quarterly online by MDPI.