NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning

IEEE Transactions on Geoscience and Remote Sensing · IF 7.5 · CAS Tier 1 (Earth Science) · Q1 (Engineering, Electrical & Electronic) · Pub Date: 2022-08-24 · DOI: 10.1109/TGRS.2022.3201474
Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang
{"title":"用于遥感图像字幕的NWPU字幕数据集和MLCA网络","authors":"Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang","doi":"10.1109/TGRS.2022.3201474","DOIUrl":null,"url":null,"abstract":"Recently, the burgeoning demands for captioning-related applications have inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset termed NWPU-Captions is available at \n<uri>https://github.com/HaiyanHuang98/NWPU-Captions</uri>\n. NWPU-Captions contains 157 500 sentences, with all 31 500 images annotated manually by seven experienced volunteers. The superiority of NWPU-Captions over current publicly available benchmark datasets not only lies in its much larger scale but also in its wider coverage of complex scenes and the richness and variety of describing vocabularies. Furthermore, a novel encoder–decoder architecture, multilevel and contextual attention network (MLCA-Net), is proposed. MLCA-Net employs a multilevel attention module to adaptively aggregate image features of specific spatial regions and scales and introduces a contextual attention module to explore the latent context hidden in remote sensing images. MLCA-Net improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness by exploring the properties of scale variations and semantic ambiguity. Finally, the effectiveness, robustness, and generalization of MLCA-Net are proved through extensive experiments on existing datasets and NWPU-Captions.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"60 ","pages":"1-19"},"PeriodicalIF":7.5000,"publicationDate":"2022-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning\",\"authors\":\"Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang\",\"doi\":\"10.1109/TGRS.2022.3201474\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, the burgeoning demands for captioning-related applications have inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset termed NWPU-Captions is available at \\n<uri>https://github.com/HaiyanHuang98/NWPU-Captions</uri>\\n. NWPU-Captions contains 157 500 sentences, with all 31 500 images annotated manually by seven experienced volunteers. The superiority of NWPU-Captions over current publicly available benchmark datasets not only lies in its much larger scale but also in its wider coverage of complex scenes and the richness and variety of describing vocabularies. Furthermore, a novel encoder–decoder architecture, multilevel and contextual attention network (MLCA-Net), is proposed. 
MLCA-Net employs a multilevel attention module to adaptively aggregate image features of specific spatial regions and scales and introduces a contextual attention module to explore the latent context hidden in remote sensing images. MLCA-Net improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness by exploring the properties of scale variations and semantic ambiguity. Finally, the effectiveness, robustness, and generalization of MLCA-Net are proved through extensive experiments on existing datasets and NWPU-Captions.\",\"PeriodicalId\":13213,\"journal\":{\"name\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"volume\":\"60 \",\"pages\":\"1-19\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2022-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/9866055/\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/9866055/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 15

Abstract

Recently, the burgeoning demands for captioning-related applications have inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset, termed NWPU-Captions, which is available at https://github.com/HaiyanHuang98/NWPU-Captions. NWPU-Captions contains 157 500 sentences, with all 31 500 images annotated manually by seven experienced volunteers. The superiority of NWPU-Captions over current publicly available benchmark datasets not only lies in its much larger scale but also in its wider coverage of complex scenes and the richness and variety of describing vocabularies. Furthermore, a novel encoder–decoder architecture, multilevel and contextual attention network (MLCA-Net), is proposed. MLCA-Net employs a multilevel attention module to adaptively aggregate image features of specific spatial regions and scales and introduces a contextual attention module to explore the latent context hidden in remote sensing images. MLCA-Net improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness by exploring the properties of scale variations and semantic ambiguity. Finally, the effectiveness, robustness, and generalization of MLCA-Net are proved through extensive experiments on existing datasets and NWPU-Captions.
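The abstract only outlines the architecture, so the following minimal PyTorch sketch illustrates how a multilevel attention module and a contextual attention module could feed an LSTM caption decoder. All module names, feature dimensions, and the fusion strategy (level gating, spatial attention, self-attention for scene context) are assumptions made for illustration; this is not the authors' implementation, which is available from the GitHub link above.

```python
# Illustrative sketch only: a multilevel-plus-contextual attention captioner
# in the spirit of MLCA-Net as described in the abstract. Names, dimensions,
# and the fusion strategy are assumptions, not the published implementation.
import torch
import torch.nn as nn


class MultilevelAttention(nn.Module):
    """Fuse feature maps from several encoder levels, then attend spatially.

    All level features are assumed to be projected to a common shape (B, N, D).
    """

    def __init__(self, dim):
        super().__init__()
        self.level_score = nn.Linear(dim, 1)        # importance of each level
        self.spatial_score = nn.Linear(2 * dim, 1)  # score per spatial location

    def forward(self, level_feats, hidden):
        # level_feats: list of (B, N, D); hidden: (B, D) decoder state
        stacked = torch.stack(level_feats, dim=1)                    # (B, L, N, D)
        level_w = torch.softmax(
            self.level_score(stacked.mean(dim=2)), dim=1)            # (B, L, 1)
        fused = (level_w.unsqueeze(2) * stacked).sum(dim=1)          # (B, N, D)
        h = hidden.unsqueeze(1).expand(-1, fused.size(1), -1)        # (B, N, D)
        alpha = torch.softmax(
            self.spatial_score(torch.cat([fused, h], dim=-1)), dim=1)  # (B, N, 1)
        context = (alpha * fused).sum(dim=1)                         # (B, D)
        return context, fused


class ContextualAttention(nn.Module):
    """Self-attention over spatial features to capture latent scene context."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        out, _ = self.attn(feats, feats, feats)   # (B, N, D)
        return out.mean(dim=1)                    # (B, D) global context summary


class CaptionDecoder(nn.Module):
    """LSTM decoder conditioned on multilevel and contextual attention."""

    def __init__(self, vocab_size, dim=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ml_attn = MultilevelAttention(dim)
        self.ctx_attn = ContextualAttention(dim)
        self.lstm = nn.LSTMCell(embed_dim + 2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, level_feats, captions):
        # level_feats: list of (B, N, D) CNN pyramid features; captions: (B, T) ids
        B, T = captions.shape
        h = level_feats[-1].mean(dim=1)            # init state from top level
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            ml_ctx, fused = self.ml_attn(level_feats, h)
            scene_ctx = self.ctx_attn(fused)
            x = torch.cat([self.embed(captions[:, t]), ml_ctx, scene_ctx], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)          # (B, T, vocab)


# Toy usage with random tensors standing in for CNN pyramid features.
feats = [torch.randn(2, 49, 512) for _ in range(3)]   # 3 levels, 7x7 grid each
caps = torch.randint(0, 1000, (2, 12))                # batch of token ids
model = CaptionDecoder(vocab_size=1000)
print(model(feats, caps).shape)                        # torch.Size([2, 12, 1000])
```

In this sketch the level gate weighs the contribution of each pyramid level before spatial attention picks regions, and the contextual module summarizes co-occurrence relations across the fused feature map; how the published MLCA-Net realizes these two ideas may differ in detail.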
Source Journal

IEEE Transactions on Geoscience and Remote Sensing (Engineering Technology - Geochemistry & Geophysics)

CiteScore: 11.50 · Self-citation rate: 28.00% · Articles per year: 1912 · Review time: 4.0 months

Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.