Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang
{"title":"用于遥感图像字幕的NWPU字幕数据集和MLCA网络","authors":"Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang","doi":"10.1109/TGRS.2022.3201474","DOIUrl":null,"url":null,"abstract":"Recently, the burgeoning demands for captioning-related applications have inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset termed NWPU-Captions is available at \n<uri>https://github.com/HaiyanHuang98/NWPU-Captions</uri>\n. NWPU-Captions contains 157 500 sentences, with all 31 500 images annotated manually by seven experienced volunteers. The superiority of NWPU-Captions over current publicly available benchmark datasets not only lies in its much larger scale but also in its wider coverage of complex scenes and the richness and variety of describing vocabularies. Furthermore, a novel encoder–decoder architecture, multilevel and contextual attention network (MLCA-Net), is proposed. MLCA-Net employs a multilevel attention module to adaptively aggregate image features of specific spatial regions and scales and introduces a contextual attention module to explore the latent context hidden in remote sensing images. MLCA-Net improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness by exploring the properties of scale variations and semantic ambiguity. 
Finally, the effectiveness, robustness, and generalization of MLCA-Net are proved through extensive experiments on existing datasets and NWPU-Captions.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"60 ","pages":"1-19"},"PeriodicalIF":7.5000,"publicationDate":"2022-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning\",\"authors\":\"Qimin Cheng;Haiyan Huang;Yuan Xu;Yuzhuo Zhou;Huanying Li;Zhongyuan Wang\",\"doi\":\"10.1109/TGRS.2022.3201474\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, the burgeoning demands for captioning-related applications have inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset termed NWPU-Captions is available at \\n<uri>https://github.com/HaiyanHuang98/NWPU-Captions</uri>\\n. NWPU-Captions contains 157 500 sentences, with all 31 500 images annotated manually by seven experienced volunteers. The superiority of NWPU-Captions over current publicly available benchmark datasets not only lies in its much larger scale but also in its wider coverage of complex scenes and the richness and variety of describing vocabularies. Furthermore, a novel encoder–decoder architecture, multilevel and contextual attention network (MLCA-Net), is proposed. MLCA-Net employs a multilevel attention module to adaptively aggregate image features of specific spatial regions and scales and introduces a contextual attention module to explore the latent context hidden in remote sensing images. 
MLCA-Net improves the flexibility and diversity of the generated captions while keeping their accuracy and conciseness by exploring the properties of scale variations and semantic ambiguity. Finally, the effectiveness, robustness, and generalization of MLCA-Net are proved through extensive experiments on existing datasets and NWPU-Captions.\",\"PeriodicalId\":13213,\"journal\":{\"name\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"volume\":\"60 \",\"pages\":\"1-19\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2022-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/9866055/\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/9866055/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning
Recently, the burgeoning demand for captioning-related applications has inspired great endeavors in the remote sensing community. However, current benchmark datasets are deficient in data volume, category variety, and description richness, which hinders the advancement of new remote sensing image captioning approaches, especially those based on deep learning. To overcome this limitation, we present a larger and more challenging benchmark dataset, termed NWPU-Captions, available at https://github.com/HaiyanHuang98/NWPU-Captions. NWPU-Captions contains 157,500 sentences, with all 31,500 images annotated manually by seven experienced volunteers. Its superiority over current publicly available benchmark datasets lies not only in its much larger scale but also in its wider coverage of complex scenes and in the richness and variety of its describing vocabulary. Furthermore, a novel encoder-decoder architecture, the multilevel and contextual attention network (MLCA-Net), is proposed. MLCA-Net employs a multilevel attention module to adaptively aggregate image features at specific spatial regions and scales, and introduces a contextual attention module to explore the latent context hidden in remote sensing images. By exploiting the properties of scale variation and semantic ambiguity, MLCA-Net improves the flexibility and diversity of the generated captions while preserving their accuracy and conciseness. Finally, the effectiveness, robustness, and generalization of MLCA-Net are demonstrated through extensive experiments on existing datasets and NWPU-Captions.
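The abstract describes a multilevel attention module that adaptively weights encoder features drawn from different scales. The paper's actual implementation is not given here, so the following is only a generic NumPy sketch of that idea: per-level feature vectors are scored against a decoder query, normalized with a softmax, and fused by a weighted sum. All function and variable names are illustrative, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multilevel_attention(level_feats, query):
    """Fuse feature vectors from several encoder levels into one vector.

    level_feats: list of (d,) arrays, one per spatial scale/level.
    query:       (d,) decoder state used to score each level.
    Returns (fused, weights): the attention-weighted sum over levels
    and the normalized per-level weights.
    """
    F = np.stack(level_feats)      # (L, d) one row per level
    scores = F @ query             # (L,) relevance of each level
    weights = softmax(scores)      # (L,) non-negative, sums to 1
    fused = weights @ F            # (d,) adaptive aggregation
    return fused, weights

# Tiny usage example with random features at three scales.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(8) for _ in range(3)]
fused, w = multilevel_attention(feats, rng.standard_normal(8))
```

A contextual attention module would score spatial positions within a feature map in the same fashion; the mechanics (score, softmax, weighted sum) are identical, only the axis being attended over differs.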
Journal introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.