{"title":"用于无人机遥感图像语义分割的混合 CNN 和变压器网络","authors":"Xuanyu Zhou;Lifan Zhou;Shengrong Gong;Haizhen Zhang;Shan Zhong;Yu Xia;Yizhou Huang","doi":"10.1109/JMASS.2023.3332948","DOIUrl":null,"url":null,"abstract":"Semantic segmentation of unmanned aerial vehicle (UAV) remote sensing images is a recent research hotspot, offering technical support for diverse types of UAV remote sensing missions. However, unlike general scene images, UAV remote sensing images present inherent challenges. These challenges include the complexity of backgrounds, substantial variations in target scales, and dense arrangements of small targets, which severely hinder the accuracy of semantic segmentation. To address these issues, we propose a convolutional neural network (CNN) and transformer hybrid network for semantic segmentation of UAV remote sensing images. The proposed network follows an encoder–decoder architecture that merges a transformer-based encoder with a CNN-based decoder. First, we incorporate the Swin transformer as the encoder to address the limitations of CNN in global modeling, mitigating the interference caused by complex background information. Second, to effectively handle the significant changes in target scales, we design the multiscale feature integration module (MFIM) that enhances the multiscale feature representation capability of the network. Finally, the semantic feature fusion module (SFFM) is designed to filter the redundant noise during the feature fusion process, which improves the recognition of small targets and edges. 
Experimental results demonstrate that the proposed method outperforms other popular methods on the UAVid and Aeroscapes datasets.","PeriodicalId":100624,"journal":{"name":"IEEE Journal on Miniaturization for Air and Space Systems","volume":"5 1","pages":"33-41"},"PeriodicalIF":0.0000,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hybrid CNN and Transformer Network for Semantic Segmentation of UAV Remote Sensing Images\",\"authors\":\"Xuanyu Zhou;Lifan Zhou;Shengrong Gong;Haizhen Zhang;Shan Zhong;Yu Xia;Yizhou Huang\",\"doi\":\"10.1109/JMASS.2023.3332948\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Semantic segmentation of unmanned aerial vehicle (UAV) remote sensing images is a recent research hotspot, offering technical support for diverse types of UAV remote sensing missions. However, unlike general scene images, UAV remote sensing images present inherent challenges. These challenges include the complexity of backgrounds, substantial variations in target scales, and dense arrangements of small targets, which severely hinder the accuracy of semantic segmentation. To address these issues, we propose a convolutional neural network (CNN) and transformer hybrid network for semantic segmentation of UAV remote sensing images. The proposed network follows an encoder–decoder architecture that merges a transformer-based encoder with a CNN-based decoder. First, we incorporate the Swin transformer as the encoder to address the limitations of CNN in global modeling, mitigating the interference caused by complex background information. Second, to effectively handle the significant changes in target scales, we design the multiscale feature integration module (MFIM) that enhances the multiscale feature representation capability of the network. 
Finally, the semantic feature fusion module (SFFM) is designed to filter the redundant noise during the feature fusion process, which improves the recognition of small targets and edges. Experimental results demonstrate that the proposed method outperforms other popular methods on the UAVid and Aeroscapes datasets.\",\"PeriodicalId\":100624,\"journal\":{\"name\":\"IEEE Journal on Miniaturization for Air and Space Systems\",\"volume\":\"5 1\",\"pages\":\"33-41\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-11-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Miniaturization for Air and Space Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10319338/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Miniaturization for Air and Space Systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10319338/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hybrid CNN and Transformer Network for Semantic Segmentation of UAV Remote Sensing Images
Semantic segmentation of unmanned aerial vehicle (UAV) remote sensing images is a recent research hotspot, offering technical support for diverse types of UAV remote sensing missions. However, unlike general scene images, UAV remote sensing images present inherent challenges. These challenges include the complexity of backgrounds, substantial variations in target scales, and dense arrangements of small targets, which severely hinder the accuracy of semantic segmentation. To address these issues, we propose a convolutional neural network (CNN) and transformer hybrid network for semantic segmentation of UAV remote sensing images. The proposed network follows an encoder–decoder architecture that merges a transformer-based encoder with a CNN-based decoder. First, we incorporate the Swin transformer as the encoder to address the limitations of CNN in global modeling, mitigating the interference caused by complex background information. Second, to effectively handle the significant changes in target scales, we design the multiscale feature integration module (MFIM) that enhances the multiscale feature representation capability of the network. Finally, the semantic feature fusion module (SFFM) is designed to filter the redundant noise during the feature fusion process, which improves the recognition of small targets and edges. Experimental results demonstrate that the proposed method outperforms other popular methods on the UAVid and Aeroscapes datasets.
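The fusion step the abstract describes for the SFFM (combining shallow high-resolution features with deeper semantic features while suppressing redundant noise) can be sketched minimally in NumPy. This is an illustrative assumption about how a gated feature fusion of this kind might operate, not the authors' implementation: the function names, the nearest-neighbor upsampling, and the sigmoid gate are all hypothetical stand-ins for details the abstract does not give.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def gated_fusion(high_res, low_res):
    """Fuse a high-resolution (C, 2H, 2W) detail map with a low-resolution
    (C, H, W) semantic map: a sigmoid gate derived from the upsampled
    semantic features reweights the detail features element-wise, damping
    responses the deeper branch treats as background noise."""
    semantic = upsample2x(low_res)
    gate = sigmoid(semantic)            # per-element weight in (0, 1)
    return gate * high_res + semantic   # gated detail + semantic context

rng = np.random.default_rng(0)
high = rng.standard_normal((4, 8, 8))   # shallow, high-resolution features
low = rng.standard_normal((4, 4, 4))    # deep, semantic features
fused = gated_fusion(high, low)
print(fused.shape)  # (4, 8, 8)
```

In a trained network the gate would be produced by learned convolutions rather than a raw sigmoid of the features, but the shape bookkeeping and the reweight-then-add pattern are the same.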