DCTNET: HYBRID NETWORK MODEL FUSING WITH MULTISCALE DEFORMABLE CNN AND TRANSFORMER STRUCTURE FOR ROAD EXTRACTION FROM GAOFEN SATELLITE REMOTE SENSING IMAGE
{"title":"DCTNET: HYBRID NETWORK MODEL FUSING WITH MULTISCALE DEFORMABLE CNN AND TRANSFORMER STRUCTURE FOR ROAD EXTRACTION FROM GAOFEN SATELLITE REMOTE SENSING IMAGE","authors":"Q. Yuan","doi":"10.5194/isprs-archives-xlviii-m-3-2023-273-2023","DOIUrl":null,"url":null,"abstract":"Abstract. The urban road network detection and extraction have significant applications in many domains, such as intelligent transportation and navigation, urban planning, and automatic driving. Although manual annotation methods can provide accurate road network maps, their low efficiency with high-cost consumption are insufficient for the current tasks. Traditional methods based on spectral or geometric information rely on shallow features and often struggle with low semantic segmentation accuracy in complex remote sensing backgrounds. In recent years, deep convolutional neural networks (CNN) have provided robust feature representations to distinguish complex terrain objects. However, these CNNs ignore the fusion of global-local contexts and are often confused with other types of features, especially buildings. In addition, conventional convolution operations use a fixed template paradigm to aggregate local feature information. The road features present complex linear-shape geometric relationships, which brings some obstacles to feature construction. To address the above issues, we proposed a hybrid network structure that combines the advantages of CNN and transformer models. Specifically, a multiscale deformable convolution module has been developed to capture local road context information adaptively. The Transformer model is introduced into the encoder to enhance semantic information to build the global context. Meanwhile, the CNN features are fused with the transformer features. Finally, the model outputs a road extraction prediction map in high spatial resolution. Quantitative analysis and visual expression confirm that the proposed model can effectively and automatically extract road features from complex remote sensing backgrounds, outperforming state-of-the-art methods with IOU by 86.5% and OA by 97.4%.\n","PeriodicalId":30634,"journal":{"name":"The International Archives of the Photogrammetry Remote Sensing and Spatial Information Sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International Archives of the Photogrammetry Remote Sensing and Spatial Information Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5194/isprs-archives-xlviii-m-3-2023-273-2023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Social Sciences","Score":null,"Total":0}
引用次数: 0
Abstract
Urban road network detection and extraction have significant applications in many domains, such as intelligent transportation and navigation, urban planning, and autonomous driving. Although manual annotation can produce accurate road network maps, its low efficiency and high cost make it insufficient for current tasks. Traditional methods based on spectral or geometric information rely on shallow features and often suffer from low semantic segmentation accuracy against complex remote sensing backgrounds. In recent years, deep convolutional neural networks (CNNs) have provided robust feature representations for distinguishing complex terrain objects. However, these CNNs ignore the fusion of global and local contexts and often confuse roads with other types of features, especially buildings. In addition, conventional convolution operations use a fixed template paradigm to aggregate local feature information, while road features exhibit complex linear-shaped geometric relationships, which poses obstacles to feature construction. To address these issues, we propose a hybrid network structure that combines the advantages of CNN and Transformer models. Specifically, a multiscale deformable convolution module is developed to adaptively capture local road context. A Transformer is introduced into the encoder to enhance semantic information and build the global context, and the CNN features are fused with the Transformer features. Finally, the model outputs a road extraction prediction map at high spatial resolution. Quantitative analysis and visual inspection confirm that the proposed model can effectively and automatically extract road features from complex remote sensing backgrounds, outperforming state-of-the-art methods with an IoU of 86.5% and an OA of 97.4%.
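To illustrate the multiscale deformable convolution idea described in the abstract, here is a minimal sketch in PyTorch. The module name `MultiscaleDeformableBlock`, the channel sizes, the dilation rates, and the concatenate-then-1x1 fusion are assumptions for illustration only, not the authors' exact design. It uses `torchvision.ops.DeformConv2d`, where a plain convolution predicts per-tap sampling offsets so the kernel grid can bend along linear road structures.

```python
# Minimal sketch of a multiscale deformable convolution block.
# Assumption: this is an illustrative stand-in, not the paper's exact module.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiscaleDeformableBlock(nn.Module):
    """Aggregates local road context at several dilation rates with
    deformable convolutions, letting the sampling grid follow
    linear-shaped road structures."""

    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4), k: int = 3):
        super().__init__()
        self.offset_convs = nn.ModuleList()
        self.deform_convs = nn.ModuleList()
        for d in dilations:
            pad = d * (k - 1) // 2  # keep spatial size
            # A plain conv predicts 2 offsets (dx, dy) per kernel tap.
            offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=pad, dilation=d)
            # Zero-init offsets so training starts from a regular grid.
            nn.init.zeros_(offset_conv.weight)
            nn.init.zeros_(offset_conv.bias)
            self.offset_convs.append(offset_conv)
            self.deform_convs.append(
                DeformConv2d(in_ch, out_ch, k, padding=pad, dilation=d))
        # 1x1 conv fuses the concatenated multiscale responses.
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = []
        for offset_conv, deform_conv in zip(self.offset_convs, self.deform_convs):
            offset = offset_conv(x)                 # (N, 2*k*k, H, W)
            branches.append(deform_conv(x, offset))
        return self.fuse(torch.cat(branches, dim=1))

if __name__ == "__main__":
    block = MultiscaleDeformableBlock(in_ch=64, out_ch=64)
    y = block(torch.randn(1, 64, 128, 128))
    print(y.shape)  # torch.Size([1, 64, 128, 128])
```

In this sketch, each dilation rate forms a separate branch, so small dilations capture narrow roads while larger ones cover wide roads and junctions; the zero-initialized offset predictor makes each branch behave like an ordinary dilated convolution at the start of training.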
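For reference, the two reported evaluation metrics can be computed as below. This is a minimal sketch assuming binary road masks; the function name `iou_and_oa` and the toy arrays are hypothetical.

```python
# Minimal sketch of the reported metrics (IoU and Overall Accuracy),
# assuming boolean masks where True marks road pixels.
import numpy as np

def iou_and_oa(pred: np.ndarray, gt: np.ndarray):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0  # Intersection over Union
    oa = (pred == gt).mean()               # Overall Accuracy over all pixels
    return iou, oa

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(iou_and_oa(pred, gt))  # (0.5, 0.666...)
```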