Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing
Authors: Kaixuan Lu; Ruiqian Zhang; Xiao Huang; Yuxing Xie; Xiaogang Ning; Hanchao Zhang; Mengke Yuan; Pan Zhang; Tao Wang; Tongkui Liao
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-13 (impact factor 8.6; JCR Q1, Engineering, Electrical & Electronic)
Published: 2025-02-13
Publication type: Journal Article
DOI: 10.1109/TGRS.2025.3541390
URL: https://ieeexplore.ieee.org/document/10884596/
Recent self-supervised learning (SSL) methods have demonstrated impressive results in learning visual representations from unlabeled remote sensing (RS) images. However, most RS images consist predominantly of scenographic scenes that contain multiple ground objects and no explicit foreground target, which limits the performance of existing SSL methods that focus on foreground targets. This raises the question: Is there a method that can automatically aggregate similar objects within scenographic RS images, thereby enabling models to differentiate the knowledge embedded in various geospatial patterns and improve feature representation? In this work, we present the pattern integration and enhancement vision transformer (PIEViT), a novel SSL framework designed specifically for RS imagery. PIEViT uses a teacher-student architecture to address both image-level and patch-level tasks. It employs a proposed geospatial pattern cohesion (GPC) module to explore the natural clustering of patches, enhancing the differentiation of individual features. A feature integration projection (FIP) module further refines masked token reconstruction using the geospatially clustered patches. We validated PIEViT across multiple downstream tasks, including object detection, semantic segmentation, and change detection. Experiments demonstrated that PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines. It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for RS image interpretation tasks.
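The abstract does not say how the GPC module clusters patches or how the FIP module uses those clusters. The sketch below only illustrates the general idea of grouping similar patch embeddings and using the group means as integrated features. The k-means-style grouping, the cosine similarity measure, and the helper names `cluster_patches` and `integrate` are all assumptions made for illustration, not the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_patches(patches, k, iters=10):
    """Hypothetical helper: k-means-style grouping of patch embeddings.

    Assumption: similarity is cosine, and centroids start from the
    first k patches so the result is deterministic.
    """
    centroids = [list(p) for p in patches[:k]]
    labels = [0] * len(patches)
    for _ in range(iters):
        # Assign each patch to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(p, centroids[c]))
                  for p in patches]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, lab in zip(patches, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

def integrate(labels, centroids):
    """Hypothetical helper: represent each patch by its cluster centroid."""
    return [centroids[lab] for lab in labels]

# Toy 2-D "patch embeddings": two patches point mostly along each axis.
patches = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels, centroids = cluster_patches(patches, k=2)
targets = integrate(labels, centroids)
print(labels)
```

In this toy setting, patches that share a cluster receive the same integrated feature. A real pipeline would operate on vision transformer patch tokens rather than hand-written vectors.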
About the journal:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.