{"title":"Foundation Model-Based Spectral–Spatial Transformer for Hyperspectral Image Classification","authors":"Lingbo Huang;Yushi Chen;Xin He","doi":"10.1109/TGRS.2024.3456129","DOIUrl":null,"url":null,"abstract":"Recently, deep learning models have dominated hyperspectral image (HSI) classification. Nowadays, deep learning is undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and language foundation model (LFM), for HSI classification are investigated. First, to improve the performance of traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a given pretrained transformer receives HSI patch tokens for long-range feature extraction benefiting from the prelearned weights. Meanwhile, two enhancement modules, i.e., spatial and spectral enhancement modules (SpaEMs\n<inline-formula> <tex-math>$\\backslash $ </tex-math></inline-formula>\nSpeEMs), utilize spectral and spatial information for steering the behavior of the transformer. Besides, an additional patch relationship distillation strategy is designed for SS-VFMT to exploit the pretrained knowledge better, leading to the proposed SS-VFMT-D. Second, based on SS-VFMT, to address a new HSI classification task, i.e., generalized zero-shot classification, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed. This task is to recognize novel classes not seen during training, which is more meaningful as the real world is usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets reveal that the proposed methods are competitive compared to the state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially for HSI zero-shot classification.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-25"},"PeriodicalIF":8.6000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10677405/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Recently, deep learning models have dominated hyperspectral image (HSI) classification. Deep learning is now undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and the language foundation model (LFM), for HSI classification is investigated. First, to improve performance on traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a pretrained transformer receives HSI patch tokens and performs long-range feature extraction, benefiting from its prelearned weights. Meanwhile, two enhancement modules, i.e., the spatial and spectral enhancement modules (SpaEMs/SpeEMs), use spatial and spectral information to steer the behavior of the transformer. In addition, a patch relationship distillation strategy is designed for SS-VFMT to better exploit the pretrained knowledge, leading to the proposed SS-VFMT-D. Second, building on SS-VFMT, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed to address a new HSI classification task, generalized zero-shot classification. This task requires recognizing novel classes not seen during training, which is more practically meaningful because real-world scenes are usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets show that the proposed methods are competitive with state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially HSI zero-shot classification.
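To make the SS-VFMT design described in the abstract more concrete, below is a minimal PyTorch sketch of the general pattern: HSI patch tokens pass through a (nominally pretrained, frozen) transformer backbone, while lightweight spatial and spectral enhancement modules inject spectral-spatial cues between blocks. This is not the authors' implementation; the backbone stand-in, the cue construction, the module shapes, and all names (e.g., EnhancementModule, SSVFMTSketch) are illustrative assumptions.

```python
# Minimal sketch of the SS-VFMT idea: frozen transformer blocks (standing in for a
# pretrained vision foundation model) process HSI patch tokens, while small
# spatial/spectral enhancement modules (SpaEM/SpeEM) modulate the tokens between
# blocks. All design choices below are assumptions for illustration only.
import torch
import torch.nn as nn


class EnhancementModule(nn.Module):
    """Hypothetical SpaEM/SpeEM: a bottleneck MLP whose output, conditioned on a
    spectral or spatial cue, is added residually to the token sequence."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, tokens: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); cue: (B, dim) broadcast over all tokens
        return tokens + self.net(tokens + cue.unsqueeze(1))


class SSVFMTSketch(nn.Module):
    def __init__(self, bands: int = 200, patch: int = 7, dim: int = 768,
                 depth: int = 4, num_classes: int = 16):
        super().__init__()
        # Stand-in for a pretrained ViT backbone; in the paper the weights come
        # from a vision foundation model and are kept frozen.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        ])
        for p in self.blocks.parameters():
            p.requires_grad = False

        self.token_embed = nn.Linear(bands, dim)           # per-pixel spectrum -> token
        self.spectral_cue = nn.Linear(bands, dim)          # cue from the center-pixel spectrum
        self.spatial_cue = nn.Linear(patch * patch, dim)   # cue from the band-averaged spatial layout
        self.spa_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.spe_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, cube: torch.Tensor) -> torch.Tensor:
        # cube: (B, bands, patch, patch) hyperspectral patch around the target pixel
        b, c, h, w = cube.shape
        tokens = self.token_embed(cube.flatten(2).transpose(1, 2))   # (B, h*w, dim)
        spe = self.spectral_cue(cube[:, :, h // 2, w // 2])          # spectral cue
        spa = self.spatial_cue(cube.mean(dim=1).flatten(1))          # spatial cue
        for blk, spa_em, spe_em in zip(self.blocks, self.spa_ems, self.spe_ems):
            tokens = blk(tokens)               # frozen, pretrained-style block
            tokens = spa_em(tokens, spa)       # steer with spatial information
            tokens = spe_em(tokens, spe)       # steer with spectral information
        return self.head(tokens.mean(dim=1))


if __name__ == "__main__":
    model = SSVFMTSketch()
    logits = model(torch.randn(2, 200, 7, 7))
    print(logits.shape)  # torch.Size([2, 16])
```

In the paper, only lightweight components such as the enhancement modules (and, for SS-VFMT-D, a patch relationship distillation loss) adapt the frozen foundation weights to HSI data; the sketch above illustrates just the module-insertion pattern, not the distillation or the vision-language zero-shot branch.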
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.