{"title":"Foundation Model-Based Spectral–Spatial Transformer for Hyperspectral Image Classification","authors":"Lingbo Huang;Yushi Chen;Xin He","doi":"10.1109/TGRS.2024.3456129","DOIUrl":null,"url":null,"abstract":"Recently, deep learning models have dominated hyperspectral image (HSI) classification. Nowadays, deep learning is undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and language foundation model (LFM), for HSI classification are investigated. First, to improve the performance of traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a given pretrained transformer receives HSI patch tokens for long-range feature extraction benefiting from the prelearned weights. Meanwhile, two enhancement modules, i.e., spatial and spectral enhancement modules (SpaEMs\n<inline-formula> <tex-math>$\\backslash $ </tex-math></inline-formula>\nSpeEMs), utilize spectral and spatial information for steering the behavior of the transformer. Besides, an additional patch relationship distillation strategy is designed for SS-VFMT to exploit the pretrained knowledge better, leading to the proposed SS-VFMT-D. Second, based on SS-VFMT, to address a new HSI classification task, i.e., generalized zero-shot classification, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed. This task is to recognize novel classes not seen during training, which is more meaningful as the real world is usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets reveal that the proposed methods are competitive compared to the state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially for HSI zero-shot classification.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-25"},"PeriodicalIF":8.6000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10677405/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Recently, deep learning models have dominated hyperspectral image (HSI) classification. Deep learning is now undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and the language foundation model (LFM), for HSI classification is investigated. First, to improve performance on traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a pretrained transformer receives HSI patch tokens and performs long-range feature extraction, benefiting from its prelearned weights. Meanwhile, two enhancement modules, i.e., the spatial and spectral enhancement modules (SpaEMs/SpeEMs), use spatial and spectral information to steer the behavior of the transformer. In addition, a patch relationship distillation strategy is designed for SS-VFMT to better exploit the pretrained knowledge, leading to the proposed SS-VFMT-D. Second, building on SS-VFMT, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed to address a new HSI classification task, generalized zero-shot classification. This task requires recognizing novel classes not seen during training, which is more practically meaningful because real-world scenes are usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets show that the proposed methods are competitive with state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially HSI zero-shot classification.
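To make the SS-VFMT design described in the abstract more concrete, below is a minimal PyTorch sketch of the general pattern: HSI patch tokens pass through a (nominally pretrained, frozen) transformer backbone, while lightweight spatial and spectral enhancement modules inject spectral-spatial cues between blocks. This is not the authors' implementation; the backbone stand-in, the cue construction, the module shapes, and all names (e.g., EnhancementModule, SSVFMTSketch) are illustrative assumptions.

```python
# Minimal sketch of the SS-VFMT idea: frozen transformer blocks (standing in for a
# pretrained vision foundation model) process HSI patch tokens, while small
# spatial/spectral enhancement modules (SpaEM/SpeEM) modulate the tokens between
# blocks. All design choices below are assumptions for illustration only.
import torch
import torch.nn as nn


class EnhancementModule(nn.Module):
    """Hypothetical SpaEM/SpeEM: a bottleneck MLP whose output, conditioned on a
    spectral or spatial cue, is added residually to the token sequence."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, tokens: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); cue: (B, dim) broadcast over all tokens
        return tokens + self.net(tokens + cue.unsqueeze(1))


class SSVFMTSketch(nn.Module):
    def __init__(self, bands: int = 200, patch: int = 7, dim: int = 768,
                 depth: int = 4, num_classes: int = 16):
        super().__init__()
        # Stand-in for a pretrained ViT backbone; in the paper the weights come
        # from a vision foundation model and are kept frozen.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        ])
        for p in self.blocks.parameters():
            p.requires_grad = False

        self.token_embed = nn.Linear(bands, dim)           # per-pixel spectrum -> token
        self.spectral_cue = nn.Linear(bands, dim)          # cue from the center-pixel spectrum
        self.spatial_cue = nn.Linear(patch * patch, dim)   # cue from the band-averaged spatial layout
        self.spa_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.spe_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, cube: torch.Tensor) -> torch.Tensor:
        # cube: (B, bands, patch, patch) hyperspectral patch around the target pixel
        b, c, h, w = cube.shape
        tokens = self.token_embed(cube.flatten(2).transpose(1, 2))   # (B, h*w, dim)
        spe = self.spectral_cue(cube[:, :, h // 2, w // 2])          # spectral cue
        spa = self.spatial_cue(cube.mean(dim=1).flatten(1))          # spatial cue
        for blk, spa_em, spe_em in zip(self.blocks, self.spa_ems, self.spe_ems):
            tokens = blk(tokens)               # frozen, pretrained-style block
            tokens = spa_em(tokens, spa)       # steer with spatial information
            tokens = spe_em(tokens, spe)       # steer with spectral information
        return self.head(tokens.mean(dim=1))


if __name__ == "__main__":
    model = SSVFMTSketch()
    logits = model(torch.randn(2, 200, 7, 7))
    print(logits.shape)  # torch.Size([2, 16])
```

In the paper, only lightweight components such as the enhancement modules (and, for SS-VFMT-D, a patch relationship distillation loss) adapt the frozen foundation weights to HSI data; the sketch above illustrates just the module-insertion pattern, not the distillation or the vision-language zero-shot branch.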
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.