{"title":"基于基础模型的光谱空间变换器用于高光谱图像分类","authors":"Lingbo Huang;Yushi Chen;Xin He","doi":"10.1109/TGRS.2024.3456129","DOIUrl":null,"url":null,"abstract":"Recently, deep learning models have dominated hyperspectral image (HSI) classification. Nowadays, deep learning is undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and language foundation model (LFM), for HSI classification are investigated. First, to improve the performance of traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a given pretrained transformer receives HSI patch tokens for long-range feature extraction benefiting from the prelearned weights. Meanwhile, two enhancement modules, i.e., spatial and spectral enhancement modules (SpaEMs\n<inline-formula> <tex-math>$\\backslash $ </tex-math></inline-formula>\nSpeEMs), utilize spectral and spatial information for steering the behavior of the transformer. Besides, an additional patch relationship distillation strategy is designed for SS-VFMT to exploit the pretrained knowledge better, leading to the proposed SS-VFMT-D. Second, based on SS-VFMT, to address a new HSI classification task, i.e., generalized zero-shot classification, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed. This task is to recognize novel classes not seen during training, which is more meaningful as the real world is usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets reveal that the proposed methods are competitive compared to the state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially for HSI zero-shot classification.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":null,"pages":null},"PeriodicalIF":7.5000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Foundation Model-Based Spectral–Spatial Transformer for Hyperspectral Image Classification\",\"authors\":\"Lingbo Huang;Yushi Chen;Xin He\",\"doi\":\"10.1109/TGRS.2024.3456129\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, deep learning models have dominated hyperspectral image (HSI) classification. Nowadays, deep learning is undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and language foundation model (LFM), for HSI classification are investigated. First, to improve the performance of traditional HSI classification tasks, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral-spatial information into the pretrained foundation transformer. Specifically, a given pretrained transformer receives HSI patch tokens for long-range feature extraction benefiting from the prelearned weights. 
Meanwhile, two enhancement modules, i.e., spatial and spectral enhancement modules (SpaEMs\\n<inline-formula> <tex-math>$\\\\backslash $ </tex-math></inline-formula>\\nSpeEMs), utilize spectral and spatial information for steering the behavior of the transformer. Besides, an additional patch relationship distillation strategy is designed for SS-VFMT to exploit the pretrained knowledge better, leading to the proposed SS-VFMT-D. Second, based on SS-VFMT, to address a new HSI classification task, i.e., generalized zero-shot classification, a spectral-spatial vision-language-based transformer (SS-VLFMT) is proposed. This task is to recognize novel classes not seen during training, which is more meaningful as the real world is usually open. The SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets reveal that the proposed methods are competitive compared to the state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially for HSI zero-shot classification.\",\"PeriodicalId\":13213,\"journal\":{\"name\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10677405/\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10677405/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Foundation Model-Based Spectral–Spatial Transformer for Hyperspectral Image Classification
Recently, deep learning models have dominated hyperspectral image (HSI) classification, and the field is now undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including the vision foundation model (VFM) and the language foundation model (LFM), for HSI classification is investigated. First, to improve performance on the traditional HSI classification task, a spectral–spatial VFM-based transformer (SS-VFMT) is proposed, which inserts spectral–spatial information into the pretrained foundation transformer. Specifically, the pretrained transformer receives HSI patch tokens and performs long-range feature extraction, benefiting from its prelearned weights. Meanwhile, two enhancement modules, i.e., the spatial and spectral enhancement modules (SpaEMs/SpeEMs), utilize spatial and spectral information, respectively, to steer the behavior of the transformer. In addition, a patch-relationship distillation strategy is designed for SS-VFMT to better exploit the pretrained knowledge, leading to the proposed SS-VFMT-D. Second, to address a new HSI classification task, i.e., generalized zero-shot classification, a spectral–spatial vision-language-based transformer (SS-VLFMT) is proposed on top of SS-VFMT. This task requires recognizing novel classes not seen during training, which is meaningful because real-world scenes are usually open. SS-VLFMT leverages SS-VFMT to extract spectral–spatial features and corresponding hash codes, while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets reveal that the proposed methods are competitive with state-of-the-art methods. Moreover, the foundation model-based methods open a new window for HSI classification tasks, especially HSI zero-shot classification.
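To make the described pipeline concrete, below is a minimal PyTorch sketch of the core mechanism: frozen, pretrained transformer blocks process HSI patch tokens while lightweight enhancement modules inject spatial and spectral cues between blocks. All module names, tensor shapes, and hyperparameters here are illustrative assumptions rather than the authors' implementation; the patch-relationship distillation and hash-code branch are omitted.

```python
# Illustrative sketch only: frozen "pretrained" blocks steered by SpaEM/SpeEM-style
# enhancement modules. Names, shapes, and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn


class EnhancementModule(nn.Module):
    """Hypothetical SpaEM/SpeEM: a bottleneck MLP that adds a correction
    computed from auxiliary (spatial or spectral) embeddings to the tokens."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, tokens: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # tokens, aux: (batch, n_tokens, dim)
        return tokens + self.net(aux)


class SSVFMTSketch(nn.Module):
    """Pretrained-style transformer blocks (kept frozen) plus trainable enhancement modules."""

    def __init__(self, dim: int = 192, depth: int = 4, heads: int = 3, n_classes: int = 16):
        super().__init__()
        # Stand-in for a vision foundation model encoder; in practice these blocks
        # would be loaded from pretrained weights rather than randomly initialized.
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True) for _ in range(depth)]
        )
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.spa_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.spe_ems = nn.ModuleList([EnhancementModule(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patch_tokens, spa_emb, spe_emb):
        x = patch_tokens
        for blk, spa, spe in zip(self.blocks, self.spa_ems, self.spe_ems):
            x = spa(x, spa_emb)  # inject spatial information
            x = spe(x, spe_emb)  # inject spectral information
            x = blk(x)           # frozen pretrained attention block
        return self.head(x.mean(dim=1))  # logits for the center-pixel patch


if __name__ == "__main__":
    B, N, D = 2, 49, 192  # e.g., a 7x7 spatial neighborhood -> 49 patch tokens (assumed)
    model = SSVFMTSketch()
    logits = model(torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, N, D))
    print(logits.shape)  # torch.Size([2, 16])
```

For the generalized zero-shot setting, a generic vision-language matching step (again an assumption, in the spirit of CLIP-style matching rather than the exact SS-VLFMT design) compares the extracted spectral–spatial features against text embeddings of the class names:

```python
import torch
import torch.nn.functional as F


def zero_shot_predict(img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """img_feat: (B, D) spectral-spatial features; text_feat: (C, D) class-name
    embeddings from a pretrained language model. Returns predicted class indices."""
    sims = F.normalize(img_feat, dim=-1) @ F.normalize(text_feat, dim=-1).T
    return sims.argmax(dim=-1)
```

Because prediction reduces to matching against text embeddings, classes unseen during training can still be recognized as long as their names can be embedded by the language model.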
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.