Swalpa Kumar Roy, Ali Jamali, Jocelyn Chanussot, Pedram Ghamisi, Ebrahim Ghaderpour, Himan Shahabi
Title: SimPoolFormer: A two-stream vision transformer for hyperspectral image classification
Journal: Remote Sensing Applications-Society and Environment, volume 37, Article 101478 (impact factor 3.8; JCR Q2, Environmental Sciences)
Publication date: 2025-01-01
DOI: 10.1016/j.rsase.2025.101478
URL: https://www.sciencedirect.com/science/article/pii/S235293852500031X
Abstract
The ability of vision transformers (ViTs) to accurately model global dependencies has transformed vision research. However, ViTs have drawbacks, including high computational cost, dependence on large labeled datasets, and a restricted capacity to capture essential local features, which has motivated efforts to create more effective alternatives. Vision multilayer perceptron (MLP) architectures, on the other hand, have shown excellent capability in image classification tasks, performing on par with or even better than the widely used state-of-the-art ViTs and convolutional neural networks (CNNs). Vision MLPs have linear computational complexity, require less training data, and can capture long-range dependencies through mechanisms similar to those of transformers at much lower computational cost. In this paper, a novel deep learning architecture, SimPoolFormer, is developed to address these shortcomings of vision transformers. SimPoolFormer is a two-stream attention-in-attention vision transformer architecture based on two computationally efficient networks. The architecture replaces the computationally intensive multi-headed self-attention in ViT with SimPool for efficiency, while ResMLP is adopted in a second stream to enhance hyperspectral image (HSI) classification, leveraging its linear attention-based design. Results show that SimPoolFormer is significantly superior to several other deep learning models, including 1D-CNN, 2D-CNN, RNN, VGG-16, EfficientNet, ResNet-50, and ViT, on three complex HSI datasets: QUH-Tangdaowan, QUH-Qingyun, and QUH-Pingan. For example, on the QUH-Qingyun dataset, SimPoolFormer improved average accuracy over 2D-CNN, VGG-16, EfficientNet, ViT, ResNet-50, RNN, and 1D-CNN by 0.98%, 3.81%, 4.16%, 7.94%, 9.45%, 12.25%, and 13.95%, respectively.
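The abstract reports its gains in terms of average accuracy but does not define the metric. In HSI classification, average accuracy (AA) conventionally means the unweighted mean of the per-class accuracies, as opposed to overall accuracy (OA), which is the fraction of all test samples classified correctly. The sketch below assumes that conventional definition applies here. The function names and the toy confusion matrix are illustrative only and are not taken from the paper.

```python
from typing import List, Sequence


def per_class_accuracy(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Per-class accuracy (recall): correct predictions / true samples of the class.

    Rows of `confusion` are true classes, columns are predicted classes.
    """
    accuracies = []
    for k, row in enumerate(confusion):
        total = sum(row)
        # Skip classes that have no test samples instead of dividing by zero.
        if total > 0:
            accuracies.append(row[k] / total)
    return accuracies


def average_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """AA: unweighted mean of the per-class accuracies (assumed definition)."""
    acc = per_class_accuracy(confusion)
    if not acc:
        raise ValueError("confusion matrix contains no samples")
    return sum(acc) / len(acc)


def overall_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """OA: correctly classified samples divided by all samples."""
    correct = sum(row[k] for k, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    return correct / total


# Made-up 3-class confusion matrix with one small, poorly classified class.
toy_confusion = [
    [90, 10, 0],
    [5, 5, 0],
    [0, 0, 10],
]

if __name__ == "__main__":
    print(f"AA = {average_accuracy(toy_confusion):.3f}")
    print(f"OA = {overall_accuracy(toy_confusion):.3f}")
```

Because AA weights every class equally, a small class that is classified poorly lowers AA more than OA. This is one reason AA is commonly reported for HSI datasets with imbalanced land-cover classes.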
About the journal:
The journal "Remote Sensing Applications: Society and Environment" (RSASE) focuses on remote sensing studies that address specific topics with an emphasis on environmental and societal issues, in particular regional and local studies with global significance. Submissions are encouraged to take an interdisciplinary approach. Subjects include, but are not limited to:
- Global and climate change studies addressing the impact of increasing concentrations of greenhouse gases, CO2 emissions, carbon balance and carbon mitigation, and energy systems on social and environmental systems
- Ecological and environmental issues including biodiversity, ecosystem dynamics, land degradation, atmospheric and water pollution, urban footprint, ecosystem management and natural hazards (e.g. earthquakes, typhoons, floods, landslides)
- Natural resource studies including land use in general, biomass estimation, forests, agricultural land, plantations, soils, coral reefs, wetlands and water resources
- Agriculture, food production systems and food security outcomes
- Socio-economic issues including urban systems, urban growth, public health, epidemics, land-use transitions and land-use conflicts
- Oceanography and coastal zone studies, including sea level rise projections, coastline changes and the ocean-land interface
- Regional challenges for remote sensing application techniques, monitoring and analysis, such as cloud screening and atmospheric correction for tropical regions
- Interdisciplinary studies combining remote sensing, household survey data, field measurements and models to address environmental, societal and sustainability issues
- Quantitative and qualitative analysis that documents the impact of using remote sensing studies in social, political, environmental or economic systems