RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP

Impact factor: 3.5; Zone 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence); Computer Vision and Image Understanding; Pub Date: 2025-02-01; Epub Date: 2024-12-10; DOI: 10.1016/j.cviu.2024.104254

Ankit Jha, Mainak Singha, Avigyan Bhattacharya, Biplab Banerjee

Computer Vision and Image Understanding, volume 251, Article 104254. Full text: https://www.sciencedirect.com/science/article/pii/S1077314224003357

Citations: 0

Abstract

Domain and class generalization remain significant challenges in remote sensing (RS). Large-scale pre-trained vision-language models (VLMs) such as CLIP have recently shown impressive zero-shot and few-shot generalization, a result of extensive contrastive training. Existing literature emphasizes prompt learning as a way to enrich prompts with both domain and content information, particularly through small learnable projectors, which helps address multi-domain data. However, CLIP's vision encoder is observed to generalize poorly to puzzled or corrupted RS images. In response, we propose a novel solution that uses self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS$^{3}$Lip, trained with small projectors of only a few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS$^{3}$Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, with improvements of at least 1.3% in domain and class generalization tasks.
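The abstract does not give the loss formulas, so the following is only an illustrative sketch of the two ideas it names: a "puzzled" image and a consistency term tied to textual features. Everything in it is an assumption made for illustration: "puzzled" is read as a jigsaw-style shuffle of image tiles, the consistency term is a KL divergence between CLIP-style class probabilities of the clean and puzzled views, and the names `puzzle_image`, `class_probs`, and `consistency_loss` are hypothetical. Random arrays stand in for encoder outputs; this is not the authors' implementation.

```python
import numpy as np


def puzzle_image(image, grid, rng):
    """Split an (H, W, C) image into grid x grid tiles and shuffle them.

    Assumes H and W are divisible by `grid` (hypothetical helper).
    """
    h, w, _ = image.shape
    th, tw = h // grid, w // grid
    # Cut tiles in row-major order.
    tiles = [image[r * th:(r + 1) * th, q * tw:(q + 1) * tw]
             for r in range(grid) for q in range(grid)]
    perm = rng.permutation(grid * grid)
    # Reassemble the tiles in the shuffled order.
    rows = [np.concatenate([tiles[perm[r * grid + q]] for q in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def class_probs(visual, text, temperature=0.07):
    """Softmax over cosine similarities between visual and text features."""
    logits = l2_normalize(visual) @ l2_normalize(text).T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


def consistency_loss(clean_feats, puzzled_feats, text_feats):
    """Mean KL(p_clean || p_puzzled) over the batch (an assumed stand-in loss)."""
    p = class_probs(clean_feats, text_feats)
    q = class_probs(puzzled_feats, text_feats)
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
puzzled, perm = puzzle_image(image, grid=2, rng=rng)

# Random stand-ins for encoder outputs: 4 images, 5 class prompts, 16 dimensions.
clean_feats = rng.normal(size=(4, 16))
puzzled_feats = rng.normal(size=(4, 16))
text_feats = rng.normal(size=(5, 16))
loss = consistency_loss(clean_feats, puzzled_feats, text_feats)
```

In this sketch the loss is zero when the clean and puzzled views produce identical features and grows as their predicted class distributions diverge, which is the behavior a consistency objective is meant to penalize.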

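Source journal

Computer Vision and Image Understanding (category: Engineering and Technology; Engineering, Electrical and Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days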

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems