Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.

Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S Sekhon, Lawrence Staib, James S Duncan
{"title":"Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.","authors":"Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S Sekhon, Lawrence Staib, James S Duncan","doi":"10.1109/cvpr52733.2024.02470","DOIUrl":null,"url":null,"abstract":"<p><p>Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features - patterns that correlate with the target in training data, but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that we can identify such features, does not provide definitive assurance for real-world applications. As a piloting study, this work focuses on exploring mitigating the reliance on spurious features for CLIP without using any group annotation. To this end, we systematically study the existence of spurious correlation on CLIP and CLIP+ERM. We first, following recent work on Deep Feature Reweighting (DFR), verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of them, we advocate a lightweight representation calibration method for fine-tuning CLIP, by first generating a calibration set using the pretrained CLIP, and then calibrating representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance and significantly boosting the model generalization. Our codes will be available in here.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2024 ","pages":"26140-26150"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11620289/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/cvpr52733.2024.02470","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/16 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Fine-tuning pre-trained vision-language models such as CLIP has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly, and the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features, i.e., patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating reliance on spurious features, largely built on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work explores mitigating CLIP's reliance on spurious features without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness of pretrained CLIP. In view of these findings, we advocate a lightweight representation-calibration method for fine-tuning CLIP: we first generate a calibration set using the pretrained CLIP, and then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, substantially reducing reliance on spurious features and significantly improving model generalization. Our code will be available here.
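The abstract builds on a DFR-style observation: with the CLIP backbone frozen, retraining only the last linear layer on a held-out split can already improve group robustness. The sketch below illustrates that baseline step only, under assumed choices (ViT-B/32 backbone, logistic regression as the retrained head, a toy random stand-in for the held-out set); it is not the authors' exact recipe, and it does not implement their calibration-set or contrastive-learning procedure.

```python
# Minimal sketch: DFR-style last-layer retraining on frozen CLIP image features.
# Assumptions: OpenAI `clip` package, ViT-B/32 backbone, logistic regression head,
# toy random data in place of a real held-out split.
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # frozen backbone
model.eval()

@torch.no_grad()
def encode(images: torch.Tensor) -> torch.Tensor:
    """Embed a batch of preprocessed images with the frozen CLIP image encoder."""
    feats = model.encode_image(images.to(device))
    return (feats / feats.norm(dim=-1, keepdim=True)).float().cpu()  # L2-normalize

# Toy stand-in for a small held-out split; in practice these would be real
# (preprocess-ed) images and downstream-task labels.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,))

X = encode(images).numpy()
y = labels.numpy()

# DFR-style step: keep the backbone frozen and retrain only a linear head.
# Group-balanced sampling or reweighting of this split is where group-robustness
# gains typically come from; it is omitted here for brevity.
head = LogisticRegression(max_iter=1000, C=1.0).fit(X, y)
print("held-out accuracy:", head.score(X, y))
```

The paper's proposed method goes further: it builds a calibration set with the pretrained CLIP itself and refines the representations of that set via contrastive learning, so that no group labels are required; the sketch above only shows the last-layer-retraining baseline the abstract references.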
