Domain Prompt Learning for Efficiently Adapting CLIP to Unseen Domains

Transactions of The Japanese Society for Artificial Intelligence, Pub Date: 2023-11-01, DOI: 10.1527/tjsai.38-6_b-mc2

Xin Zhang, Shixiang Shane Gu, Yutaka Matsuo, Yusuke Iwasawa

Citations: 12

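Abstract

Domain generalization (DG) is a difficult transfer learning problem that aims to learn a model that generalizes to unseen domains. Recent foundation models (FMs) are robust to many distribution shifts and should therefore substantially improve DG performance. In this work, we study generic ways to adapt contrastive language-image pre-training (CLIP), a vision-language foundation model, to DG problems in image classification. On standard DG benchmarks, empirical risk minimization (ERM) gains substantial accuracy from larger backbones and training datasets, but fine-tuning FMs is impractical in many real-world situations. We propose Domain Prompt Learning (DPL), a novel approach that performs domain inference in the form of conditional prompt generation. DPL achieves a significant accuracy improvement while training only a lightweight prompt generator (a three-layer MLP) whose parameter count is comparable to that of the classification projector used in previous DG work. Combining DPL with CLIP yields surprisingly strong performance, raising the accuracy of zero-shot CLIP from 73.7% to 79.3% on several standard datasets, namely PACS, VLCS, OfficeHome, and TerraIncognita. We hope the simplicity and success of our approach will lead to broader adoption and analysis of foundation models in the domain generalization field.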
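The abstract describes DPL only at a high level: a lightweight three-layer MLP turns inferred domain information into a conditional prompt for a frozen CLIP model. The NumPy sketch below illustrates that data flow under assumptions that are ours, not the paper's: the domain is summarized by averaging a mini-batch of image embeddings assumed to come from one domain, the MLP emits `M` context vectors that are prepended to class-name token embeddings (in the style of CoOp prompt learning), and the frozen CLIP encoders are replaced by hypothetical random stand-ins (`class_name_tokens`, `text_proj`). All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper):
D = 512       # shared embedding width of the frozen image/text encoders
M = 4         # number of generated prompt (context) tokens
K = 7         # number of classes, e.g. the 7 PACS categories
L_NAME = 3    # tokens per (padded) class name, illustrative only


def init_mlp(sizes):
    """Random weights for an MLP; sizes = [in, h1, h2, out] gives three layers."""
    return [(rng.normal(0.0, 0.02, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Forward pass: ReLU between layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# The only trainable component in this sketch: the prompt generator.
generator = init_mlp([D, D, D, M * D])

# Hypothetical stand-ins for frozen CLIP parts (random, for shape-checking only).
class_name_tokens = rng.normal(size=(K, L_NAME, D))  # class-name token embeddings
text_proj = rng.normal(0.0, 0.02, (D, D))            # toy text encoder: mean-pool + projection


def domain_prompt(image_feats):
    """Assumed domain inference: average the batch, then generate M context tokens."""
    domain_summary = image_feats.mean(axis=0)             # (D,)
    return mlp(generator, domain_summary).reshape(M, D)   # (M, D)


def class_text_features(context):
    """Prepend the shared generated context to every class name and encode it."""
    ctx = np.broadcast_to(context, (K, M, D))
    tokens = np.concatenate([ctx, class_name_tokens], axis=1)  # (K, M + L_NAME, D)
    return l2_normalize(tokens.mean(axis=1) @ text_proj)        # (K, D)


def classify(image_feats, logit_scale=100.0):
    """CLIP-style scoring: cosine similarity to domain-conditioned class prompts."""
    text = class_text_features(domain_prompt(image_feats))
    logits = logit_scale * l2_normalize(image_feats) @ text.T  # (B, K)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


batch = rng.normal(size=(16, D))  # pretend CLIP image embeddings from one unseen domain
probs = classify(batch)
print(probs.shape)                # (16, 7)
```

In this sketch the image and text stand-ins stay fixed, so adapting to a new domain only changes what `generator` outputs; training would update the generator's weights alone, for example with cross-entropy on labeled source-domain batches.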

Source journal: Transactions of The Japanese Society for Artificial Intelligence (Computer Science - Artificial Intelligence)

CiteScore: 0.40

Self-citation rate: 0.00%

Articles published: