{"title":"利用视觉语言基础模型进行开放集领域适应性调整","authors":"Qing Yu , Go Irie , Kiyoharu Aizawa","doi":"10.1016/j.cviu.2024.104230","DOIUrl":null,"url":null,"abstract":"<div><div>Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104230"},"PeriodicalIF":4.3000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Open-set domain adaptation with visual-language foundation models\",\"authors\":\"Qing Yu , Go Irie , Kiyoharu Aizawa\",\"doi\":\"10.1016/j.cviu.2024.104230\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. 
The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"250 \",\"pages\":\"Article 104230\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-11-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314224003114\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224003114","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Unsupervised domain adaptation (UDA) has proven highly effective in transferring knowledge from a source domain with labeled data to a target domain with unlabeled data. Because the target domain lacks labels and may contain unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution for identifying these classes during the training phase. Although existing ODA approaches aim to resolve the distribution shift between the source and target domains, most methods fine-tune ImageNet pre-trained models on the labeled source domain and then adapt them to the target domain. Recent visual-language foundation models (VLFMs), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and should therefore substantially improve ODA performance. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We first investigate the performance of zero-shot prediction with CLIP, and then propose an entropy optimization strategy that uses the outputs of CLIP to assist ODA models. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.
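As a rough illustration of the two ingredients the abstract mentions, the sketch below runs CLIP zero-shot prediction over a set of known (source) class names and computes the entropy of the resulting class posterior, a common open-set signal: a low-entropy prediction suggests a confident known-class sample, while a high-entropy one can flag a potential unknown-class sample. This is a minimal sketch using the official OpenAI clip package, not the authors' actual method; the class names and image path are hypothetical placeholders.

import torch
import clip  # pip install git+https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical known (source-domain) classes; real benchmarks would supply these.
class_names = ["bicycle", "bus", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Hypothetical target-domain image.
image = preprocess(Image.open("target_image.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # L2-normalize so the dot product is a cosine similarity.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Scale by CLIP's learned temperature and softmax over known classes.
    logits = model.logit_scale.exp() * image_feat @ text_feat.t()
    probs = logits.softmax(dim=-1)

# Entropy of the zero-shot posterior as an open-set score.
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
print(probs.cpu().numpy(), entropy.item())

In an entropy-optimization scheme of the kind the abstract describes, a score like this could be thresholded or used as a training signal to separate known-class target samples from unknown ones; the thresholding details here are assumptions, not taken from the paper.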
Journal overview:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems