Learning Without Forgetting for Vision-Language Models

Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
{"title":"Learning Without Forgetting for Vision-Language Models","authors":"Da-Wei Zhou;Yuanhan Zhang;Yan Wang;Jingyi Ning;Han-Jia Ye;De-Chuan Zhan;Ziwei Liu","doi":"10.1109/TPAMI.2025.3540889","DOIUrl":null,"url":null,"abstract":"Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on <italic>visual</i> information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of <italic>textual</i> information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: <bold>1)</b> how to adapt the model without forgetting and <bold>2)</b> how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (<sc><b>Proof</b></small>) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded, and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture better task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets with various continual learning scenarios and various VLMs validate that <sc>Proof</small> achieves state-of-the-art performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"4489-4504"},"PeriodicalIF":18.6000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10882940/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Class-Incremental Learning (CIL), or continual learning, is a capability desired in real-world applications: it requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLMs) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained on new classes, VLMs often suffer catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting, and 2) how to make full use of multi-modal information. To this end, we propose PROjectiOn Fusion (Proof), which enables VLMs to learn without forgetting. To handle the first challenge, we train task-specific projections on top of frozen image/text encoders. When a new task arrives, new projections are appended while former projections are kept frozen, alleviating the forgetting of old concepts. For the second challenge, we propose a fusion module to better utilize cross-modality information. By jointly adjusting visual and textual features, the model captures task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets, covering various continual learning scenarios and various VLMs, validate that Proof achieves state-of-the-art performance.
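The abstract describes two mechanisms: expandable task-specific projections over frozen encoders, and a cross-modal fusion step that jointly adjusts visual and textual features. The PyTorch snippet below is a minimal sketch of how these pieces could be wired together, not the authors' released implementation: the class names (ExpandableProjections, ProofLikeHead), the summation over per-task projections, and the single cross-attention layer used as the fusion module are all illustrative assumptions.

```python
# Minimal sketch of expandable projections plus a fusion step over frozen
# vision/language encoders. Illustrative only: names, dimensions, and the
# fusion choice are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class ExpandableProjections(nn.Module):
    """One linear projection per task; old projections stay frozen."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.embed_dim = embed_dim
        self.projections = nn.ModuleList()  # call add_task() before forward()

    def add_task(self) -> None:
        # Freeze every previously learned projection so old concepts are
        # preserved, then append a fresh trainable one for the new task.
        for proj in self.projections:
            proj.requires_grad_(False)
        self.projections.append(nn.Linear(self.embed_dim, self.embed_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Summation is one simple way to aggregate the per-task outputs.
        return torch.stack([proj(z) for proj in self.projections]).sum(dim=0)


class ProofLikeHead(nn.Module):
    """Projections + fusion on top of frozen image/text encoders."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int = 512):
        super().__init__()
        # The backbones stay frozen; only projections and fusion train.
        self.image_encoder = image_encoder.eval().requires_grad_(False)
        self.text_encoder = text_encoder.eval().requires_grad_(False)
        self.img_proj = ExpandableProjections(embed_dim)
        self.txt_proj = ExpandableProjections(embed_dim)
        # A single cross-attention layer stands in for the fusion module.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads=8,
                                            batch_first=True)

    def new_task(self) -> None:
        self.img_proj.add_task()
        self.txt_proj.add_task()

    def forward(self, images: torch.Tensor,
                class_tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            img = self.image_encoder(images)       # (B, D) image features
            txt = self.text_encoder(class_tokens)  # (C, D) class prototypes
        img = self.img_proj(img)
        txt = self.txt_proj(txt)
        # Jointly adjust modalities: each image embedding attends over the
        # class-text embeddings, and the result refines it residually.
        q = img.unsqueeze(1)                                # (B, 1, D)
        kv = txt.unsqueeze(0).expand(img.size(0), -1, -1)   # (B, C, D)
        fused, _ = self.fusion(q, kv, kv)
        img = img + fused.squeeze(1)
        # Cosine-similarity logits between image and class-text features.
        img = nn.functional.normalize(img, dim=-1)
        txt = nn.functional.normalize(txt, dim=-1)
        return img @ txt.t()                                # (B, C) logits
```

In a continual training loop, new_task() would be called at the start of each new task, so that only the newest projections (and the fusion layer) receive gradients while everything learned on earlier tasks stays fixed.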