Learning Without Forgetting for Vision-Language Models

Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
{"title":"Learning Without Forgetting for Vision-Language Models","authors":"Da-Wei Zhou;Yuanhan Zhang;Yan Wang;Jingyi Ning;Han-Jia Ye;De-Chuan Zhan;Ziwei Liu","doi":"10.1109/TPAMI.2025.3540889","DOIUrl":null,"url":null,"abstract":"Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on <italic>visual</i> information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of <italic>textual</i> information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: <bold>1)</b> how to adapt the model without forgetting and <bold>2)</b> how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (<sc><b>Proof</b></small>) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded, and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture better task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets with various continual learning scenarios and various VLMs validate that <sc>Proof</small> achieves state-of-the-art performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"4489-4504"},"PeriodicalIF":18.6000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10882940/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Class-Incremental Learning (CIL), or continual learning, is a capability desired in real-world applications: it requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLMs) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained on new classes, VLMs often suffer catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting, and 2) how to make full use of multi-modal information. To this end, we propose PROjectiOn Fusion (Proof), which enables VLMs to learn without forgetting. To handle the first challenge, we train task-specific projections on top of frozen image/text encoders. When a new task arrives, new projections are appended while former projections are kept frozen, alleviating the forgetting of old concepts. For the second challenge, we propose a fusion module to better utilize cross-modality information. By jointly adjusting visual and textual features, the model captures task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets, covering various continual learning scenarios and various VLMs, validate that Proof achieves state-of-the-art performance.
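The abstract describes two mechanisms: expandable task-specific projections over frozen encoders, and a cross-modal fusion step that jointly adjusts visual and textual features. The PyTorch snippet below is a minimal sketch of how these pieces could be wired together, not the authors' released implementation: the class names (ExpandableProjections, ProofLikeHead), the summation over per-task projections, and the single cross-attention layer used as the fusion module are all illustrative assumptions.

```python
# Minimal sketch of expandable projections plus a fusion step over frozen
# vision/language encoders. Illustrative only: names, dimensions, and the
# fusion choice are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class ExpandableProjections(nn.Module):
    """One linear projection per task; old projections stay frozen."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.embed_dim = embed_dim
        self.projections = nn.ModuleList()  # call add_task() before forward()

    def add_task(self) -> None:
        # Freeze every previously learned projection so old concepts are
        # preserved, then append a fresh trainable one for the new task.
        for proj in self.projections:
            proj.requires_grad_(False)
        self.projections.append(nn.Linear(self.embed_dim, self.embed_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Summation is one simple way to aggregate the per-task outputs.
        return torch.stack([proj(z) for proj in self.projections]).sum(dim=0)


class ProofLikeHead(nn.Module):
    """Projections + fusion on top of frozen image/text encoders."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int = 512):
        super().__init__()
        # The backbones stay frozen; only projections and fusion train.
        self.image_encoder = image_encoder.eval().requires_grad_(False)
        self.text_encoder = text_encoder.eval().requires_grad_(False)
        self.img_proj = ExpandableProjections(embed_dim)
        self.txt_proj = ExpandableProjections(embed_dim)
        # A single cross-attention layer stands in for the fusion module.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads=8,
                                            batch_first=True)

    def new_task(self) -> None:
        self.img_proj.add_task()
        self.txt_proj.add_task()

    def forward(self, images: torch.Tensor,
                class_tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            img = self.image_encoder(images)       # (B, D) image features
            txt = self.text_encoder(class_tokens)  # (C, D) class prototypes
        img = self.img_proj(img)
        txt = self.txt_proj(txt)
        # Jointly adjust modalities: each image embedding attends over the
        # class-text embeddings, and the result refines it residually.
        q = img.unsqueeze(1)                                # (B, 1, D)
        kv = txt.unsqueeze(0).expand(img.size(0), -1, -1)   # (B, C, D)
        fused, _ = self.fusion(q, kv, kv)
        img = img + fused.squeeze(1)
        # Cosine-similarity logits between image and class-text features.
        img = nn.functional.normalize(img, dim=-1)
        txt = nn.functional.normalize(txt, dim=-1)
        return img @ txt.t()                                # (B, C) logits
```

In a continual training loop, new_task() would be called at the start of each new task, so that only the newest projections (and the fusion layer) receive gradients while everything learned on earlier tasks stays fixed.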