
International Journal of Computer Vision: Latest Publications

Robust Deep Object Tracking against Adversarial Attacks
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-26 · DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang

Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly focus on single images, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unknown target trackers to achieve transfer attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.
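To make the universal-perturbation idea above concrete, the following is a minimal NumPy sketch of adding one shared, budget-clipped perturbation to every frame of a clip; the L-infinity budget `epsilon`, the array shapes, and the random perturbation are illustrative assumptions, not the authors' released attack.

```python
import numpy as np

def apply_universal_perturbation(frames: np.ndarray,
                                 delta: np.ndarray,
                                 epsilon: float = 8.0 / 255.0) -> np.ndarray:
    """Add one shared perturbation to every frame of a video.

    frames: float array in [0, 1] with shape (T, H, W, C)
    delta:  perturbation with shape (H, W, C), clipped to the L-inf budget
    """
    delta = np.clip(delta, -epsilon, epsilon)   # enforce the attack budget
    perturbed = frames + delta[None, ...]       # broadcast over the T frames
    return np.clip(perturbed, 0.0, 1.0)         # keep a valid pixel range

# Usage: a 10-frame dummy clip and a random perturbation within the budget.
video = np.random.rand(10, 128, 128, 3).astype(np.float32)
uap = np.random.uniform(-8 / 255, 8 / 255, size=(128, 128, 3)).astype(np.float32)
adv_video = apply_universal_perturbation(video, uap)
```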

Citations: 0
Breaking the Limits of Reliable Prediction via Generated Data
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-20 · DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu

In open-world recognition for safety-critical applications, providing reliable predictions from deep neural networks has become a critical requirement. Many methods have been proposed for reliable-prediction tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set with generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive with a ViT pre-trained on a substantial real dataset.
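Since the paper targets reliable-prediction tasks such as confidence calibration, a standard way to score calibration is the expected calibration error (ECE); the sketch below is a generic ECE computation in NumPy for context, with dummy softmax outputs as placeholders, and is not code from the paper.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               predictions: np.ndarray,
                               labels: np.ndarray,
                               n_bins: int = 15) -> float:
    """ECE: weighted gap between confidence and accuracy over equal-width bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = (predictions[mask] == labels[mask]).mean()
        conf = confidences[mask].mean()
        ece += mask.mean() * abs(acc - conf)
    return float(ece)

# Usage with dummy model outputs over 10 classes.
probs = np.random.dirichlet(np.ones(10), size=1000)
labels = np.random.randint(0, 10, size=1000)
print(expected_calibration_error(probs.max(1), probs.argmax(1), labels))
```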

Citations: 0
FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-19 · DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han

Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).
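The delayed subject conditioning described above can be pictured as a timestep-dependent switch between the plain prompt embedding and the subject-augmented one; in the hedged sketch below, the function name, the `alpha` threshold, and the embedding shapes are assumptions rather than FastComposer's actual API.

```python
import torch

def choose_conditioning(t: int,
                        num_steps: int,
                        text_emb: torch.Tensor,
                        subject_emb: torch.Tensor,
                        alpha: float = 0.6) -> torch.Tensor:
    """Use the plain text embedding for early (high-noise) steps to keep the layout
    editable, then switch to the subject-augmented embedding to lock identity."""
    # Denoising is assumed to run from t = num_steps - 1 down to 0.
    if t > int(alpha * num_steps):
        return text_emb      # early steps: generic prompt conditioning
    return subject_emb       # late steps: subject-augmented conditioning

# Usage with dummy embeddings of shape (batch, tokens, dim).
text = torch.randn(1, 77, 768)
subject = torch.randn(1, 77, 768)
cond = choose_conditioning(t=40, num_steps=50, text_emb=text, subject_emb=subject)
```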

Citations: 0
Lidar Panoptic Segmentation in an Open World
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-19 · DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep

Addressing Lidar Panoptic Segmentation (LPS) is crucial for safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
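As a rough illustration of class-agnostic bottom-up grouping followed by a cut in a hierarchical tree, the sketch below uses SciPy's agglomerative clustering on raw point coordinates; the single-linkage choice and the fixed cut distance are assumptions and do not reproduce the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_point_segments(points: np.ndarray, cut_distance: float = 0.5) -> np.ndarray:
    """Group 3D points bottom-up and cut the resulting tree at a fixed distance.

    points: (N, 3) array of lidar coordinates
    returns: (N,) integer segment ids
    """
    tree = linkage(points, method="single")   # class-agnostic bottom-up grouping
    return fcluster(tree, t=cut_distance, criterion="distance")

# Usage: two well-separated dummy blobs should yield two segments.
pts = np.vstack([np.random.randn(50, 3) * 0.05,
                 np.random.randn(50, 3) * 0.05 + 5.0])
print(np.unique(hierarchical_point_segments(pts)))
```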

Citations: 0
Hierarchical Active Learning for Low-Altitude Drone-View Object Detection
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-15 · DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan

Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious, primarily because drone-view images contain numerous small-sized instances to be labeled. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.
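Active-learning pipelines of this kind ultimately score unlabeled images and pick a budgeted subset for annotation; the sketch below shows only a generic entropy-based image scoring and selection step, not HALD's four-level hierarchy, and the detection-score format is an assumption.

```python
import numpy as np

def image_uncertainty(box_scores) -> float:
    """Mean binary entropy of a detector's box confidences in one image."""
    if len(box_scores) == 0:
        return 0.0
    s = np.clip(np.concatenate(box_scores), 1e-6, 1 - 1e-6)
    entropy = -(s * np.log(s) + (1 - s) * np.log(1 - s))
    return float(entropy.mean())

def select_for_annotation(all_scores, budget: int):
    """Pick the `budget` most uncertain images for labeling."""
    ranked = sorted(all_scores, key=lambda k: image_uncertainty(all_scores[k]), reverse=True)
    return ranked[:budget]

# Usage with dummy per-image box confidences.
scores = {f"img_{i}": [np.random.rand(np.random.randint(1, 20))] for i in range(100)}
print(select_for_annotation(scores, budget=5))
```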

Citations: 0
In Search of Lost Online Test-Time Adaptation: A Survey
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-15 · DOI: 10.1007/s11263-024-02213-5
Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang

This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes a variety of variations from different search engines and synthesized data generated through diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.
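The efficiency metrics discussed above (wall-clock time and GPU memory) can be recorded around each adaptation step with standard PyTorch utilities, as in the generic sketch below; the `adapt_step` callable and the dummy batches are placeholders, and GFLOPs counting would require a separate profiler.

```python
import time
import torch

def timed_adaptation_loop(adapt_step, data_loader, device="cuda"):
    """Run online test-time adaptation while logging latency and peak GPU memory."""
    wall_times, peak_mem = [], []
    for batch in data_loader:
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats(device)
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        adapt_step(batch)                     # one model update + prediction
        if torch.cuda.is_available():
            torch.cuda.synchronize(device)
            peak_mem.append(torch.cuda.max_memory_allocated(device) / 2**20)  # MiB
        wall_times.append(time.perf_counter() - start)
    return sum(wall_times), peak_mem

# Usage: a dummy adaptation step over random batches.
loader = [torch.randn(1, 3, 224, 224) for _ in range(8)]
total_s, mem = timed_adaptation_loop(lambda b: b.mean().item(), loader)
print(f"total wall-clock: {total_s:.3f}s")
```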

Citations: 0
WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-05 · DOI: 10.1007/s11263-024-02224-2
Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han

Contrastive language and image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce the reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) into high-quality pseudo masks, but it heavily relies on inductive biases like hand-crafted priors and digital image processing methods. For the vision-language pre-trained model, i.e., CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) insufficient details owing to the 1/16 down-sampling resolution of ViT. Thus, we propose WeakCLIP to address these problems and leverage the pre-trained knowledge from CLIP for WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representation. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
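The text-to-pixel matching paradigm can be illustrated by comparing per-class text embeddings with dense pixel features via cosine similarity; in the sketch below, the feature shapes and temperature are assumptions, and this is a schematic of the general idea rather than WeakCLIP's module.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_logits(pixel_feats: torch.Tensor,
                         text_embs: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Cosine similarity between C class text embeddings and every pixel feature.

    pixel_feats: (B, D, H, W) dense visual features
    text_embs:   (C, D) one embedding per class name
    returns:     (B, C, H, W) per-pixel class logits
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embs = F.normalize(text_embs, dim=1)
    logits = torch.einsum("bdhw,cd->bchw", pixel_feats, text_embs)
    return logits / temperature

# Usage with dummy CLIP-like features.
feats = torch.randn(2, 512, 32, 32)
texts = torch.randn(21, 512)          # e.g., 20 PASCAL VOC classes + background
masks = text_to_pixel_logits(feats, texts).argmax(dim=1)   # coarse pseudo masks
```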

Citations: 0
Continual Face Forgery Detection via Historical Distribution Preserving
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-04 · DOI: 10.1007/s11263-024-02160-1
Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.
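The knowledge-distillation component of such a framework typically keeps the new detector's outputs on real faces close to the previous model's, for instance with a temperature-scaled KL term as in the generic sketch below; the temperature and the two-class setup are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=1)
    teacher_probs = F.softmax(teacher_logits / t, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Usage with dummy real/fake logits from the previous and current detectors.
teacher = torch.randn(16, 2)
student = torch.randn(16, 2, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
```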

Citations: 0
Adaptive Fuzzy Positive Learning for Annotation-Scarce Semantic Segmentation
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-02 · DOI: 10.1007/s11263-024-02217-1
Pengchong Qiao, Yu Wang, Chang Liu, Lei Shang, Baigui Sun, Zhennan Wang, Xiawu Zheng, Rongrong Ji, Jie Chen

Annotation-scarce semantic segmentation aims to obtain meaningful pixel-level discrimination with scarce or even no manual annotations, and its crux is how to utilize unlabeled data through pseudo-label learning. Typical works focus on ameliorating the error-prone pseudo-labeling, e.g., only utilizing high-confidence pseudo labels and filtering low-confidence ones out. We instead take a different view and exhaustively mine informative semantics from multiple probably correct candidate labels. This gives our method the ability to learn more accurately even when pseudo labels are unreliable. In this paper, we propose Adaptive Fuzzy Positive Learning (A-FPL) for correctly learning unlabeled data in a plug-and-play fashion, adaptively encouraging fuzzy positive predictions and suppressing highly probable negatives. Specifically, A-FPL comprises two main components: (1) fuzzy positive assignment (FPA), which adaptively assigns fuzzy positive labels to each pixel while ensuring their quality through a T-value adaptation algorithm; and (2) fuzzy positive regularization (FPR), which restricts the predictions of fuzzy positive categories to be larger than those of negative categories. Being conceptually simple yet practically effective, A-FPL remarkably alleviates interference from wrong pseudo labels, progressively refining semantic discrimination. Theoretical analysis and extensive experiments on various training settings with consistent performance gains justify the superiority of our approach. Code is available at A-FPL.
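The spirit of fuzzy positive learning, encouraging several plausible candidate classes instead of a single hard pseudo label, can be sketched as a top-k positive loss; the choice of k, the flattened-pixel layout, and the loss form below are illustrative assumptions rather than the A-FPL formulation.

```python
import torch

def fuzzy_positive_loss(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Encourage the summed probability of the top-k candidate classes per pixel.

    logits: (N, C) per-pixel class scores (pixels flattened into N rows)
    """
    probs = torch.softmax(logits, dim=1)
    topk_probs, _ = probs.topk(k, dim=1)            # fuzzy positive candidates
    positive_mass = topk_probs.sum(dim=1).clamp(min=1e-6)
    return -torch.log(positive_mass).mean()         # push fuzzy-positive mass up

# Usage with dummy unlabeled-pixel logits (21 classes).
logits = torch.randn(4096, 21, requires_grad=True)
loss = fuzzy_positive_loss(logits)
loss.backward()
```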

Citations: 0
Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
IF 19.5 · Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-08-31 · DOI: 10.1007/s11263-024-02218-0
Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu

Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. In contrast to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that a frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of the PTM to prototype features can beat the state of the art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, the PTM can be further endowed with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of the PTM and adapted models for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, combining the advantages of the PTM's generalizability and the adapted model's adaptivity. (3) Additionally, considering that previous ImageNet-based benchmarks are unsuitable in the era of PTMs due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.
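The SimpleCIL baseline described above amounts to a nearest-class-mean classifier built on frozen pre-trained embeddings; the sketch below shows that idea in isolation, assuming features have already been extracted by the frozen backbone, and is a paraphrase of the described baseline rather than the released code.

```python
import torch
import torch.nn.functional as F

class PrototypeClassifier:
    """Nearest-class-mean classifier on frozen backbone features (cosine similarity)."""

    def __init__(self):
        self.prototypes = {}                      # class id -> mean feature

    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        """Add prototypes for the classes seen in a new incremental task."""
        for c in labels.unique().tolist():
            self.prototypes[c] = feats[labels == c].mean(dim=0)

    def predict(self, feats: torch.Tensor) -> torch.Tensor:
        classes = sorted(self.prototypes)
        protos = torch.stack([self.prototypes[c] for c in classes])
        sims = F.normalize(feats, dim=1) @ F.normalize(protos, dim=1).T
        return torch.tensor(classes)[sims.argmax(dim=1)]

# Usage with dummy 768-d features for two incremental tasks.
clf = PrototypeClassifier()
clf.update(torch.randn(100, 768), torch.randint(0, 5, (100,)))    # task 1: classes 0-4
clf.update(torch.randn(100, 768), torch.randint(5, 10, (100,)))   # task 2: classes 5-9
print(clf.predict(torch.randn(8, 768)))
```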

Citations: 0