Robust Deep Object Tracking against Adversarial Attacks
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang
Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attacks and defenses mainly focus on single images, few efforts have been made to mount temporal attacks against video sequences. Because the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well on deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unknown target trackers to achieve transferable attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of a video, improving attack effectiveness at minor computational cost. On the defense side, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly, with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attack.
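As a rough illustration of the frame-by-frame perturbation idea, the sketch below runs a projected gradient-ascent attack per frame and carries each frame's perturbation over to the next one. The tracker_loss callable, the L_inf budget, and the carry-over scheme are illustrative assumptions, not the authors' exact motion-aware objective.

    import torch

    def temporal_attack(frames, tracker_loss, epsilon=8 / 255, alpha=2 / 255, steps=5):
        # frames: list of (3, H, W) tensors in [0, 1]; tracker_loss is a hypothetical
        # callable mapping a perturbed frame to a scalar whose increase degrades tracking.
        delta = torch.zeros_like(frames[0])  # perturbation carried across frames
        adv_frames = []
        for x in frames:
            for _ in range(steps):
                delta = delta.detach().requires_grad_(True)
                loss = tracker_loss((x + delta).clamp(0, 1))
                grad, = torch.autograd.grad(loss, delta)
                # Gradient ascent on the tracking loss, projected onto an L_inf ball.
                delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
            adv_frames.append((x + delta).detach().clamp(0, 1))
        return adv_frames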
{"title":"Robust Deep Object Tracking against Adversarial Attacks","authors":"Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang","doi":"10.1007/s11263-024-02226-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02226-0","url":null,"abstract":"<p>Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly reside in a single image, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame-by-frame. For the white-box attack, we generate temporal perturbations via known trackers to degrade significantly the tracking performance. We transfer the generated perturbations into unknown targeted trackers for the black-box attack to achieve transferring attacks. Furthermore, we train universal adversarial perturbations and directly add them into all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Breaking the Limits of Reliable Prediction via Generated Data
Pub Date: 2024-09-20 | DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
In open-world recognition for safety-critical applications, providing reliable predictions with deep neural networks has become a critical requirement. Many methods have been proposed for reliability-related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective ways to improve reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of recent breakthroughs in generative models, we investigate whether and how expanding the training set with generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a sufficiently large volume of generated data, can sometimes achieve performance competitive with a ViT pre-trained on a substantial real dataset.
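The training-set expansion itself can be sketched in a few lines, assuming PyTorch-style real_train and generated_train datasets that share a label space (the names are placeholders, not the paper's code):

    from torch.utils.data import ConcatDataset, DataLoader

    def build_expanded_loader(real_train, generated_train, batch_size=128):
        # Concatenate real and generated samples; the paper's observation is that
        # scaling up the generated portion reduces overfitting in reliable prediction.
        expanded = ConcatDataset([real_train, generated_train])
        return DataLoader(expanded, batch_size=batch_size, shuffle=True, num_workers=4)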
{"title":"Breaking the Limits of Reliable Prediction via Generated Data","authors":"Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu","doi":"10.1007/s11263-024-02221-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02221-5","url":null,"abstract":"<p>In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"18 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation, as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects to be localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting; FastComposer therefore proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
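Delayed subject conditioning can be pictured as switching the conditioning embedding partway through denoising. The sketch below assumes a diffusers-style UNet/scheduler interface and a precomputed subject-augmented prompt embedding; it only illustrates the timing, not FastComposer's implementation.

    import torch

    @torch.no_grad()
    def delayed_subject_conditioning(unet, scheduler, latents, text_emb, aug_text_emb,
                                     switch_ratio=0.2):
        # text_emb: plain prompt embedding; aug_text_emb: prompt embedding whose subject
        # tokens are replaced by image-encoder subject embeddings (assumed precomputed).
        # Assumes scheduler.set_timesteps(...) has already been called.
        timesteps = scheduler.timesteps
        switch_point = int(len(timesteps) * switch_ratio)
        for i, t in enumerate(timesteps):
            # Early steps: text-only conditioning preserves layout editability;
            # later steps: subject-augmented conditioning injects identity.
            cond = text_emb if i < switch_point else aug_text_emb
            noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        return latents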
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02227-z","url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>(times )</span>–2500<span>(times )</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lidar Panoptic Segmentation in an Open World
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep
Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed in the real open world; in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered unknown w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior works train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point-segment classification, akin to a Region Proposal Network (Ren et al., NeurIPS 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
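The final step, cutting a scored hierarchy of point segments, can be sketched as a simple recursion over the tree: keep a node as one segment if its (hypothetical) objectness score beats the best cut of its children. This illustrates the idea only; the paper's cut operates on a weighted tree produced by hierarchical grouping.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SegmentNode:
        # A node in the hierarchical over-segmentation; `score` is a stand-in for the
        # binary "is a single object" confidence of this segment.
        score: float
        children: List["SegmentNode"] = field(default_factory=list)

    def best_cut(node: SegmentNode) -> Tuple[float, List[SegmentNode]]:
        # Either keep this node as one segment, or recurse and take the best cut
        # of its children, whichever scores higher.
        if not node.children:
            return node.score, [node]
        child_total, child_segments = 0.0, []
        for child in node.children:
            score, segments = best_cut(child)
            child_total += score
            child_segments += segments
        if node.score >= child_total:
            return node.score, [node]
        return child_total, child_segments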
{"title":"Lidar Panoptic Segmentation in an Open World","authors":"Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep","doi":"10.1007/s11263-024-02166-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02166-9","url":null,"abstract":"<p>Addressing Lidar Panoptic Segmentation (<i>LPS</i>) is crucial for safe deployment of autnomous vehicles. <i>LPS</i> aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including <span>thing</span> classes of countable objects (e.g., pedestrians and vehicles) and <span>stuff</span> classes of amorphous regions (e.g., vegetation and road). Importantly, <i>LPS</i> requires segmenting individual <span>thing</span> instances (<i>e.g</i>., every single vehicle). Current <i>LPS</i> methods make an unrealistic assumption that the semantic class vocabulary is <i>fixed</i> in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of <i>novel</i> classes that are considered to be unknowns w.r.t. thepre-defined class vocabulary. To address this unrealistic assumption, we study <i>LPS</i> in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of <span>thing</span> and <span>stuff</span> classes can appear. This experimental setting leads to interesting conclusions. While prior art train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (<i>i.e</i>., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"13 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Active Learning for Low-Altitude Drone-View Object Detection
Pub Date: 2024-09-15 | DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan
Various object detection techniques are employed on drone platforms. However, annotating drone-view samples is both time-consuming and laborious, primarily because drone-view images contain numerous small-sized instances to be labeled. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image informativeness. The point-level module ascertains the valid count and locations of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the four hierarchical levels effectively improve the evaluation of sample informativeness.
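A compact sketch of the hierarchical filtering idea follows, with assumed data structures (per-box class, confidence, and a consistency score standing in for the point/box-level indicators); it is not the authors' implementation.

    from collections import defaultdict

    def hald_style_select(unlabeled_preds, labeled_class_counts, budget, score_thresh=0.5):
        # unlabeled_preds: dict image_id -> list of (class_id, score, consistency).
        # labeled_class_counts: dict class_id -> number of already-labeled instances.
        image_scores = {}
        image_classes = defaultdict(set)
        for img_id, boxes in unlabeled_preds.items():
            # Box level: keep reliable predictions only.
            valid = [(c, s, cons) for c, s, cons in boxes if s >= score_thresh]
            if not valid:
                continue
            # Image level: average consistency of the valid boxes.
            image_scores[img_id] = sum(cons for _, _, cons in valid) / len(valid)
            image_classes[img_id] = {c for c, _, _ in valid}
        # Class level: boost images containing classes rarely seen in the labeled pool.
        def rarity(img_id):
            return sum(1.0 / (1 + labeled_class_counts.get(c, 0)) for c in image_classes[img_id])
        ranked = sorted(image_scores, key=lambda i: image_scores[i] + rarity(i), reverse=True)
        return ranked[:budget]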
{"title":"Hierarchical Active Learning for Low-Altitude Drone-View Object Detection","authors":"Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan","doi":"10.1007/s11263-024-02228-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02228-y","url":null,"abstract":"<p>Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final selected samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In Search of Lost Online Test-Time Adaptation: A Survey
Pub Date: 2024-09-15 | DOI: 10.1007/s11263-024-02213-5
Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang
This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes variations gathered from different search engines as well as synthetic data generated by diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall-clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.
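Measuring the efficiency side of such a benchmark can be sketched as timing one adaptation step and reading the peak GPU memory; adapt_fn is a placeholder for any OTTA update, a CUDA device is assumed, and GFLOPs would need a separate profiler (e.g., fvcore), which is omitted here.

    import time
    import torch

    def measure_adaptation_step(adapt_fn, batch):
        # Returns predictions, wall-clock seconds, and peak GPU memory (MB) for one step.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        preds = adapt_fn(batch)          # hypothetical: adapts the model and predicts
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        peak_mem_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
        return preds, elapsed, peak_mem_mb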
{"title":"In Search of Lost Online Test-Time Adaptation: A Survey","authors":"Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang","doi":"10.1007/s11263-024-02213-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02213-5","url":null,"abstract":"<p>This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes a variety of variations from different search engines and synthesized data generated through diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation
Pub Date: 2024-09-05 | DOI: 10.1007/s11263-024-02224-2
Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han
Contrastive Language-Image Pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce reliance on pixel-level human annotations, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) into high-quality pseudo masks, but it heavily relies on inductive biases such as hand-crafted priors and digital image processing methods. Building on the vision-language pre-trained model CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) insufficient detail owing to the 1/16 down-sampling resolution of ViT. Thus, we propose WeakCLIP to address these problems and leverage the pre-trained knowledge of CLIP for WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representations. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and a text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
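The text-to-pixel matching idea can be illustrated by a cosine-similarity map between CLIP-like text embeddings and dense visual features; this is a generic sketch, not WeakCLIP's co-attention module or pyramid adapter.

    import torch
    import torch.nn.functional as F

    def text_to_pixel_similarity(pixel_feats, text_feats):
        # pixel_feats: (B, C, H, W) dense features from a CLIP-like image encoder.
        # text_feats:  (K, C) class text embeddings from the CLIP text encoder.
        # Returns a (B, K, H, W) cosine-similarity map usable as coarse pseudo-mask evidence.
        B, C, H, W = pixel_feats.shape
        pix = F.normalize(pixel_feats.flatten(2), dim=1)      # (B, C, H*W)
        txt = F.normalize(text_feats, dim=1)                  # (K, C)
        sim = torch.einsum("kc,bcn->bkn", txt, pix)           # (B, K, H*W)
        return sim.view(B, -1, H, W)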
{"title":"WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation","authors":"Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han","doi":"10.1007/s11263-024-02224-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02224-2","url":null,"abstract":"<p>Contrastive language and image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce the reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) and produce high-quality pseudo masks. Weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) as pseudo masks, but heavily relies on inductive biases like hand-crafted priors and digital image processing methods. For the vision-language pre-trained model, i.e. CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) lacking text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) the insufficient details owning to the <span>(frac{1}{16})</span> down-sampling resolution of ViT. Thus, we propose WeakCLIP to address the problems and leverage the pre-trained knowledge from CLIP to WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representation. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the <i>val</i> set of PASCAL VOC 2012 and 46.1% mIoU on the <i>val</i> set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continual Face Forgery Detection via Historical Distribution Preserving
Pub Date: 2024-09-04 | DOI: 10.1007/s11263-024-02160-1
Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.
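A minimal sketch of combining a UAP-simulated historical sample with knowledge distillation from the frozen previous model follows; the loss weighting, temperature, and logit layout are assumptions, not the paper's exact HDP objective.

    import torch
    import torch.nn.functional as F

    def hdp_style_loss(student, teacher, x, y, uap, alpha=1.0, temperature=2.0):
        # uap: fixed universal adversarial perturbation standing in for the historical
        # forgery distribution; teacher: frozen model from the previous task.
        x_hist = (x + uap).clamp(0, 1)            # pseudo samples from the old distribution
        ce = F.cross_entropy(student(x), y)       # learn the new forgery attack
        with torch.no_grad():
            t_logits = teacher(x_hist)
        s_logits = student(x_hist)
        # KL distillation keeps the student's behaviour on historical-like data close
        # to the teacher's, mitigating forgetting of earlier forgery types.
        kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                      F.softmax(t_logits / temperature, dim=1),
                      reduction="batchmean") * temperature ** 2
        return ce + alpha * kd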
{"title":"Continual Face Forgery Detection via Historical Distribution Preserving","authors":"Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji","doi":"10.1007/s11263-024-02160-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02160-1","url":null,"abstract":"<p>Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142131051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Fuzzy Positive Learning for Annotation-Scarce Semantic Segmentation
Pub Date: 2024-09-02 | DOI: 10.1007/s11263-024-02217-1
Pengchong Qiao, Yu Wang, Chang Liu, Lei Shang, Baigui Sun, Zhennan Wang, Xiawu Zheng, Rongrong Ji, Jie Chen
Annotation-scarce semantic segmentation aims to obtain meaningful pixel-level discrimination with scarce or even no manual annotations, and the crux is how to utilize unlabeled data via pseudo-label learning. Typical works focus on ameliorating error-prone pseudo-labeling, e.g., only utilizing high-confidence pseudo labels and filtering low-confidence ones out. We think differently and instead exploit the informative semantics carried by multiple probably correct candidate labels, which enables our method to learn more accurately even when pseudo labels are unreliable. In this paper, we propose Adaptive Fuzzy Positive Learning (A-FPL) for correctly learning from unlabeled data in a plug-and-play fashion, adaptively encouraging fuzzy positive predictions and suppressing highly probable negatives. Specifically, A-FPL comprises two main components: (1) fuzzy positive assignment (FPA), which adaptively assigns fuzzy positive labels to each pixel while ensuring their quality through a T-value adaptation algorithm; and (2) fuzzy positive regularization (FPR), which restricts the predictions of fuzzy positive categories to be larger than those of negative categories. Being conceptually simple yet practically effective, A-FPL remarkably alleviates interference from wrong pseudo labels, progressively refining semantic discrimination. Theoretical analysis and extensive experiments on various training settings with consistent performance gains justify the superiority of our approach. Codes are at A-FPL.
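The FPR constraint (fuzzy positive scores should exceed negative ones) can be sketched as a per-pixel hinge penalty; the margin form and the mask layout are illustrative assumptions rather than A-FPL's exact loss or T-value adaptation.

    import torch

    def fuzzy_positive_regularization(logits, fuzzy_pos_mask, margin=0.0):
        # logits:         (N, K) per-pixel class logits.
        # fuzzy_pos_mask: (N, K) boolean mask of fuzzy positive categories per pixel.
        pos_min = logits.masked_fill(~fuzzy_pos_mask, float("inf")).min(dim=1).values
        neg_max = logits.masked_fill(fuzzy_pos_mask, float("-inf")).max(dim=1).values
        # Penalize pixels where the strongest negative rivals the weakest fuzzy positive.
        return torch.relu(neg_max - pos_min + margin).mean()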
{"title":"Adaptive Fuzzy Positive Learning for Annotation-Scarce Semantic Segmentation","authors":"Pengchong Qiao, Yu Wang, Chang Liu, Lei Shang, Baigui Sun, Zhennan Wang, Xiawu Zheng, Rongrong Ji, Jie Chen","doi":"10.1007/s11263-024-02217-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02217-1","url":null,"abstract":"<p>Annotation-scarce semantic segmentation aims to obtain meaningful pixel-level discrimination with scarce or even no manual annotations, of which the crux is how to utilize unlabeled data by pseudo-label learning. Typical works focus on ameliorating the error-prone pseudo-labeling, e.g., only utilizing high-confidence pseudo labels and filtering low-confidence ones out. But we think differently and resort to exhausting informative semantics from multiple probably correct candidate labels. This brings our method the ability to learn more accurately even though pseudo labels are unreliable. In this paper, we propose Adaptive Fuzzy Positive Learning (A-FPL) for correctly learning unlabeled data in a plug-and-play fashion, targeting adaptively encouraging fuzzy positive predictions and suppressing highly probable negatives. Specifically, A-FPL comprises two main components: (1) Fuzzy positive assignment (FPA) that adaptively assigns fuzzy positive labels to each pixel, while ensuring their quality through a T-value adaption algorithm (2) Fuzzy positive regularization (FPR) that restricts the predictions of fuzzy positive categories to be larger than those of negative categories. Being conceptually simple yet practically effective, A-FPL remarkably alleviates interference from wrong pseudo labels, progressively refining semantic discrimination. Theoretical analysis and extensive experiments on various training settings with consistent performance gain justify the superiority of our approach. Codes are at A-FPL.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142123587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
Pub Date: 2024-08-31 | DOI: 10.1007/s11263-024-02218-0
Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. Contrary to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of PTM to prototype features can beat state-of-the-art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of PTM and adapted models for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, which holds the advantages of PTM’s generalizability and adapted model’s adaptivity. (3) Additionally, considering previous ImageNet-based benchmarks are unsuitable in the era of PTM due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.
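The SimpleCIL baseline described in point (1) can be sketched as building a prototype (class-mean) classifier on top of a frozen encoder and classifying by cosine similarity; the encoder and loader interfaces are assumptions, not the released code.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_prototype_classifier(encoder, loader, num_classes, feat_dim, device="cpu"):
        # The classifier weight for each class is simply the mean frozen-PTM embedding
        # of that class's training samples.
        sums = torch.zeros(num_classes, feat_dim, device=device)
        counts = torch.zeros(num_classes, device=device)
        for images, labels in loader:
            labels = labels.to(device)
            feats = encoder(images.to(device))          # (B, feat_dim), frozen PTM
            sums.index_add_(0, labels, feats)
            counts.index_add_(0, labels, torch.ones_like(labels, dtype=sums.dtype))
        prototypes = sums / counts.clamp(min=1).unsqueeze(1)
        return F.normalize(prototypes, dim=1)

    @torch.no_grad()
    def classify(encoder, images, prototypes):
        # Nearest-prototype prediction via cosine similarity.
        feats = F.normalize(encoder(images), dim=1)
        return (feats @ prototypes.t()).argmax(dim=1)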
{"title":"Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need","authors":"Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu","doi":"10.1007/s11263-024-02218-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02218-0","url":null,"abstract":"<p>Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. Contrary to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of PTM to prototype features can beat state-of-the-art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (<span>Aper</span>), which aggregates the embeddings of PTM and adapted models for classifier construction. <span>Aper </span>is a general framework that can be orthogonally combined with any parameter-efficient tuning method, which holds the advantages of PTM’s generalizability and adapted model’s adaptivity. (3) Additionally, considering previous ImageNet-based benchmarks are unsuitable in the era of PTM due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of <span>Aper </span>with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}