Pub Date: 2024-10-03 | DOI: 10.1007/s11263-024-02244-y
Xianzhu Liu, Haozhe Xie, Shengping Zhang, Hongxun Yao, Rongrong Ji, Liqiang Nie, Dacheng Tao
Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions containing multiple objects close to each other, especially objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which comprises Semantic-guided Fusion (SGF) and a Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the geometric information lost to missing values in depth images, and thus remains more robust to unreliable depth input. VGSP exploits the mutual benefit between the SC and SSC tasks, making SSC focus on predicting the categories of voxels with high occupancy probabilities while allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. Models and code are available at https://github.com/aipixel/SG-SSC.
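The semantic-guided fusion step can be pictured as a per-pixel weighted blend of the two feature streams. Below is a minimal numpy sketch of that idea, not the paper's actual SGF module; the function name, the inputs, and the way segmentation confidence gates the depth weight are all illustrative assumptions.

```python
import numpy as np

def semantic_guided_fusion(rgb_feat, depth_feat, seg_conf, depth_valid):
    """Fuse RGB and depth features per pixel (hypothetical sketch).

    Where depth is missing, fall back to RGB features; elsewhere the
    2D semantic confidence modulates how strongly depth is trusted.
    rgb_feat, depth_feat: (H, W, C); seg_conf, depth_valid: (H, W) in [0, 1].
    """
    w = (seg_conf * depth_valid)[..., None]   # per-pixel depth weight
    return w * depth_feat + (1.0 - w) * rgb_feat

# Toy demonstration with one simulated depth hole at pixel (0, 0).
H, W, C = 4, 4, 8
rgb = np.ones((H, W, C))
depth = 2.0 * np.ones((H, W, C))
conf = np.full((H, W), 0.5)
valid = np.ones((H, W))
valid[0, 0] = 0.0                             # missing depth value
fused = semantic_guided_fusion(rgb, depth, conf, valid)
```

At the hole the fused feature equals the RGB feature, which is the robustness-to-missing-depth behaviour the abstract describes.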
{"title":"2D Semantic-Guided Semantic Scene Completion","authors":"Xianzhu Liu, Haozhe Xie, Shengping Zhang, Hongxun Yao, Rongrong Ji, Liqiang Nie, Dacheng Tao","doi":"10.1007/s11263-024-02244-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02244-y","url":null,"abstract":"<p>Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which involves Semantic-guided Fusion (SGF) and Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the missing geometric information caused by the missing values in depth images, thus performing more robustly to unreliable depth information. VGSP exploits the mutual benefit between SC and SSC tasks, making SSC more focused on predicting the categories of voxels with high occupancy probabilities and also allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. 
Models and code are available at https://github.com/aipixel/SG-SSC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142374204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-30 | DOI: 10.1007/s11263-024-02233-1
Ruicong Liu, Haofei Wang, Feng Lu
Gaze, as a pivotal indicator of human emotion, plays a crucial role in various computer vision tasks. However, the accuracy of gaze estimation often deteriorates significantly when applied to unseen environments, limiting its practical value. Enhancing the generalizability of gaze estimators to new domains is therefore a critical challenge. A common limitation of existing domain adaptation research is the inability to identify and leverage truly influential factors during the adaptation process, which often results in limited accuracy and unstable adaptation. To address this issue, this article identifies a truly influential factor in the cross-domain problem: high-frequency components (HFC). This discovery stems from an analysis of gaze jitter, a frequently overlooked but impactful issue where predictions can deviate drastically even for visually similar input images. Inspired by this discovery, we propose an "embed-then-suppress" HFC manipulation strategy to adapt gaze estimation to new domains. Our method first embeds additive HFC into the input images, then performs domain adaptation by suppressing the impact of HFC. Specifically, the suppression is carried out in a contrastive manner: each original image is paired with its HFC-embedded version, enabling our method to suppress the HFC impact by contrasting the representations within each pair. The proposed method is evaluated across four cross-domain gaze estimation tasks. The experimental results show that it not only enhances gaze estimation accuracy but also significantly reduces gaze jitter in the target domain. Compared with previous studies, our method offers higher accuracy, reduced gaze jitter, and improved adaptation stability, marking its potential for practical deployment.
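The "embed" half of embed-then-suppress amounts to injecting extra high-frequency energy into an image. Here is a rough numpy sketch under assumed knobs (`cutoff`, `strength`), using the image's own FFT high-pass as the additive HFC; the paper's exact construction may differ.

```python
import numpy as np

def embed_hfc(img, cutoff=0.25, strength=0.5):
    """Add the image's own high-frequency components back onto it.

    Sketch of the 'embed' step: take a radial high-pass of the 2D FFT
    (frequencies farther than cutoff * min(H, W) from DC) and add it,
    scaled by `strength`, to the input. img: (H, W) grayscale.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    mask = dist > cutoff * min(h, w)          # keep only high frequencies
    hfc = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    return img + strength * hfc

# A constant image has no HFC, so it passes through unchanged;
# a striped image gains extra high-frequency energy.
flat = np.ones((8, 8))
out_flat = embed_hfc(flat)
stripes = np.zeros((8, 8))
stripes[::2] = 1.0
out_stripes = embed_hfc(stripes)
```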
{"title":"From Gaze Jitter to Domain Adaptation: Generalizing Gaze Estimation by Manipulating High-Frequency Components","authors":"Ruicong Liu, Haofei Wang, Feng Lu","doi":"10.1007/s11263-024-02233-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02233-1","url":null,"abstract":"<p>Gaze, as a pivotal indicator of human emotion, plays a crucial role in various computer vision tasks. However, the accuracy of gaze estimation often significantly deteriorates when applied to unseen environments, thereby limiting its practical value. Therefore, enhancing the generalizability of gaze estimators to new domains emerges as a critical challenge. A common limitation in existing domain adaptation research is the inability to identify and leverage truly influential factors during the adaptation process. This shortcoming often results in issues such as limited accuracy and unstable adaptation. To address this issue, this article discovers a truly influential factor in the cross-domain problem, <i>i.e.</i>, high-frequency components (HFC). This discovery stems from an analysis of gaze jitter-a frequently overlooked but impactful issue where predictions can deviate drastically even for visually similar input images. Inspired by this discovery, we propose an “embed-then-suppress\" HFC manipulation strategy to adapt gaze estimation to new domains. Our method first embeds additive HFC to the input images, then performs domain adaptation by suppressing the impact of HFC. Specifically, the suppression is carried out in a contrasive manner. Each original image is paired with its HFC-embedded version, thereby enabling our method to suppress the HFC impact through contrasting the representations within the pairs. The proposed method is evaluated across four cross-domain gaze estimation tasks. The experimental results show that it not only enhances gaze estimation accuracy but also significantly reduces gaze jitter in the target domain. 
Compared with previous studies, our method offers higher accuracy, reduced gaze jitter, and improved adaptation stability, marking the potential for practical deployment.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142360119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks, namely infinite-length human video synthesis and content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
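The warp half of the warp-and-inpaint step takes a source frame plus a flow map and resamples the frame along the flow. Below is a nearest-neighbour numpy sketch of that resampling alone; the animator itself is learned, typically uses bilinear sampling, and also inpaints disoccluded regions.

```python
import numpy as np

def warp_frame(src, flow):
    """Backward-warp src (H, W) by a dense flow map (H, W, 2).

    flow[..., 0] is the x displacement, flow[..., 1] the y displacement.
    Out-of-range sample positions are clipped to the image border.
    Nearest-neighbour sketch, not the paper's differentiable warp.
    """
    h, w = src.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ys = np.clip(np.round(yy + flow[..., 1]).astype(int), 0, h - 1)
    xs = np.clip(np.round(xx + flow[..., 0]).astype(int), 0, w - 1)
    return src[ys, xs]

# Zero flow is the identity; a unit x-flow samples one pixel to the right.
src = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
same = warp_frame(src, flow)
flow[..., 0] = 1.0
shifted = warp_frame(src, flow)
```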
{"title":"LEO: Generative Latent Image Animator for Human Video Synthesis","authors":"Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao","doi":"10.1007/s11263-024-02231-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02231-3","url":null,"abstract":"<p>Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/. 
</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02211-7
Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
Self-supervised learning has made significant progress in pre-training large models but struggles with small ones. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model, then distilling it to improve the generalization ability of smaller ones. In this work, we introduce a one-stage solution that obtains pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised settings, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation while the main parameters may not be fully optimized; the latter shows that the gradient direction is disordered and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and the sub-networks produce consistent outputs: slow-start training of sub-networks, online distillation, and loss re-weighting according to model size. Furthermore, theoretical results demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation; thus, a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than prior methods with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.
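One of the three techniques, loss re-weighting by model size, can be sketched as a width-proportional average of the sub-network losses. The weighting scheme below is an assumption for illustration, not the paper's exact formula.

```python
def reweighted_loss(losses, widths):
    """Combine sub-network losses, weighting each by relative width.

    Hypothetical scheme: wider (fuller) networks get larger weight so
    the main parameters dominate the gradient during backpropagation.
    losses: per-sub-network scalar losses; widths: e.g. [1.0, 0.5, 0.25].
    """
    total_w = sum(widths)
    return sum(l * w / total_w for l, w in zip(losses, widths))

# Example: full network (width 1.0) plus two slimmed sub-networks.
combined = reweighted_loss([0.9, 1.5, 2.1], [1.0, 0.5, 0.25])
```

With equal widths this reduces to a plain average; unequal widths tilt the objective toward the full network.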
{"title":"Slimmable Networks for Contrastive Self-supervised Learning","authors":"Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang","doi":"10.1007/s11263-024-02211-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02211-7","url":null,"abstract":"<p>Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by <i>gradient magnitude imbalance</i> and <i>gradient direction divergence</i>. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered, and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and sub-networks have consistent outputs. These techniques include slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results are presented to demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation. Thus a switchable linear probe layer is applied during linear evaluation. 
We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.
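The FTP idea, splitting one global image feature into per-token semantics, can be sketched with a hypothetical per-token projection; the shapes and the additive combination below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def fine_grained_prompt(img_feat, prompt_tokens, proj):
    """Condition each learnable text token on its own view of the image.

    img_feat: (D,) global image feature; prompt_tokens: (K, D) learnable
    tokens; proj: (K, D, D) hypothetical per-token projections that
    decompose the global feature into K finer-grained semantics.
    """
    return prompt_tokens + np.einsum('kde,e->kd', proj, img_feat)

# Toy demo: each token's projection is a differently scaled identity,
# so each token receives a distinct view of the same image feature.
D, K = 4, 3
img = np.ones(D)
tokens = np.zeros((K, D))
proj = np.zeros((K, D, D))
for k in range(K):
    proj[k] = np.eye(D) * (k + 1)
out = fine_grained_prompt(img, tokens, proj)
```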
{"title":"Mutual Prompt Leaning for Vision Language Models","authors":"Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Jingyuan Feng, Shengsheng Wang, Jingdong Wang","doi":"10.1007/s11263-024-02243-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02243-z","url":null,"abstract":"<p>Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. 
In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang
Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly consider single images, few efforts have been made to perform temporal attacks against video sequences. Because the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well against deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unseen target trackers, achieving transferable attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of a video, improving attack effectiveness at minor computational cost. On the defense side, we sequentially learn to estimate and remove the perturbations from input sequences to restore tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades tracking performance significantly, with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attack.
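Applying a universal adversarial perturbation is the cheapest of the attacks described: one learned perturbation, clipped to an L-infinity budget, added to every frame. Below is a sketch with an assumed `eps` budget; how the perturbation itself is trained is omitted.

```python
import numpy as np

def attack_video(frames, uap, eps=8.0):
    """Apply one universal adversarial perturbation to every frame.

    frames: (T, H, W) float pixels in [0, 255]; uap: (H, W) perturbation
    shared across all T frames. The perturbation is clipped to an
    L-infinity ball of radius eps, and results stay in valid pixel range.
    """
    delta = np.clip(uap, -eps, eps)
    return np.clip(frames + delta, 0.0, 255.0)

# Toy demo: a perturbation exceeding the budget gets clipped to eps.
frames = np.full((3, 2, 2), 128.0)
uap = np.full((2, 2), 20.0)
adv = attack_video(frames, uap, eps=8.0)
```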
{"title":"Robust Deep Object Tracking against Adversarial Attacks","authors":"Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang","doi":"10.1007/s11263-024-02226-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02226-0","url":null,"abstract":"<p>Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly reside in a single image, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame-by-frame. For the white-box attack, we generate temporal perturbations via known trackers to degrade significantly the tracking performance. We transfer the generated perturbations into unknown targeted trackers for the black-box attack to achieve transferring attacks. Furthermore, we train universal adversarial perturbations and directly add them into all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. 
Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20 | DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
In open-world recognition for safety-critical applications, providing reliable predictions from deep neural networks has become a critical requirement. Many methods have been proposed for tasks related to reliable prediction, such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective ways to improve reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of breakthroughs in generative models, we investigate whether and how expanding the training set with generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive with pre-training ViT on a substantial real dataset.
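Expanding a training set with generated data reduces, at its simplest, to mixing the two sources at a chosen ratio. The sketch below is hypothetical; `gen_ratio` and the sampling scheme are assumptions, not the paper's protocol.

```python
import random

def mix_training_set(real, generated, gen_ratio=0.8, seed=0):
    """Build a training list where gen_ratio of samples are generated.

    Keeps all real samples and draws enough generated samples (without
    replacement) so that they make up roughly gen_ratio of the result.
    """
    rng = random.Random(seed)
    n_gen = int(len(real) * gen_ratio / (1.0 - gen_ratio))
    return real + rng.sample(generated, min(n_gen, len(generated)))

# Toy demo: 10 real samples at gen_ratio=0.8 pull in 40 generated ones.
real = [f"real_{i}" for i in range(10)]
generated = [f"gen_{i}" for i in range(100)]
train = mix_training_set(real, generated, gen_ratio=0.8)
```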
{"title":"Breaking the Limits of Reliable Prediction via Generated Data","authors":"Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu","doi":"10.1007/s11263-024-02221-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02221-5","url":null,"abstract":"<p>In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep
Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed in the real open world; in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are unknown w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior works train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point-segment classification, akin to a Region Proposal Network (Ren et al., NeurIPS 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
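The class-agnostic bottom-up grouping stage can be illustrated with a flat radius-based clustering. The paper builds a hierarchical tree of segments and cuts it; this single-threshold union-find version is a deliberately simplified sketch.

```python
import numpy as np

def cluster_points(points, radius=0.5):
    """Class-agnostic bottom-up grouping of a point cloud (sketch).

    Links any two points closer than `radius` and returns connected-
    component labels 0..k-1. O(n^2) toy version; real pipelines use
    spatial indexing and a hierarchy of thresholds.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < radius:
                parent[find(i)] = find(j)

    labels = [find(i) for i in range(n)]
    remap = {root: k for k, root in enumerate(dict.fromkeys(labels))}
    return [remap[l] for l in labels]

# Two well-separated pairs of points fall into two clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = cluster_points(pts, radius=0.5)
```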
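The class-agnostic grouping described above can be illustrated with a minimal sketch: hierarchically cluster the point cloud, then obtain the final segmentation as a horizontal cut of the tree, independently of any semantic classification. The function name, linkage method, and cut threshold below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of class-agnostic hierarchical grouping followed by a
# tree cut, in the spirit of LiPSOW. Names and thresholds are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def segment_points(points: np.ndarray, cut_distance: float) -> np.ndarray:
    """Over-segment a point cloud hierarchically, then cut the tree.

    points: (N, 3) lidar points; returns an (N,) array of segment ids.
    """
    # Build an agglomerative hierarchy over the points.
    tree = linkage(points, method="ward")
    # A horizontal cut at `cut_distance` yields the final segments,
    # independently of semantic classification.
    return fcluster(tree, t=cut_distance, criterion="distance")

rng = np.random.default_rng(0)
# Two well-separated synthetic "objects".
cloud = np.vstack([rng.normal(0.0, 0.1, (50, 3)),
                   rng.normal(5.0, 0.1, (50, 3))])
labels = segment_points(cloud, cut_distance=2.0)
print(len(set(labels)))  # number of segments found
```

In the paper's pipeline, a learned binary classifier would additionally score each tree node for objectness and the cut would be weighted accordingly; the fixed-threshold cut here only conveys the structure of the idea.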
{"title":"Lidar Panoptic Segmentation in an Open World","authors":"Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep","doi":"10.1007/s11263-024-02166-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02166-9","url":null,"abstract":"<p>Addressing Lidar Panoptic Segmentation (<i>LPS</i>) is crucial for safe deployment of autonomous vehicles. <i>LPS</i> aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including <span>thing</span> classes of countable objects (e.g., pedestrians and vehicles) and <span>stuff</span> classes of amorphous regions (e.g., vegetation and road). Importantly, <i>LPS</i> requires segmenting individual <span>thing</span> instances (<i>e.g</i>., every single vehicle). Current <i>LPS</i> methods make an unrealistic assumption that the semantic class vocabulary is <i>fixed</i> in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of <i>novel</i> classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study <i>LPS</i> in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of <span>thing</span> and <span>stuff</span> classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (<i>i.e</i>., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. 
Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-19DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention of reference subjects is localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).
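The delayed subject conditioning idea can be sketched as a simple schedule: early denoising steps see only the generic text embedding (preserving layout and editability), and later steps switch to the subject-augmented embedding (preserving identity). The function name, `delay_ratio` parameter, and tensor shapes below are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of FastComposer-style delayed subject conditioning.
import numpy as np

def conditioning_schedule(text_emb, subject_emb, step, total_steps,
                          delay_ratio=0.3):
    """Pick the conditioning used at a given denoising step.

    text_emb / subject_emb: (L, D) token embeddings; subject_emb is the text
    embedding with subject-image features spliced in at the subject tokens.
    delay_ratio: fraction of early steps that ignore the subject embedding.
    """
    if step < delay_ratio * total_steps:
        return text_emb       # early steps: plain prompt, keeps editability
    return subject_emb        # later steps: augmented prompt, keeps identity

text = np.zeros((77, 768))    # stand-in for a CLIP text embedding
subject = np.ones((77, 768))  # stand-in for the subject-augmented embedding
early = conditioning_schedule(text, subject, step=5, total_steps=50)
late = conditioning_schedule(text, subject, step=40, total_steps=50)
```

In a real diffusion pipeline this choice would be made inside the denoising loop before each U-Net call; the hard switch here is the simplest form of the schedule.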
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02227-z","url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>(times )</span>–2500<span>(times )</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. 
Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-15DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan
Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.
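The image-level and class-level stages can be sketched as a scoring-and-ranking step: score each unlabeled image by the consistency of its reliable boxes, then favor images whose predicted classes are rare in the labeled pool. All names, the scoring formulas, and the toy data below are hypothetical illustrations, not the paper's implementation.

```python
# Illustrative sketch of HALD-style image- and class-level sample selection.
from collections import Counter

def image_consistency(boxes):
    """boxes: list of (class_id, confidence) predictions kept by the
    box-level filter; higher mean confidence = more reliable image."""
    if not boxes:
        return 0.0
    return sum(conf for _, conf in boxes) / len(boxes)

def select_samples(unlabeled, labeled_class_counts, budget):
    """unlabeled: {image_id: [(class_id, conf), ...]}. Rank candidates so
    that low-consistency (informative) images containing classes that are
    rare in the labeled pool come first, then take `budget` images."""
    def rarity(boxes):
        if not boxes:
            return 0.0
        return sum(1.0 / (1 + labeled_class_counts.get(c, 0))
                   for c, _ in boxes) / len(boxes)

    scored = [(image_consistency(b) - rarity(b), img)
              for img, b in unlabeled.items()]
    scored.sort()  # low consistency and rare classes sort first
    return [img for _, img in scored[:budget]]

pool = {
    "img_a": [("person", 0.95), ("person", 0.9)],  # confident, common class
    "img_b": [("tricycle", 0.4)],                  # uncertain, rare class
    "img_c": [("car", 0.6), ("tricycle", 0.5)],
}
counts = Counter({"person": 500, "car": 300, "tricycle": 3})
print(select_samples(pool, counts, budget=2))  # → ['img_b', 'img_c']
```

The uncertain image with the rare class ranks first, matching the intuition that it is the most valuable to annotate next.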
{"title":"Hierarchical Active Learning for Low-Altitude Drone-View Object Detection","authors":"Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan","doi":"10.1007/s11263-024-02228-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02228-y","url":null,"abstract":"<p>Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. 
The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}