Pub Date: 2024-10-03 | DOI: 10.1007/s11263-024-02244-y
Xianzhu Liu, Haozhe Xie, Shengping Zhang, Hongxun Yao, Rongrong Ji, Liqiang Nie, Dacheng Tao
Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions containing multiple objects close to each other, especially objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which comprises Semantic-guided Fusion (SGF) and a Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the geometric information lost to missing values in depth images, and thus remains more robust to unreliable depth input. VGSP exploits the mutual benefit between the SC and SSC tasks, making SSC focus on predicting the categories of voxels with high occupancy probabilities while allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. Models and code are available at https://github.com/aipixel/SG-SSC.
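The semantic-guided fusion step can be pictured as a per-pixel weighted blend of the two feature streams. Below is a minimal numpy sketch of that idea, not the paper's actual SGF module; the function name, the inputs, and the way segmentation confidence gates the depth weight are all illustrative assumptions.

```python
import numpy as np

def semantic_guided_fusion(rgb_feat, depth_feat, seg_conf, depth_valid):
    """Fuse RGB and depth features per pixel (hypothetical sketch).

    Where depth is missing, fall back to RGB features; elsewhere the
    2D semantic confidence modulates how strongly depth is trusted.
    rgb_feat, depth_feat: (H, W, C); seg_conf, depth_valid: (H, W) in [0, 1].
    """
    w = (seg_conf * depth_valid)[..., None]   # per-pixel depth weight
    return w * depth_feat + (1.0 - w) * rgb_feat

# Toy demonstration with one simulated depth hole at pixel (0, 0).
H, W, C = 4, 4, 8
rgb = np.ones((H, W, C))
depth = 2.0 * np.ones((H, W, C))
conf = np.full((H, W), 0.5)
valid = np.ones((H, W))
valid[0, 0] = 0.0                             # missing depth value
fused = semantic_guided_fusion(rgb, depth, conf, valid)
```

At the hole the fused feature equals the RGB feature, which is the robustness-to-missing-depth behaviour the abstract describes.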
{"title":"2D Semantic-Guided Semantic Scene Completion","authors":"Xianzhu Liu, Haozhe Xie, Shengping Zhang, Hongxun Yao, Rongrong Ji, Liqiang Nie, Dacheng Tao","doi":"10.1007/s11263-024-02244-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02244-y","url":null,"abstract":"<p>Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which involves Semantic-guided Fusion (SGF) and Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the missing geometric information caused by the missing values in depth images, thus performing more robustly to unreliable depth information. VGSP exploits the mutual benefit between SC and SSC tasks, making SSC more focused on predicting the categories of voxels with high occupancy probabilities and also allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. 
Models and code are available at https://github.com/aipixel/SG-SSC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142374204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-30 | DOI: 10.1007/s11263-024-02233-1
Ruicong Liu, Haofei Wang, Feng Lu
Gaze, as a pivotal indicator of human emotion, plays a crucial role in various computer vision tasks. However, the accuracy of gaze estimation often deteriorates significantly when applied to unseen environments, limiting its practical value. Enhancing the generalizability of gaze estimators to new domains is therefore a critical challenge. A common limitation of existing domain adaptation research is the inability to identify and leverage truly influential factors during the adaptation process, which often results in limited accuracy and unstable adaptation. To address this issue, this article identifies a truly influential factor in the cross-domain problem: high-frequency components (HFC). This discovery stems from an analysis of gaze jitter, a frequently overlooked but impactful issue where predictions can deviate drastically even for visually similar input images. Inspired by this discovery, we propose an "embed-then-suppress" HFC manipulation strategy to adapt gaze estimation to new domains. Our method first embeds additive HFC into the input images, then performs domain adaptation by suppressing the impact of HFC. Specifically, the suppression is carried out in a contrastive manner: each original image is paired with its HFC-embedded version, enabling our method to suppress the HFC impact by contrasting the representations within each pair. The proposed method is evaluated across four cross-domain gaze estimation tasks. The experimental results show that it not only enhances gaze estimation accuracy but also significantly reduces gaze jitter in the target domain. Compared with previous studies, our method offers higher accuracy, reduced gaze jitter, and improved adaptation stability, marking its potential for practical deployment.
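The "embed" half of embed-then-suppress amounts to injecting extra high-frequency energy into an image. Here is a rough numpy sketch under assumed knobs (`cutoff`, `strength`), using the image's own FFT high-pass as the additive HFC; the paper's exact construction may differ.

```python
import numpy as np

def embed_hfc(img, cutoff=0.25, strength=0.5):
    """Add the image's own high-frequency components back onto it.

    Sketch of the 'embed' step: take a radial high-pass of the 2D FFT
    (frequencies farther than cutoff * min(H, W) from DC) and add it,
    scaled by `strength`, to the input. img: (H, W) grayscale.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    mask = dist > cutoff * min(h, w)          # keep only high frequencies
    hfc = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    return img + strength * hfc

# A constant image has no HFC, so it passes through unchanged;
# a striped image gains extra high-frequency energy.
flat = np.ones((8, 8))
out_flat = embed_hfc(flat)
stripes = np.zeros((8, 8))
stripes[::2] = 1.0
out_stripes = embed_hfc(stripes)
```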
{"title":"From Gaze Jitter to Domain Adaptation: Generalizing Gaze Estimation by Manipulating High-Frequency Components","authors":"Ruicong Liu, Haofei Wang, Feng Lu","doi":"10.1007/s11263-024-02233-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02233-1","url":null,"abstract":"<p>Gaze, as a pivotal indicator of human emotion, plays a crucial role in various computer vision tasks. However, the accuracy of gaze estimation often significantly deteriorates when applied to unseen environments, thereby limiting its practical value. Therefore, enhancing the generalizability of gaze estimators to new domains emerges as a critical challenge. A common limitation in existing domain adaptation research is the inability to identify and leverage truly influential factors during the adaptation process. This shortcoming often results in issues such as limited accuracy and unstable adaptation. To address this issue, this article discovers a truly influential factor in the cross-domain problem, <i>i.e.</i>, high-frequency components (HFC). This discovery stems from an analysis of gaze jitter-a frequently overlooked but impactful issue where predictions can deviate drastically even for visually similar input images. Inspired by this discovery, we propose an “embed-then-suppress\" HFC manipulation strategy to adapt gaze estimation to new domains. Our method first embeds additive HFC to the input images, then performs domain adaptation by suppressing the impact of HFC. Specifically, the suppression is carried out in a contrasive manner. Each original image is paired with its HFC-embedded version, thereby enabling our method to suppress the HFC impact through contrasting the representations within the pairs. The proposed method is evaluated across four cross-domain gaze estimation tasks. The experimental results show that it not only enhances gaze estimation accuracy but also significantly reduces gaze jitter in the target domain. 
Compared with previous studies, our method offers higher accuracy, reduced gaze jitter, and improved adaptation stability, marking the potential for practical deployment.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142360119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks, namely infinite-length human video synthesis and content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
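The warp half of the warp-and-inpaint step takes a source frame plus a flow map and resamples the frame along the flow. Below is a nearest-neighbour numpy sketch of that resampling alone; the animator itself is learned, typically uses bilinear sampling, and also inpaints disoccluded regions.

```python
import numpy as np

def warp_frame(src, flow):
    """Backward-warp src (H, W) by a dense flow map (H, W, 2).

    flow[..., 0] is the x displacement, flow[..., 1] the y displacement.
    Out-of-range sample positions are clipped to the image border.
    Nearest-neighbour sketch, not the paper's differentiable warp.
    """
    h, w = src.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ys = np.clip(np.round(yy + flow[..., 1]).astype(int), 0, h - 1)
    xs = np.clip(np.round(xx + flow[..., 0]).astype(int), 0, w - 1)
    return src[ys, xs]

# Zero flow is the identity; a unit x-flow samples one pixel to the right.
src = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
same = warp_frame(src, flow)
flow[..., 0] = 1.0
shifted = warp_frame(src, flow)
```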
{"title":"LEO: Generative Latent Image Animator for Human Video Synthesis","authors":"Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao","doi":"10.1007/s11263-024-02231-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02231-3","url":null,"abstract":"<p>Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/. 
</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02211-7
Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
Self-supervised learning has made significant progress in pre-training large models but struggles with small ones. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model, then distilling it to improve the generalization ability of smaller ones. In this work, we introduce a one-stage solution that obtains pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised settings, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation while the main parameters may not be fully optimized; the latter shows that the gradient direction is disordered and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and the sub-networks produce consistent outputs: slow-start training of sub-networks, online distillation, and loss re-weighting according to model size. Furthermore, theoretical results demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation; thus, a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than prior methods with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.
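One of the three techniques, loss re-weighting by model size, can be sketched as a width-proportional average of the sub-network losses. The weighting scheme below is an assumption for illustration, not the paper's exact formula.

```python
def reweighted_loss(losses, widths):
    """Combine sub-network losses, weighting each by relative width.

    Hypothetical scheme: wider (fuller) networks get larger weight so
    the main parameters dominate the gradient during backpropagation.
    losses: per-sub-network scalar losses; widths: e.g. [1.0, 0.5, 0.25].
    """
    total_w = sum(widths)
    return sum(l * w / total_w for l, w in zip(losses, widths))

# Example: full network (width 1.0) plus two slimmed sub-networks.
combined = reweighted_loss([0.9, 1.5, 2.1], [1.0, 0.5, 0.25])
```

With equal widths this reduces to a plain average; unequal widths tilt the objective toward the full network.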
{"title":"Slimmable Networks for Contrastive Self-supervised Learning","authors":"Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang","doi":"10.1007/s11263-024-02211-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02211-7","url":null,"abstract":"<p>Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by <i>gradient magnitude imbalance</i> and <i>gradient direction divergence</i>. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered, and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and sub-networks have consistent outputs. These techniques include slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results are presented to demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation. Thus a switchable linear probe layer is applied during linear evaluation. 
We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.
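The FTP idea, splitting one global image feature into per-token semantics, can be sketched with a hypothetical per-token projection; the shapes and the additive combination below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def fine_grained_prompt(img_feat, prompt_tokens, proj):
    """Condition each learnable text token on its own view of the image.

    img_feat: (D,) global image feature; prompt_tokens: (K, D) learnable
    tokens; proj: (K, D, D) hypothetical per-token projections that
    decompose the global feature into K finer-grained semantics.
    """
    return prompt_tokens + np.einsum('kde,e->kd', proj, img_feat)

# Toy demo: each token's projection is a differently scaled identity,
# so each token receives a distinct view of the same image feature.
D, K = 4, 3
img = np.ones(D)
tokens = np.zeros((K, D))
proj = np.zeros((K, D, D))
for k in range(K):
    proj[k] = np.eye(D) * (k + 1)
out = fine_grained_prompt(img, tokens, proj)
```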
{"title":"Mutual Prompt Leaning for Vision Language Models","authors":"Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Jingyuan Feng, Shengsheng Wang, Jingdong Wang","doi":"10.1007/s11263-024-02243-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02243-z","url":null,"abstract":"<p>Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. 
In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang
Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly consider single images, few efforts have been made to perform temporal attacks against video sequences. Because the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well against deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unseen target trackers, achieving transferable attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of a video, improving attack effectiveness at minor computational cost. On the defense side, we sequentially learn to estimate and remove the perturbations from input sequences to restore tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades tracking performance significantly, with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attack.
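Applying a universal adversarial perturbation is the cheapest of the attacks described: one learned perturbation, clipped to an L-infinity budget, added to every frame. Below is a sketch with an assumed `eps` budget; how the perturbation itself is trained is omitted.

```python
import numpy as np

def attack_video(frames, uap, eps=8.0):
    """Apply one universal adversarial perturbation to every frame.

    frames: (T, H, W) float pixels in [0, 255]; uap: (H, W) perturbation
    shared across all T frames. The perturbation is clipped to an
    L-infinity ball of radius eps, and results stay in valid pixel range.
    """
    delta = np.clip(uap, -eps, eps)
    return np.clip(frames + delta, 0.0, 255.0)

# Toy demo: a perturbation exceeding the budget gets clipped to eps.
frames = np.full((3, 2, 2), 128.0)
uap = np.full((2, 2), 20.0)
adv = attack_video(frames, uap, eps=8.0)
```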
{"title":"Robust Deep Object Tracking against Adversarial Attacks","authors":"Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang","doi":"10.1007/s11263-024-02226-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02226-0","url":null,"abstract":"<p>Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly reside in a single image, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame-by-frame. For the white-box attack, we generate temporal perturbations via known trackers to degrade significantly the tracking performance. We transfer the generated perturbations into unknown targeted trackers for the black-box attack to achieve transferring attacks. Furthermore, we train universal adversarial perturbations and directly add them into all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. 
Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20 | DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
In open-world recognition for safety-critical applications, providing reliable predictions from deep neural networks has become a critical requirement. Many methods have been proposed for tasks related to reliable prediction, such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective ways to improve reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of breakthroughs in generative models, we investigate whether and how expanding the training set with generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive with pre-training ViT on a substantial real dataset.
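Expanding a training set with generated data reduces, at its simplest, to mixing the two sources at a chosen ratio. The sketch below is hypothetical; `gen_ratio` and the sampling scheme are assumptions, not the paper's protocol.

```python
import random

def mix_training_set(real, generated, gen_ratio=0.8, seed=0):
    """Build a training list where gen_ratio of samples are generated.

    Keeps all real samples and draws enough generated samples (without
    replacement) so that they make up roughly gen_ratio of the result.
    """
    rng = random.Random(seed)
    n_gen = int(len(real) * gen_ratio / (1.0 - gen_ratio))
    return real + rng.sample(generated, min(n_gen, len(generated)))

# Toy demo: 10 real samples at gen_ratio=0.8 pull in 40 generated ones.
real = [f"real_{i}" for i in range(10)]
generated = [f"gen_{i}" for i in range(100)]
train = mix_training_set(real, generated, gen_ratio=0.8)
```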
{"title":"Breaking the Limits of Reliable Prediction via Generated Data","authors":"Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu","doi":"10.1007/s11263-024-02221-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02221-5","url":null,"abstract":"<p>In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep
Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed in the real open world; in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are unknown w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior works train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point-segment classification, akin to a Region Proposal Network (Ren et al., NeurIPS 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
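The class-agnostic bottom-up grouping stage can be illustrated with a flat radius-based clustering. The paper builds a hierarchical tree of segments and cuts it; this single-threshold union-find version is a deliberately simplified sketch.

```python
import numpy as np

def cluster_points(points, radius=0.5):
    """Class-agnostic bottom-up grouping of a point cloud (sketch).

    Links any two points closer than `radius` and returns connected-
    component labels 0..k-1. O(n^2) toy version; real pipelines use
    spatial indexing and a hierarchy of thresholds.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < radius:
                parent[find(i)] = find(j)

    labels = [find(i) for i in range(n)]
    remap = {root: k for k, root in enumerate(dict.fromkeys(labels))}
    return [remap[l] for l in labels]

# Two well-separated pairs of points fall into two clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = cluster_points(pts, radius=0.5)
```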
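The class-agnostic grouping described above can be illustrated with a minimal sketch: hierarchically cluster the point cloud, then obtain the final segmentation as a horizontal cut of the tree, independently of any semantic classification. The function name, linkage method, and cut threshold below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of class-agnostic hierarchical grouping followed by a
# tree cut, in the spirit of LiPSOW. Names and thresholds are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def segment_points(points: np.ndarray, cut_distance: float) -> np.ndarray:
    """Over-segment a point cloud hierarchically, then cut the tree.

    points: (N, 3) lidar points; returns an (N,) array of segment ids.
    """
    # Build an agglomerative hierarchy over the points.
    tree = linkage(points, method="ward")
    # A horizontal cut at `cut_distance` yields the final segments,
    # independently of semantic classification.
    return fcluster(tree, t=cut_distance, criterion="distance")

rng = np.random.default_rng(0)
# Two well-separated synthetic "objects".
cloud = np.vstack([rng.normal(0.0, 0.1, (50, 3)),
                   rng.normal(5.0, 0.1, (50, 3))])
labels = segment_points(cloud, cut_distance=2.0)
print(len(set(labels)))  # number of segments found
```

In the paper's pipeline, a learned binary classifier would additionally score each tree node for objectness and the cut would be weighted accordingly; the fixed-threshold cut here only conveys the structure of the idea.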
{"title":"Lidar Panoptic Segmentation in an Open World","authors":"Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep","doi":"10.1007/s11263-024-02166-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02166-9","url":null,"abstract":"<p>Addressing Lidar Panoptic Segmentation (<i>LPS</i>) is crucial for safe deployment of autonomous vehicles. <i>LPS</i> aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including <span>thing</span> classes of countable objects (e.g., pedestrians and vehicles) and <span>stuff</span> classes of amorphous regions (e.g., vegetation and road). Importantly, <i>LPS</i> requires segmenting individual <span>thing</span> instances (<i>e.g</i>., every single vehicle). Current <i>LPS</i> methods make an unrealistic assumption that the semantic class vocabulary is <i>fixed</i> in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of <i>novel</i> classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study <i>LPS</i> in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of <span>thing</span> and <span>stuff</span> classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (<i>i.e</i>., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. 
Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-19DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention of reference subjects is localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).
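The delayed subject conditioning idea can be sketched as a simple schedule: early denoising steps see only the generic text embedding (preserving layout and editability), and later steps switch to the subject-augmented embedding (preserving identity). The function name, `delay_ratio` parameter, and tensor shapes below are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of FastComposer-style delayed subject conditioning.
import numpy as np

def conditioning_schedule(text_emb, subject_emb, step, total_steps,
                          delay_ratio=0.3):
    """Pick the conditioning used at a given denoising step.

    text_emb / subject_emb: (L, D) token embeddings; subject_emb is the text
    embedding with subject-image features spliced in at the subject tokens.
    delay_ratio: fraction of early steps that ignore the subject embedding.
    """
    if step < delay_ratio * total_steps:
        return text_emb       # early steps: plain prompt, keeps editability
    return subject_emb        # later steps: augmented prompt, keeps identity

text = np.zeros((77, 768))    # stand-in for a CLIP text embedding
subject = np.ones((77, 768))  # stand-in for the subject-augmented embedding
early = conditioning_schedule(text, subject, step=5, total_steps=50)
late = conditioning_schedule(text, subject, step=40, total_steps=50)
```

In a real diffusion pipeline this choice would be made inside the denoising loop before each U-Net call; the hard switch here is the simplest form of the schedule.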
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02227-z","url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>(times )</span>–2500<span>(times )</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. 
Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-15DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan
Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.
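The image-level and class-level stages can be sketched as a scoring-and-ranking step: score each unlabeled image by the consistency of its reliable boxes, then favor images whose predicted classes are rare in the labeled pool. All names, the scoring formulas, and the toy data below are hypothetical illustrations, not the paper's implementation.

```python
# Illustrative sketch of HALD-style image- and class-level sample selection.
from collections import Counter

def image_consistency(boxes):
    """boxes: list of (class_id, confidence) predictions kept by the
    box-level filter; higher mean confidence = more reliable image."""
    if not boxes:
        return 0.0
    return sum(conf for _, conf in boxes) / len(boxes)

def select_samples(unlabeled, labeled_class_counts, budget):
    """unlabeled: {image_id: [(class_id, conf), ...]}. Rank candidates so
    that low-consistency (informative) images containing classes that are
    rare in the labeled pool come first, then take `budget` images."""
    def rarity(boxes):
        if not boxes:
            return 0.0
        return sum(1.0 / (1 + labeled_class_counts.get(c, 0))
                   for c, _ in boxes) / len(boxes)

    scored = [(image_consistency(b) - rarity(b), img)
              for img, b in unlabeled.items()]
    scored.sort()  # low consistency and rare classes sort first
    return [img for _, img in scored[:budget]]

pool = {
    "img_a": [("person", 0.95), ("person", 0.9)],  # confident, common class
    "img_b": [("tricycle", 0.4)],                  # uncertain, rare class
    "img_c": [("car", 0.6), ("tricycle", 0.5)],
}
counts = Counter({"person": 500, "car": 300, "tricycle": 3})
print(select_samples(pool, counts, budget=2))  # → ['img_b', 'img_c']
```

The uncertain image with the rare class ranks first, matching the intuition that it is the most valuable to annotate next.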
{"title":"Hierarchical Active Learning for Low-Altitude Drone-View Object Detection","authors":"Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan","doi":"10.1007/s11263-024-02228-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02228-y","url":null,"abstract":"<p>Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. 
The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}