International Journal of Computer Vision最新文献_第9页

InfoPro: Locally Supervised Deep Learning by Maximizing Information Propagation 最大化信息传播的局部监督深度学习

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-11 DOI: 10.1007/s11263-024-02296-0

Yulin Wang, Zanlin Ni, Yifan Pu, Cai Zhou, Jixuan Ying, Shiji Song, Gao Huang

End-to-end (E2E) training has become the de-facto standard for training modern deep networks, e.g., ConvNets and vision Transformers (ViTs). Typically, a global error signal is generated at the end of a model and back-propagated layer-by-layer to update the parameters. This paper shows that the reliance on back-propagating global errors may not be necessary for deep learning. More precisely, deep networks with a competitive or even better performance can be obtained by purely leveraging locally supervised learning, i.e., splitting a network into gradient-isolated modules and training them with local supervision signals. However, such an extension is non-trivial. Our experimental and theoretical analysis demonstrates that simply training local modules with an E2E objective tends to be short-sighted, collapsing task-relevant information at early layers, and hurting the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible, while progressively discarding task-irrelevant information. As InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. We evaluate InfoPro extensively with ConvNets and ViTs, based on twelve computer vision benchmarks organized into five tasks (i.e., image/video recognition, semantic/instance segmentation, and object detection). InfoPro exhibits superior efficiency over E2E training in terms of GPU memory footprints, convergence speed, and training data scale. Moreover, InfoPro enables the effective training of more parameter- and computation-efficient models (e.g., much deeper networks), which suffer from inferior performance when trained in E2E. Code: https://github.com/blackfeather-wang/InfoPro-Pytorch.

端到端（E2E）训练已经成为训练现代深度网络的事实上的标准，例如ConvNets和视觉变压器（vit）。通常，在模型结束时生成一个全局错误信号，并逐层反向传播以更新参数。本文表明，对反向传播的全局误差的依赖可能不是深度学习所必需的。更准确地说，可以通过纯粹利用局部监督学习来获得具有竞争力甚至更好性能的深度网络，即将网络分成梯度隔离的模块并用局部监督信号进行训练。然而，这样的扩展是不平凡的。我们的实验和理论分析表明，简单地用E2E目标训练局部模块往往是短视的，会在早期层崩溃任务相关信息，并损害整个模型的性能。为了避免这个问题，我们提出了信息传播（InfoPro）损失，它鼓励本地模块保留尽可能多的有用信息，同时逐步丢弃与任务无关的信息。由于InfoPro损失在原始形式下难以计算，我们推导了可行的上界作为替代优化目标，得到了一个简单而有效的算法。我们使用ConvNets和ViTs对InfoPro进行了广泛的评估，基于12个计算机视觉基准，分为5个任务（即图像/视频识别、语义/实例分割和目标检测）。InfoPro在GPU内存占用、收敛速度和训练数据规模方面表现出优于E2E训练的效率。此外，InfoPro能够有效地训练更多参数和计算效率高的模型（例如，更深层的网络），这些模型在E2E中训练时性能较差。代码:https://github.com/blackfeather-wang/InfoPro-Pytorch。

{"title":"InfoPro: Locally Supervised Deep Learning by Maximizing Information Propagation","authors":"Yulin Wang, Zanlin Ni, Yifan Pu, Cai Zhou, Jixuan Ying, Shiji Song, Gao Huang","doi":"10.1007/s11263-024-02296-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02296-0","url":null,"abstract":"End-to-end (E2E) training has become the de-facto standard for training modern deep networks, e.g., ConvNets and vision Transformers (ViTs). Typically, a global error signal is generated at the end of a model and back-propagated layer-by-layer to update the parameters. This paper shows that the reliance on back-propagating global errors may not be necessary for deep learning. More precisely, deep networks with a competitive or even better performance can be obtained by purely leveraging locally supervised learning, i.e., splitting a network into gradient-isolated modules and training them with local supervision signals. However, such an extension is non-trivial. Our experimental and theoretical analysis demonstrates that simply training local modules with an E2E objective tends to be short-sighted, collapsing task-relevant information at early layers, and hurting the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible, while progressively discarding task-irrelevant information. As InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. We evaluate InfoPro extensively with ConvNets and ViTs, based on twelve computer vision benchmarks organized into five tasks (i.e., image/video recognition, semantic/instance segmentation, and object detection). InfoPro exhibits superior efficiency over E2E training in terms of GPU memory footprints, convergence speed, and training data scale. Moreover, InfoPro enables the effective training of more parameter- and computation-efficient models (e.g., much deeper networks), which suffer from inferior performance when trained in E2E. Code: https://github.com/blackfeather-wang/InfoPro-Pytorch.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"113 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142805404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection CMAE-3D：用于自监督3D对象检测的对比蒙面自动编码器

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-11 DOI: 10.1007/s11263-024-02313-2

Yanan Zhang, Jiaxin Chen, Di Huang

LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization capabilities in the 3D real-world space. However, existing methods heavily rely on time-consuming and laborious large-scale labeled LiDAR data, posing a bottleneck for both performance improvement and practical applications. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed as CMAE-3D, which is a promising solution to effectively alleviate label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to fully utilize the complementary characteristics of global semantic representation and local spatial perception. Furthermore, from the perspective of MAE, we develop the Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distribution, and design the Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while mitigating the redundant reconstruction of low-level details. From the perspective of CL, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating the issue of negative sample mismatch from both the voxel-level and frame-level. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint and PV-RCNN) on three popular datasets (KITTI, Waymo and nuScenes).

基于激光雷达的三维目标检测是自动驾驶的关键任务，因为它在三维现实空间中具有准确的目标识别和定位能力。然而，现有的方法严重依赖于耗时费力的大规模标记激光雷达数据，这对性能改进和实际应用都构成了瓶颈。在本文中，我们提出了用于自监督3D物体检测的对比蒙面自动编码器，称为CMAE-3D，这是一种很有前途的解决方案，可以有效地减轻3D感知中的标签依赖。具体而言，我们将对比学习（CL）和掩码自动编码器（MAE）整合到一个统一的框架中，以充分利用全局语义表征和局部空间感知的互补特性。此外，从MAE的角度，我们开发了几何-语义混合掩蔽（GSHM）来选择性地掩盖前背景不平衡和密度分布不均匀的点云中的代表性区域，并设计了多尺度潜在特征重建（MLFR）来捕获高级语义特征，同时减少低级细节的冗余重建。从层次关系对比学习的角度出发，提出层次关系对比学习（HRCL）来挖掘丰富的语义相似信息，同时从体素级和帧级两个层面缓解负样本不匹配的问题。大量的实验证明了我们的预训练方法在三个流行的数据集（KITTI， Waymo和nuScenes）上应用于多个主流3D物体检测器（SECOND， CenterPoint和PV-RCNN）时的有效性。

{"title":"CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection","authors":"Yanan Zhang, Jiaxin Chen, Di Huang","doi":"10.1007/s11263-024-02313-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02313-2","url":null,"abstract":"LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization capabilities in the 3D real-world space. However, existing methods heavily rely on time-consuming and laborious large-scale labeled LiDAR data, posing a bottleneck for both performance improvement and practical applications. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed as CMAE-3D, which is a promising solution to effectively alleviate label dependency in 3D perception. Specifically, we integrate Contrastive Learning (CL) and Masked AutoEncoders (MAE) into one unified framework to fully utilize the complementary characteristics of global semantic representation and local spatial perception. Furthermore, from the perspective of MAE, we develop the Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distribution, and design the Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while mitigating the redundant reconstruction of low-level details. From the perspective of CL, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic similarity information while alleviating the issue of negative sample mismatch from both the voxel-level and frame-level. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint and PV-RCNN) on three popular datasets (KITTI, Waymo and nuScenes).\u0000","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"12 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142809693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Language-Guided Hierarchical Fine-Grained Image Forgery Detection and Localization 语言引导的分层细粒度图像伪造检测与定位

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-10 DOI: 10.1007/s11263-024-02255-9

Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu

Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++. Specifically, HiFi-Net++ contains four components: multi-branch feature extractor, language-guided forgery localization enhancer, as well as classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. In addition, the language-guided forgery localization enhancer (LFLE), containing image and text encoders learned by contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and then generates the visual embedding and manipulation score maps, which are used to further improve HiFi-Net++ manipulation localization performance. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 8 different benchmarks for both tasks of IFDL and forgery attribute classification. Our source code and dataset can be found: github.com/CHELSEA234/HiFi-IFDL.

cnn合成域和图像编辑域生成的图像伪造属性存在较大差异，这给统一的图像伪造检测与定位（IFDL）带来了挑战。为此，我们提出了一个分层细粒度的IFDL表示学习公式。具体来说，我们首先用不同级别的多个标签表示被操纵图像的伪造属性。然后，我们使用它们之间的层次依赖关系在这些级别上执行细粒度分类。因此，鼓励算法学习不同伪造属性的综合特征和固有的层次性质，从而改进IFDL表示。在这项工作中，我们提出了一个语言引导的分层细粒度IFDL，表示为HiFi-Net++。具体来说，HiFi-Net++包含四个组件：多分支特征提取器、语言引导伪造定位增强器以及分类和定位模块。多分支特征提取器的每个分支学习对一个级别的伪造属性进行分类，定位模块和分类模块分别对像素级伪造区域进行分割，对图像级伪造进行检测。此外，使用语言引导的伪造定位增强器（LFLE），其中包含通过对比语言图像预训练（CLIP）学习的图像和文本编码器，进一步丰富了IFDL表示。LFLE将特定设计的文本和给定图像作为多模态输入，生成可视化嵌入和操作评分图，用于进一步提高HiFi-Net++操作定位性能。最后，我们构建了一个分层的细粒度数据集，以方便我们的研究。我们在8个不同的基准上对IFDL和伪造属性分类任务进行了验证。我们的源代码和数据集可以找到：github.com/CHELSEA234/HiFi-IFDL。

{"title":"Language-Guided Hierarchical Fine-Grained Image Forgery Detection and Localization","authors":"Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu","doi":"10.1007/s11263-024-02255-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02255-9","url":null,"abstract":"Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++. Specifically, HiFi-Net++ contains four components: multi-branch feature extractor, language-guided forgery localization enhancer, as well as classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. In addition, the language-guided forgery localization enhancer (LFLE), containing image and text encoders learned by contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and then generates the visual embedding and manipulation score maps, which are used to further improve HiFi-Net++ manipulation localization performance. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 8 different benchmarks for both tasks of IFDL and forgery attribute classification. Our source code and dataset can be found: github.com/CHELSEA234/HiFi-IFDL.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142805402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

On Mitigating Stability-Plasticity Dilemma in CLIP-guided Image Morphing via Geodesic Distillation Loss 通过大地蒸馏损失缓解 CLIP 引导图像变形中的稳定性-弹性困境

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-10 DOI: 10.1007/s11263-024-02308-z

Yeongtak Oh, Saehyung Lee, Uiwon Hwang, Sungroh Yoon

Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable results in text-guided image morphing by leveraging several unconditional generative models. However, existing CLIP-guided methods face challenges in achieving photorealistic morphing when adapting the generator from the source to the target domain. Specifically, current guidance methods fail to provide detailed explanations of the morphing regions within the image, leading to misguidance and catastrophic forgetting of the original image’s fidelity. In this paper, we propose a novel approach considering proper regularization losses to overcome these difficulties by addressing the SP dilemma in CLIP guidance. Our approach consists of two key components: (1) a geodesic cosine similarity loss that minimizes inter-modality features (i.e., image and text) in a projected subspace of CLIP space, and (2) a latent regularization loss that minimizes intra-modality features (i.e., image and image) on the image manifold. By replacing the naive directional CLIP loss in a drop-in replacement manner, our method achieves superior morphing results for both images and videos across various benchmarks, including CLIP-inversion.

大规模语言视觉预训练模型，如CLIP，利用几个无条件生成模型在文本引导图像变形方面取得了显着的效果。然而，现有的clip引导方法在将生成器从源域调整到目标域时，在实现逼真变形方面面临挑战。具体来说，目前的制导方法不能提供图像内变形区域的详细解释，导致误导和灾难性地忘记原始图像的保真度。在本文中，我们提出了一种考虑适当正则化损失的新方法，通过解决CLIP制导中的SP困境来克服这些困难。我们的方法由两个关键组成部分组成：(1)在CLIP空间的投影子空间中最小化模态间特征（即图像和文本）的测地余弦相似性损失，以及(2)最小化图像流形上的模态内特征（即图像和图像）的潜在正则化损失。通过以插入式替换的方式替换原始的定向CLIP损失，我们的方法在各种基准测试中（包括CLIP反转）对图像和视频都获得了出色的变形结果。

{"title":"On Mitigating Stability-Plasticity Dilemma in CLIP-guided Image Morphing via Geodesic Distillation Loss","authors":"Yeongtak Oh, Saehyung Lee, Uiwon Hwang, Sungroh Yoon","doi":"10.1007/s11263-024-02308-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02308-z","url":null,"abstract":"Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable results in text-guided image morphing by leveraging several unconditional generative models. However, existing CLIP-guided methods face challenges in achieving photorealistic morphing when adapting the generator from the source to the target domain. Specifically, current guidance methods fail to provide detailed explanations of the morphing regions within the image, leading to misguidance and catastrophic forgetting of the original image’s fidelity. In this paper, we propose a novel approach considering proper regularization losses to overcome these difficulties by addressing the SP dilemma in CLIP guidance. Our approach consists of two key components: (1) a geodesic cosine similarity loss that minimizes inter-modality features (i.e., image and text) in a projected subspace of CLIP space, and (2) a latent regularization loss that minimizes intra-modality features (i.e., image and image) on the image manifold. By replacing the naive directional CLIP loss in a drop-in replacement manner, our method achieves superior morphing results for both images and videos across various benchmarks, including CLIP-inversion.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"10 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Image-Based Virtual Try-On: A Survey 基于图像的虚拟试穿：一项调查

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-10 DOI: 10.1007/s11263-024-02305-2

Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu

Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications and an absence of comprehensive overview of this field to accelerate the development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We additionally apply CLIP to assess the semantic alignment of try-on results, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, unresolved issues are highlighted and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.

基于图像的虚拟试戴旨在将穿着自然的人的形象与服装形象合成在一起，这是一场网络购物的革命，激发了图像生成领域的相关课题，具有研究意义和商业潜力。然而，目前的研究进展与商业应用之间存在差距，缺乏对该领域的全面概述来加速发展。在本调查中，我们从管道架构、人物表现和关键模块（如试衣指示、服装翘曲和试衣阶段）等方面全面分析了最新的技术和方法。我们还应用CLIP来评估试穿结果的语义一致性，并在同一数据集上使用统一实现的评估指标评估代表性方法。除了对现有的开源方法进行定量和定性评估外，还强调了尚未解决的问题，并展望了未来的研究方向，以确定关键趋势并激发进一步的探索。统一实施的评估指标、数据集和收集方法将在https://github.com/little-misfit/Survey-Of-Virtual-Try-On上公开。

{"title":"Image-Based Virtual Try-On: A Survey","authors":"Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu","doi":"10.1007/s11263-024-02305-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02305-2","url":null,"abstract":"Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications and an absence of comprehensive overview of this field to accelerate the development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We additionally apply CLIP to assess the semantic alignment of try-on results, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, unresolved issues are highlighted and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"89 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142805365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Evaluation of Zero-Cost Proxies - from Neural Architecture Performance Prediction to Model Robustness 零成本代理的评估——从神经结构性能预测到模型鲁棒性

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-09 DOI: 10.1007/s11263-024-02265-7

Jovita Lukasik, Michael Moeller, Margret Keuper

Zero-cost proxies are nowadays frequently studied and used to search for neural architectures. They show an impressive ability to predict the performance of architectures by making use of their untrained weights. These techniques allow for immense search speed-ups. So far the joint search for well performing and robust architectures has received much less attention in the field of NAS. Therefore, the main focus of zero-cost proxies is the clean accuracy of architectures, whereas the model robustness should play an evenly important part. In this paper, we analyze the ability of common zero-cost proxies to serve as performance predictors for robustness in the popular NAS-Bench-201 search space. We are interested in the single prediction task for robustness and the joint multi-objective of clean and robust accuracy. We further analyze the feature importance of the proxies and show that predicting the robustness makes the prediction task from existing zero-cost proxies more challenging. As a result, the joint consideration of several proxies becomes necessary to predict a model’s robustness while the clean accuracy can be regressed from a single such feature. Our code is available at https://github.com/jovitalukasik/zcp_eval.

目前，人们经常研究并使用零成本代理来搜索神经结构。它们通过使用未训练的权重来预测体系结构的性能，表现出令人印象深刻的能力。这些技术可以极大地提高搜索速度。到目前为止，在NAS领域中，对性能良好且健壮的体系结构的联合搜索很少受到关注。因此，零成本代理的主要焦点是体系结构的干净准确性，而模型的鲁棒性应该发挥同等重要的作用。在本文中，我们分析了常见的零成本代理在流行的NAS-Bench-201搜索空间中作为鲁棒性性能预测指标的能力。我们感兴趣的是单一预测任务的鲁棒性和联合多目标的干净和鲁棒精度。我们进一步分析了代理的特征重要性，并表明预测鲁棒性使现有零成本代理的预测任务更具挑战性。因此，需要联合考虑多个代理来预测模型的鲁棒性，而干净的精度可以从单个这样的特征回归。我们的代码可在https://github.com/jovitalukasik/zcp_eval上获得。

{"title":"An Evaluation of Zero-Cost Proxies - from Neural Architecture Performance Prediction to Model Robustness","authors":"Jovita Lukasik, Michael Moeller, Margret Keuper","doi":"10.1007/s11263-024-02265-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02265-7","url":null,"abstract":"Zero-cost proxies are nowadays frequently studied and used to search for neural architectures. They show an impressive ability to predict the performance of architectures by making use of their untrained weights. These techniques allow for immense search speed-ups. So far the joint search for well performing and robust architectures has received much less attention in the field of NAS. Therefore, the main focus of zero-cost proxies is the clean accuracy of architectures, whereas the model robustness should play an evenly important part. In this paper, we analyze the ability of common zero-cost proxies to serve as performance predictors for robustness in the popular NAS-Bench-201 search space. We are interested in the single prediction task for robustness and the joint multi-objective of clean and robust accuracy. We further analyze the feature importance of the proxies and show that predicting the robustness makes the prediction task from existing zero-cost proxies more challenging. As a result, the joint consideration of several proxies becomes necessary to predict a model’s robustness while the clean accuracy can be regressed from a single such feature. Our code is available at https://github.com/jovitalukasik/zcp_eval.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"47 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Occlusion-Preserved Surveillance Video Synopsis with Flexible Object Graph 基于柔性目标图的遮挡保留监控视频摘要

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-09 DOI: 10.1007/s11263-024-02302-5

Yongwei Nie, Wei Ge, Siming Zeng, Qing Zhang, Guiqing Li, Ping Li, Hongmin Cai

Video synopsis is a technique that condenses a long surveillance video to a short summary. It faces challenges to process objects originally occluding each other in the source video. Previous approaches either treat occlusion objects as a single object, which however reduce compression ratio; or have to separate occlusion objects individually, but destroy interactions between them and yield visual artifacts. This paper presents a novel data structure called Flexible Object Graph (FOG) to handle original occlusions. Our FOG-based video synopsis approach can manipulate each object flexibly while preserving the original occlusions between them, achieving high synopsis ratio while maintaining interactions of objects. A challenging issue that comes with the introduction of FOG is that FOG may contain circulations that yield conflicts. We solve this problem by proposing a circulation conflict resolving algorithm. Furthermore, video synopsis methods usually minimize a multi-objective energy function. Previous approaches optimize the multiple objectives simultaneously which needs to strike a balance between them. Instead, we propose a stepwise optimization strategy consuming less running time while producing higher quality. Experiments demonstrate the effectiveness of our method.

视频摘要是一种将长监控视频压缩成短摘要的技术。它面临着处理源视频中原本相互遮挡的对象的挑战。先前的方法要么将遮挡对象作为单个对象处理，但降低了压缩比；或者必须单独分离遮挡对象，但破坏它们之间的相互作用并产生视觉伪影。本文提出了一种新的数据结构，称为柔性目标图（FOG）来处理原始遮挡。我们的基于fog的视频摘要方法可以灵活地操纵每个目标，同时保持它们之间的原始遮挡，在保持目标交互的同时实现高的摘要率。引入FOG带来的一个具有挑战性的问题是，FOG可能包含产生冲突的循环。我们提出了一种循环冲突解决算法来解决这个问题。此外，视频摘要方法通常最小化多目标能量函数。以往的方法是同时优化多个目标，需要在多个目标之间取得平衡。相反，我们提出了一种消耗更少运行时间而产生更高质量的逐步优化策略。实验证明了该方法的有效性。

{"title":"Occlusion-Preserved Surveillance Video Synopsis with Flexible Object Graph","authors":"Yongwei Nie, Wei Ge, Siming Zeng, Qing Zhang, Guiqing Li, Ping Li, Hongmin Cai","doi":"10.1007/s11263-024-02302-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02302-5","url":null,"abstract":"Video synopsis is a technique that condenses a long surveillance video to a short summary. It faces challenges to process objects originally occluding each other in the source video. Previous approaches either treat occlusion objects as a single object, which however reduce compression ratio; or have to separate occlusion objects individually, but destroy interactions between them and yield visual artifacts. This paper presents a novel data structure called Flexible Object Graph (FOG) to handle original occlusions. Our FOG-based video synopsis approach can manipulate each object flexibly while preserving the original occlusions between them, achieving high synopsis ratio while maintaining interactions of objects. A challenging issue that comes with the introduction of FOG is that FOG may contain circulations that yield conflicts. We solve this problem by proposing a circulation conflict resolving algorithm. Furthermore, video synopsis methods usually minimize a multi-objective energy function. Previous approaches optimize the multiple objectives simultaneously which needs to strike a balance between them. Instead, we propose a stepwise optimization strategy consuming less running time while producing higher quality. Experiments demonstrate the effectiveness of our method.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"212 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Object Pose Estimation Based on Multi-precision Vectors and Seg-Driven PnP 基于多精度矢量和分段驱动PnP的目标姿态估计

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-07 DOI: 10.1007/s11263-024-02317-y

Yulin Wang, Hongli Li, Chen Luo

Object pose estimation based on a single RGB image has wide application potential but is difficult to achieve. Existing pose estimation involves various inference pipelines. One popular pipeline is to first use Convolutional Neural Networks (CNN) to predict 2D projections of 3D keypoints in a single RGB image and then calculate the 6D pose via a Perspective-n-Point (PnP) solver. Due to the gap between synthetic data and real data, the model trained on synthetic data has difficulty predicting the 6D pose accurately when applied to real data. To address the acute problem, we propose a two-stage pipeline of object pose estimation based upon multi-precision vectors and segmentation-driven (Seg-Driven) PnP. In keypoint localization stage, we first develop a CNN-based three-branch network to predict multi-precision 2D vectors pointing to 2D keypoints. Then we introduce an accurate and fast Keypoint Voting scheme of Multi-precision vectors (KVM), which computes low-precision 2D keypoints using low-precision vectors and refines 2D keypoints on mid- and high-precision vectors. In the pose calculation stage, we propose Seg-Driven PnP to refine the 3D Translation of poses and get the optimal pose by minimizing the non-overlapping area between segmented and rendered masks. The Seg-Driven PnP leverages 2D segmentation trained on real images to improve the accuracy of pose estimation trained on synthetic data, thereby reducing the synthetic-to-real gap. Extensive experiments show our approach materially outperforms state-of-the-art methods on LM and HB datasets. Importantly, our proposed method works reasonably well for weakly textured and occluded objects in diverse scenes.

基于单幅RGB图像的目标姿态估计具有广泛的应用潜力，但实现难度较大。现有的姿态估计涉及各种推理管道。一种流行的方法是首先使用卷积神经网络（CNN）来预测单个RGB图像中3D关键点的2D投影，然后通过Perspective-n-Point （PnP）求解器计算6D姿态。由于合成数据与真实数据之间的差距，使用合成数据训练的模型在应用于真实数据时难以准确预测6D位姿。为了解决这个尖锐的问题，我们提出了一种基于多精度向量和分割驱动（segdriven） PnP的两阶段目标姿态估计管道。在关键点定位阶段，我们首先开发了一种基于cnn的三分支网络来预测指向二维关键点的多精度二维向量。在此基础上，提出了一种精确、快速的多精度矢量关键点投票方案，该方案利用低精度矢量计算低精度二维关键点，并在中高精度矢量上细化二维关键点。在姿态计算阶段，我们提出了分段驱动的PnP算法来细化姿态的三维平移，并通过最小化分割和渲染蒙版之间的非重叠区域来获得最优姿态。Seg-Driven PnP利用在真实图像上训练的2D分割来提高在合成数据上训练的姿态估计的准确性，从而减少合成与真实的差距。大量的实验表明，我们的方法在LM和HB数据集上明显优于最先进的方法。重要的是，我们提出的方法对于不同场景中的弱纹理和遮挡物体都能很好地工作。

{"title":"Object Pose Estimation Based on Multi-precision Vectors and Seg-Driven PnP","authors":"Yulin Wang, Hongli Li, Chen Luo","doi":"10.1007/s11263-024-02317-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02317-y","url":null,"abstract":"Object pose estimation based on a single RGB image has wide application potential but is difficult to achieve. Existing pose estimation involves various inference pipelines. One popular pipeline is to first use Convolutional Neural Networks (CNN) to predict 2D projections of 3D keypoints in a single RGB image and then calculate the 6D pose via a Perspective-n-Point (PnP) solver. Due to the gap between synthetic data and real data, the model trained on synthetic data has difficulty predicting the 6D pose accurately when applied to real data. To address the acute problem, we propose a two-stage pipeline of object pose estimation based upon multi-precision vectors and segmentation-driven (Seg-Driven) PnP. In keypoint localization stage, we first develop a CNN-based three-branch network to predict multi-precision 2D vectors pointing to 2D keypoints. Then we introduce an accurate and fast Keypoint Voting scheme of Multi-precision vectors (KVM), which computes low-precision 2D keypoints using low-precision vectors and refines 2D keypoints on mid- and high-precision vectors. In the pose calculation stage, we propose Seg-Driven PnP to refine the 3D Translation of poses and get the optimal pose by minimizing the non-overlapping area between segmented and rendered masks. The Seg-Driven PnP leverages 2D segmentation trained on real images to improve the accuracy of pose estimation trained on synthetic data, thereby reducing the synthetic-to-real gap. Extensive experiments show our approach materially outperforms state-of-the-art methods on LM and HB datasets. Importantly, our proposed method works reasonably well for weakly textured and occluded objects in diverse scenes.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142788543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks 模态缺失的rbt跟踪：可逆提示学习和高质量基准

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-07 DOI: 10.1007/s11263-024-02311-4

Andong Lu, Chenglong Li, Jiacong Zhao, Jin Tang, Bin Luo

Current RGBT tracking research relies on the complete multi-modality input, but modal information might miss due to some factors such as thermal sensor self-calibration and data transmission error, called modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios, for robust RGBT tracking. Given one modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality to adapt to RGBT tracking model. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design the invertible prompter by incorporating the full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.

目前的RGBT跟踪研究依赖于完整的多模态输入，但由于热传感器自校准和数据传输误差等因素，模态信息可能会丢失，本工作称之为模态缺失挑战。为了解决这一挑战，我们提出了一种新的可逆提示学习方法，该方法将内容保留提示集成到训练有素的跟踪模型中，以适应各种模态缺失场景，实现鲁棒性rbt跟踪。针对某一模态缺失场景，我们提出利用可用模态生成缺失模态的提示，以适应RGBT跟踪模型。然而，在提示语生成过程中，可用模态和缺失模态之间的跨模态差距往往会导致语义失真和信息丢失。为了解决这个问题，我们设计了可逆提示符，通过从生成的提示符中整合输入可用模态的完整重构。为了提供一个全面的评估平台，我们构建了几个高质量的基准数据集，其中考虑了各种模态缺失场景来模拟现实世界的挑战。在三个模态缺失的基准数据集上进行的大量实验表明，与最先进的方法相比，我们的方法实现了显着的性能改进。我们已经在https://github.com/mmic-lcl上发布了代码和模拟数据集。

{"title":"Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks","authors":"Andong Lu, Chenglong Li, Jiacong Zhao, Jin Tang, Bin Luo","doi":"10.1007/s11263-024-02311-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02311-4","url":null,"abstract":"Current RGBT tracking research relies on the complete multi-modality input, but modal information might miss due to some factors such as thermal sensor self-calibration and data transmission error, called modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios, for robust RGBT tracking. Given one modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality to adapt to RGBT tracking model. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design the invertible prompter by incorporating the full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142788758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering CLIP-Powered TASS：目标感知的视听问答单流网络

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-05 DOI: 10.1007/s11263-024-02289-z

Yuanyuan Jiang, Jianqin Yin

While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and mistreated audio and video as separate entities in a dual-stream framework as most AVQA methods. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA using the pretrained knowledge of the CLIP model through the audio-visual matching characteristic of nature. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, TSG+ module transfers the image-text matching knowledge from CLIP models to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Besides, we propose a simple yet effective preprocessing strategy to optimize accuracy-efficiency trade-offs. Extensive experiments conducted on the MUSIC-AVQA benchmark verified the effectiveness of our proposed method over existing state-of-the-art methods. The code is available at https://github.com/Bravo5542/CLIP-TASS.

虽然视觉语言预训练模型（VLMs）在各种多模态理解任务中表现出色，但它们在细粒度视听推理，特别是视听问答（AVQA）方面的潜力仍未得到充分开发。AVQA对vlm提出了特殊的挑战，因为它需要在区域级别上进行视觉理解，并与音频模式无缝集成。以前基于vmm的AVQA方法仅仅使用CLIP作为特征编码器，但未充分利用其知识，并且像大多数AVQA方法一样，将音频和视频作为双流框架中的独立实体。本文通过自然的视听匹配特性，利用CLIP模型的预训练知识，提出了一种新的基于CLIP的目标感知单流（TASS）网络。它由两个关键组件组成：目标感知空间接地模块（TSG+）和单流联合时间接地模块（JTG）。具体来说，TSG+模块将CLIP模型的图像-文本匹配知识转移到所需的区域-文本匹配过程中，而不需要相应的真值标签。此外，与之前仍然需要额外的视听融合模块的独立双流网络不同，JTG在简化的单流架构中统一了视听融合和问题感知时间基础。它将音频和视频视为一个内聚的实体，并通过保留我们提出的跨模态同步（CMS）损失的时间相关性，进一步将图像-文本匹配知识扩展到音频-文本匹配。此外，我们提出了一种简单而有效的预处理策略来优化精度和效率之间的权衡。在MUSIC-AVQA基准上进行的大量实验验证了我们提出的方法比现有的最先进方法的有效性。代码可在https://github.com/Bravo5542/CLIP-TASS上获得。

{"title":"CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering","authors":"Yuanyuan Jiang, Jianqin Yin","doi":"10.1007/s11263-024-02289-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02289-z","url":null,"abstract":"While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and mistreated audio and video as separate entities in a dual-stream framework as most AVQA methods. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA using the pretrained knowledge of the CLIP model through the audio-visual matching characteristic of nature. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, TSG+ module transfers the image-text matching knowledge from CLIP models to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Besides, we propose a simple yet effective preprocessing strategy to optimize accuracy-efficiency trade-offs. Extensive experiments conducted on the MUSIC-AVQA benchmark verified the effectiveness of our proposed method over existing state-of-the-art methods. The code is available at https://github.com/Bravo5542/CLIP-TASS.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"67 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142776602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0