Robust Deep Object Tracking against Adversarial Attacks
Pub Date: 2024-09-26 | DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang
Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attacks and defenses mainly focus on single images, few efforts have been made to mount temporal attacks against video sequences. Because the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well on deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unknown target trackers to achieve transferable attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of a video, improving attack effectiveness at minor computational cost. On the defense side, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly, with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attack.
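As a rough illustration of the frame-by-frame perturbation idea, the sketch below runs a projected gradient-ascent attack per frame and carries each frame's perturbation over to the next one. The tracker_loss callable, the L_inf budget, and the carry-over scheme are illustrative assumptions, not the authors' exact motion-aware objective.

    import torch

    def temporal_attack(frames, tracker_loss, epsilon=8 / 255, alpha=2 / 255, steps=5):
        # frames: list of (3, H, W) tensors in [0, 1]; tracker_loss is a hypothetical
        # callable mapping a perturbed frame to a scalar whose increase degrades tracking.
        delta = torch.zeros_like(frames[0])  # perturbation carried across frames
        adv_frames = []
        for x in frames:
            for _ in range(steps):
                delta = delta.detach().requires_grad_(True)
                loss = tracker_loss((x + delta).clamp(0, 1))
                grad, = torch.autograd.grad(loss, delta)
                # Gradient ascent on the tracking loss, projected onto an L_inf ball.
                delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
            adv_frames.append((x + delta).detach().clamp(0, 1))
        return adv_frames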
{"title":"Robust Deep Object Tracking against Adversarial Attacks","authors":"Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang","doi":"10.1007/s11263-024-02226-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02226-0","url":null,"abstract":"<p>Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly reside in a single image, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame-by-frame. For the white-box attack, we generate temporal perturbations via known trackers to degrade significantly the tracking performance. We transfer the generated perturbations into unknown targeted trackers for the black-box attack to achieve transferring attacks. Furthermore, we train universal adversarial perturbations and directly add them into all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Breaking the Limits of Reliable Prediction via Generated Data
Pub Date: 2024-09-20 | DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu
In open-world recognition for safety-critical applications, providing reliable predictions with deep neural networks has become a critical requirement. Many methods have been proposed for reliability-related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective ways to improve reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of recent breakthroughs in generative models, we investigate whether and how expanding the training set with generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a sufficiently large volume of generated data, can sometimes achieve performance competitive with a ViT pre-trained on a substantial real dataset.
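The training-set expansion itself can be sketched in a few lines, assuming PyTorch-style real_train and generated_train datasets that share a label space (the names are placeholders, not the paper's code):

    from torch.utils.data import ConcatDataset, DataLoader

    def build_expanded_loader(real_train, generated_train, batch_size=128):
        # Concatenate real and generated samples; the paper's observation is that
        # scaling up the generated portion reduces overfitting in reliable prediction.
        expanded = ConcatDataset([real_train, generated_train])
        return DataLoader(expanded, batch_size=batch_size, shuffle=True, num_workers=4)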
{"title":"Breaking the Limits of Reliable Prediction via Generated Data","authors":"Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu","doi":"10.1007/s11263-024-02221-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02221-5","url":null,"abstract":"<p>In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"18 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation, as they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects to be localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting; FastComposer therefore proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
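Delayed subject conditioning can be pictured as switching the conditioning embedding partway through denoising. The sketch below assumes a diffusers-style UNet/scheduler interface and a precomputed subject-augmented prompt embedding; it only illustrates the timing, not FastComposer's implementation.

    import torch

    @torch.no_grad()
    def delayed_subject_conditioning(unet, scheduler, latents, text_emb, aug_text_emb,
                                     switch_ratio=0.2):
        # text_emb: plain prompt embedding; aug_text_emb: prompt embedding whose subject
        # tokens are replaced by image-encoder subject embeddings (assumed precomputed).
        # Assumes scheduler.set_timesteps(...) has already been called.
        timesteps = scheduler.timesteps
        switch_point = int(len(timesteps) * switch_ratio)
        for i, t in enumerate(timesteps):
            # Early steps: text-only conditioning preserves layout editability;
            # later steps: subject-augmented conditioning injects identity.
            cond = text_emb if i < switch_point else aug_text_emb
            noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        return latents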
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02227-z","url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>(times )</span>–2500<span>(times )</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lidar Panoptic Segmentation in an Open World
Pub Date: 2024-09-19 | DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep
Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed in the real open world; in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered unknown w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior works train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point-segment classification, akin to a Region Proposal Network (Ren et al., NeurIPS 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
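The final step, cutting a scored hierarchy of point segments, can be sketched as a simple recursion over the tree: keep a node as one segment if its (hypothetical) objectness score beats the best cut of its children. This illustrates the idea only; the paper's cut operates on a weighted tree produced by hierarchical grouping.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SegmentNode:
        # A node in the hierarchical over-segmentation; `score` is a stand-in for the
        # binary "is a single object" confidence of this segment.
        score: float
        children: List["SegmentNode"] = field(default_factory=list)

    def best_cut(node: SegmentNode) -> Tuple[float, List[SegmentNode]]:
        # Either keep this node as one segment, or recurse and take the best cut
        # of its children, whichever scores higher.
        if not node.children:
            return node.score, [node]
        child_total, child_segments = 0.0, []
        for child in node.children:
            score, segments = best_cut(child)
            child_total += score
            child_segments += segments
        if node.score >= child_total:
            return node.score, [node]
        return child_total, child_segments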
{"title":"Lidar Panoptic Segmentation in an Open World","authors":"Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep","doi":"10.1007/s11263-024-02166-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02166-9","url":null,"abstract":"<p>Addressing Lidar Panoptic Segmentation (<i>LPS</i>) is crucial for safe deployment of autnomous vehicles. <i>LPS</i> aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including <span>thing</span> classes of countable objects (e.g., pedestrians and vehicles) and <span>stuff</span> classes of amorphous regions (e.g., vegetation and road). Importantly, <i>LPS</i> requires segmenting individual <span>thing</span> instances (<i>e.g</i>., every single vehicle). Current <i>LPS</i> methods make an unrealistic assumption that the semantic class vocabulary is <i>fixed</i> in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of <i>novel</i> classes that are considered to be unknowns w.r.t. thepre-defined class vocabulary. To address this unrealistic assumption, we study <i>LPS</i> in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of <span>thing</span> and <span>stuff</span> classes can appear. This experimental setting leads to interesting conclusions. While prior art train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (<i>i.e</i>., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"13 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Active Learning for Low-Altitude Drone-View Object Detection
Pub Date: 2024-09-15 | DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan
Various object detection techniques are employed on drone platforms. However, annotating drone-view samples is both time-consuming and laborious, primarily because drone-view images contain numerous small-sized instances to be labeled. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image informativeness. The point-level module ascertains the valid count and locations of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the four hierarchical levels effectively improve the evaluation of sample informativeness.
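A compact sketch of the hierarchical filtering idea follows, with assumed data structures (per-box class, confidence, and a consistency score standing in for the point/box-level indicators); it is not the authors' implementation.

    from collections import defaultdict

    def hald_style_select(unlabeled_preds, labeled_class_counts, budget, score_thresh=0.5):
        # unlabeled_preds: dict image_id -> list of (class_id, score, consistency).
        # labeled_class_counts: dict class_id -> number of already-labeled instances.
        image_scores = {}
        image_classes = defaultdict(set)
        for img_id, boxes in unlabeled_preds.items():
            # Box level: keep reliable predictions only.
            valid = [(c, s, cons) for c, s, cons in boxes if s >= score_thresh]
            if not valid:
                continue
            # Image level: average consistency of the valid boxes.
            image_scores[img_id] = sum(cons for _, _, cons in valid) / len(valid)
            image_classes[img_id] = {c for c, _, _ in valid}
        # Class level: boost images containing classes rarely seen in the labeled pool.
        def rarity(img_id):
            return sum(1.0 / (1 + labeled_class_counts.get(c, 0)) for c in image_classes[img_id])
        ranked = sorted(image_scores, key=lambda i: image_scores[i] + rarity(i), reverse=True)
        return ranked[:budget]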
{"title":"Hierarchical Active Learning for Low-Altitude Drone-View Object Detection","authors":"Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan","doi":"10.1007/s11263-024-02228-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02228-y","url":null,"abstract":"<p>Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final selected samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In Search of Lost Online Test-Time Adaptation: A Survey
Pub Date: 2024-09-15 | DOI: 10.1007/s11263-024-02213-5
Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang
This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes variations gathered from different search engines as well as synthetic data generated by diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall-clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.
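Measuring the efficiency side of such a benchmark can be sketched as timing one adaptation step and reading the peak GPU memory; adapt_fn is a placeholder for any OTTA update, a CUDA device is assumed, and GFLOPs would need a separate profiler (e.g., fvcore), which is omitted here.

    import time
    import torch

    def measure_adaptation_step(adapt_fn, batch):
        # Returns predictions, wall-clock seconds, and peak GPU memory (MB) for one step.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        preds = adapt_fn(batch)          # hypothetical: adapts the model and predicts
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        peak_mem_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
        return preds, elapsed, peak_mem_mb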
{"title":"In Search of Lost Online Test-Time Adaptation: A Survey","authors":"Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang","doi":"10.1007/s11263-024-02213-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02213-5","url":null,"abstract":"<p>This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes a variety of variations from different search engines and synthesized data generated through diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation
Pub Date: 2024-09-05 | DOI: 10.1007/s11263-024-02224-2
Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han
Contrastive Language-Image Pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce reliance on pixel-level human annotations, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) into high-quality pseudo masks, but it heavily relies on inductive biases such as hand-crafted priors and digital image processing methods. Building on the vision-language pre-trained model CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) insufficient detail owing to the 1/16 down-sampling resolution of ViT. Thus, we propose WeakCLIP to address these problems and leverage the pre-trained knowledge of CLIP for WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representations. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and a text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
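The text-to-pixel matching idea can be illustrated by a cosine-similarity map between CLIP-like text embeddings and dense visual features; this is a generic sketch, not WeakCLIP's co-attention module or pyramid adapter.

    import torch
    import torch.nn.functional as F

    def text_to_pixel_similarity(pixel_feats, text_feats):
        # pixel_feats: (B, C, H, W) dense features from a CLIP-like image encoder.
        # text_feats:  (K, C) class text embeddings from the CLIP text encoder.
        # Returns a (B, K, H, W) cosine-similarity map usable as coarse pseudo-mask evidence.
        B, C, H, W = pixel_feats.shape
        pix = F.normalize(pixel_feats.flatten(2), dim=1)      # (B, C, H*W)
        txt = F.normalize(text_feats, dim=1)                  # (K, C)
        sim = torch.einsum("kc,bcn->bkn", txt, pix)           # (B, K, H*W)
        return sim.view(B, -1, H, W)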
{"title":"WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation","authors":"Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han","doi":"10.1007/s11263-024-02224-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02224-2","url":null,"abstract":"<p>Contrastive language and image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce the reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) and produce high-quality pseudo masks. Weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) as pseudo masks, but heavily relies on inductive biases like hand-crafted priors and digital image processing methods. For the vision-language pre-trained model, i.e. CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) lacking text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) the insufficient details owning to the <span>(frac{1}{16})</span> down-sampling resolution of ViT. Thus, we propose WeakCLIP to address the problems and leverage the pre-trained knowledge from CLIP to WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representation. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the <i>val</i> set of PASCAL VOC 2012 and 46.1% mIoU on the <i>val</i> set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continual Face Forgery Detection via Historical Distribution Preserving
Pub Date: 2024-09-04 | DOI: 10.1007/s11263-024-02160-1
Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.
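A minimal sketch of combining a UAP-simulated historical sample with knowledge distillation from the frozen previous model follows; the loss weighting, temperature, and logit layout are assumptions, not the paper's exact HDP objective.

    import torch
    import torch.nn.functional as F

    def hdp_style_loss(student, teacher, x, y, uap, alpha=1.0, temperature=2.0):
        # uap: fixed universal adversarial perturbation standing in for the historical
        # forgery distribution; teacher: frozen model from the previous task.
        x_hist = (x + uap).clamp(0, 1)            # pseudo samples from the old distribution
        ce = F.cross_entropy(student(x), y)       # learn the new forgery attack
        with torch.no_grad():
            t_logits = teacher(x_hist)
        s_logits = student(x_hist)
        # KL distillation keeps the student's behaviour on historical-like data close
        # to the teacher's, mitigating forgetting of earlier forgery types.
        kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                      F.softmax(t_logits / temperature, dim=1),
                      reduction="batchmean") * temperature ** 2
        return ce + alpha * kd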
{"title":"Continual Face Forgery Detection via Historical Distribution Preserving","authors":"Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji","doi":"10.1007/s11263-024-02160-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02160-1","url":null,"abstract":"<p>Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142131051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Fuzzy Positive Learning for Annotation-Scarce Semantic Segmentation
Pub Date: 2024-09-02 | DOI: 10.1007/s11263-024-02217-1
Pengchong Qiao, Yu Wang, Chang Liu, Lei Shang, Baigui Sun, Zhennan Wang, Xiawu Zheng, Rongrong Ji, Jie Chen
Annotation-scarce semantic segmentation aims to obtain meaningful pixel-level discrimination with scarce or even no manual annotations, and the crux is how to utilize unlabeled data via pseudo-label learning. Typical works focus on ameliorating error-prone pseudo-labeling, e.g., only utilizing high-confidence pseudo labels and filtering low-confidence ones out. We think differently and instead exploit the informative semantics carried by multiple probably correct candidate labels, which enables our method to learn more accurately even when pseudo labels are unreliable. In this paper, we propose Adaptive Fuzzy Positive Learning (A-FPL) for correctly learning from unlabeled data in a plug-and-play fashion, adaptively encouraging fuzzy positive predictions and suppressing highly probable negatives. Specifically, A-FPL comprises two main components: (1) fuzzy positive assignment (FPA), which adaptively assigns fuzzy positive labels to each pixel while ensuring their quality through a T-value adaptation algorithm; and (2) fuzzy positive regularization (FPR), which restricts the predictions of fuzzy positive categories to be larger than those of negative categories. Being conceptually simple yet practically effective, A-FPL remarkably alleviates interference from wrong pseudo labels, progressively refining semantic discrimination. Theoretical analysis and extensive experiments on various training settings with consistent performance gains justify the superiority of our approach. Codes are at A-FPL.
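The FPR constraint (fuzzy positive scores should exceed negative ones) can be sketched as a per-pixel hinge penalty; the margin form and the mask layout are illustrative assumptions rather than A-FPL's exact loss or T-value adaptation.

    import torch

    def fuzzy_positive_regularization(logits, fuzzy_pos_mask, margin=0.0):
        # logits:         (N, K) per-pixel class logits.
        # fuzzy_pos_mask: (N, K) boolean mask of fuzzy positive categories per pixel.
        pos_min = logits.masked_fill(~fuzzy_pos_mask, float("inf")).min(dim=1).values
        neg_max = logits.masked_fill(fuzzy_pos_mask, float("-inf")).max(dim=1).values
        # Penalize pixels where the strongest negative rivals the weakest fuzzy positive.
        return torch.relu(neg_max - pos_min + margin).mean()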
{"title":"Adaptive Fuzzy Positive Learning for Annotation-Scarce Semantic Segmentation","authors":"Pengchong Qiao, Yu Wang, Chang Liu, Lei Shang, Baigui Sun, Zhennan Wang, Xiawu Zheng, Rongrong Ji, Jie Chen","doi":"10.1007/s11263-024-02217-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02217-1","url":null,"abstract":"<p>Annotation-scarce semantic segmentation aims to obtain meaningful pixel-level discrimination with scarce or even no manual annotations, of which the crux is how to utilize unlabeled data by pseudo-label learning. Typical works focus on ameliorating the error-prone pseudo-labeling, e.g., only utilizing high-confidence pseudo labels and filtering low-confidence ones out. But we think differently and resort to exhausting informative semantics from multiple probably correct candidate labels. This brings our method the ability to learn more accurately even though pseudo labels are unreliable. In this paper, we propose Adaptive Fuzzy Positive Learning (A-FPL) for correctly learning unlabeled data in a plug-and-play fashion, targeting adaptively encouraging fuzzy positive predictions and suppressing highly probable negatives. Specifically, A-FPL comprises two main components: (1) Fuzzy positive assignment (FPA) that adaptively assigns fuzzy positive labels to each pixel, while ensuring their quality through a T-value adaption algorithm (2) Fuzzy positive regularization (FPR) that restricts the predictions of fuzzy positive categories to be larger than those of negative categories. Being conceptually simple yet practically effective, A-FPL remarkably alleviates interference from wrong pseudo labels, progressively refining semantic discrimination. Theoretical analysis and extensive experiments on various training settings with consistent performance gain justify the superiority of our approach. Codes are at A-FPL.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142123587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
Pub Date: 2024-08-31 | DOI: 10.1007/s11263-024-02218-0
Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. Contrary to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of PTM to prototype features can beat state-of-the-art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of PTM and adapted models for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, which holds the advantages of PTM’s generalizability and adapted model’s adaptivity. (3) Additionally, considering previous ImageNet-based benchmarks are unsuitable in the era of PTM due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.
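The SimpleCIL baseline described in point (1) can be sketched as building a prototype (class-mean) classifier on top of a frozen encoder and classifying by cosine similarity; the encoder and loader interfaces are assumptions, not the released code.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_prototype_classifier(encoder, loader, num_classes, feat_dim, device="cpu"):
        # The classifier weight for each class is simply the mean frozen-PTM embedding
        # of that class's training samples.
        sums = torch.zeros(num_classes, feat_dim, device=device)
        counts = torch.zeros(num_classes, device=device)
        for images, labels in loader:
            labels = labels.to(device)
            feats = encoder(images.to(device))          # (B, feat_dim), frozen PTM
            sums.index_add_(0, labels, feats)
            counts.index_add_(0, labels, torch.ones_like(labels, dtype=sums.dtype))
        prototypes = sums / counts.clamp(min=1).unsqueeze(1)
        return F.normalize(prototypes, dim=1)

    @torch.no_grad()
    def classify(encoder, images, prototypes):
        # Nearest-prototype prediction via cosine similarity.
        feats = F.normalize(encoder(images), dim=1)
        return (feats @ prototypes.t()).argmax(dim=1)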
{"title":"Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need","authors":"Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu","doi":"10.1007/s11263-024-02218-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02218-0","url":null,"abstract":"<p>Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. Contrary to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of PTM to prototype features can beat state-of-the-art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (<span>Aper</span>), which aggregates the embeddings of PTM and adapted models for classifier construction. <span>Aper </span>is a general framework that can be orthogonally combined with any parameter-efficient tuning method, which holds the advantages of PTM’s generalizability and adapted model’s adaptivity. (3) Additionally, considering previous ImageNet-based benchmarks are unsuitable in the era of PTM due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of <span>Aper </span>with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}