Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks
Pub Date : 2024-12-07 DOI: 10.1007/s11263-024-02311-4
Andong Lu, Chenglong Li, Jiacong Zhao, Jin Tang, Bin Luo
Current RGBT tracking research relies on complete multi-modality input, but modal information may be missing due to factors such as thermal sensor self-calibration and data transmission errors, which we call the modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach for robust RGBT tracking, which integrates content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios. Given a modality-missing scenario, we propose to utilize the available modality to generate a prompt for the missing modality so that the RGBT tracking model can adapt to it. However, the cross-modality gap between the available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design an invertible prompter that incorporates full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.
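The abstract does not detail the internal design of the invertible prompter; the minimal sketch below, assuming an affine-coupling (normalizing-flow-style) block, only illustrates the core idea that a prompt generated from the available modality can be exactly inverted to reconstruct the input features. All class and variable names are hypothetical.

```python
# Minimal sketch (not the authors' implementation): an invertible prompter built from
# affine coupling layers, so the prompt generated from the available modality can be
# exactly inverted back to the input features ("full reconstruction").
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(log_s) + t          # transform one half conditioned on the other
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-log_s)       # exact inverse -> lossless reconstruction
        return torch.cat([y1, x2], dim=-1)

class InvertiblePrompter(nn.Module):
    """Generates a prompt for the missing modality from available-modality features."""
    def __init__(self, dim, depth=4):
        super().__init__()
        # A real flow would permute/alternate the two halves between blocks.
        self.blocks = nn.ModuleList([AffineCoupling(dim) for _ in range(depth)])

    def forward(self, feat_available):
        prompt = feat_available
        for blk in self.blocks:
            prompt = blk(prompt)
        return prompt

    def reconstruct(self, prompt):
        feat = prompt
        for blk in reversed(self.blocks):
            feat = blk.inverse(feat)
        return feat

feat_rgb = torch.randn(2, 196, 256)             # available RGB tokens (B, N, C)
prompter = InvertiblePrompter(dim=256)
thermal_prompt = prompter(feat_rgb)             # prompt fed to the well-trained tracker
assert torch.allclose(prompter.reconstruct(thermal_prompt), feat_rgb, atol=1e-4)
```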
{"title":"Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality Benchmarks","authors":"Andong Lu, Chenglong Li, Jiacong Zhao, Jin Tang, Bin Luo","doi":"10.1007/s11263-024-02311-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02311-4","url":null,"abstract":"<p>Current RGBT tracking research relies on the complete multi-modality input, but modal information might miss due to some factors such as thermal sensor self-calibration and data transmission error, called modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios, for robust RGBT tracking. Given one modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality to adapt to RGBT tracking model. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design the invertible prompter by incorporating the full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/mmic-lcl.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142788758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Pub Date : 2024-12-05 DOI: 10.1007/s11263-024-02289-z
Yuanyuan Jiang, Jianqin Yin
While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs because it requires region-level visual understanding and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the pretrained knowledge of the CLIP model through the natural audio-visual matching characteristic. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, the TSG+ module transfers image-text matching knowledge from CLIP models to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. In addition, we propose a simple yet effective preprocessing strategy to optimize accuracy-efficiency trade-offs. Extensive experiments conducted on the MUSIC-AVQA benchmark verify the effectiveness of the proposed method over existing state-of-the-art methods. The code is available at https://github.com/Bravo5542/CLIP-TASS.
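The abstract does not give the exact form of the CMS loss; the sketch below is one plausible, illustrative interpretation (an assumption, not the paper's formulation): the temporal profile of audio-text matching is encouraged to follow the CLIP-grounded visual-text matching profile, thereby extending image-text matching knowledge to audio-text matching.

```python
# Illustrative synchrony loss, assuming per-frame visual and audio features already
# projected into the CLIP text-embedding space; all names are hypothetical.
import torch
import torch.nn.functional as F

def cms_loss(visual_feats, audio_feats, text_feat, tau=0.07):
    """visual_feats, audio_feats: (B, T, C); text_feat: (B, C) question embedding."""
    v = F.normalize(visual_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1).unsqueeze(1)        # (B, 1, C)

    sim_v = (v * t).sum(-1) / tau                           # per-frame visual-text matching (B, T)
    sim_a = (a * t).sum(-1) / tau                           # per-frame audio-text matching (B, T)

    # Synchrony: the temporal distribution of audio-text matching should follow the
    # visual-text matching distribution, preserving their temporal correlation.
    p_v = F.softmax(sim_v, dim=-1)
    log_p_a = F.log_softmax(sim_a, dim=-1)
    return F.kl_div(log_p_a, p_v, reduction="batchmean")

loss = cms_loss(torch.randn(4, 60, 512), torch.randn(4, 60, 512), torch.randn(4, 512))
```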
{"title":"CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering","authors":"Yuanyuan Jiang, Jianqin Yin","doi":"10.1007/s11263-024-02289-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02289-z","url":null,"abstract":"<p>While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and mistreated audio and video as separate entities in a dual-stream framework as most AVQA methods. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA using the pretrained knowledge of the CLIP model through the audio-visual matching characteristic of nature. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, TSG+ module transfers the image-text matching knowledge from CLIP models to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Besides, we propose a simple yet effective preprocessing strategy to optimize accuracy-efficiency trade-offs. Extensive experiments conducted on the MUSIC-AVQA benchmark verified the effectiveness of our proposed method over existing state-of-the-art methods. The code is available at https://github.com/Bravo5542/CLIP-TASS.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"67 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142776602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instance-dependent Label Distribution Estimation for Learning with Label Noise
Pub Date : 2024-12-02 DOI: 10.1007/s11263-024-02299-x
Zehui Liao, Shishuai Hu, Yutong Xie, Yong Xia
Noise transition matrix estimation is a promising approach for learning with label noise. It can infer clean posterior probabilities, known as the Label Distribution (LD), from noisy ones and reduce the impact of noisy labels. However, this estimation is challenging, since ground-truth labels are not always available. Most existing methods estimate a global noise transition matrix using either correctly labeled samples (anchor points) or detected reliable samples (pseudo anchor points). These methods rely heavily on the existence of anchor points or the quality of pseudo anchor points, and a global noise transition matrix can hardly provide accurate label transition information for each sample, since label noise in real applications is mostly instance-dependent. To address these challenges, we propose an Instance-dependent Label Distribution Estimation (ILDE) method to learn from noisy labels for image classification. The method’s workflow has three major steps. First, we estimate each sample’s noisy posterior probability, supervised by noisy labels. Second, since mislabeling probability closely correlates with inter-class correlation, we compute the inter-class correlation matrix to estimate the noise transition matrix, bypassing the need for (pseudo) anchor points. Moreover, for a precise approximation of the instance-dependent noise transition matrix, we calculate the inter-class correlation matrix using only mini-batch samples rather than the entire training dataset. Third, we transform the noisy posterior probability into an instance-dependent LD by multiplying it with the estimated noise transition matrix, using the resulting LD as enhanced supervision to prevent deep convolutional neural networks (DCNNs) from memorizing noisy labels. The proposed ILDE method has been evaluated against several state-of-the-art methods on two synthetic and three real-world noisy datasets. Our results indicate that the proposed ILDE method outperforms all competing methods, regardless of whether the noise is synthetic or real.
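As a rough illustration of the three steps above, the sketch below estimates a transition matrix from the inter-class correlation of noisy posteriors within a mini-batch and multiplies it with the noisy posterior to obtain the label distribution. The exact construction of the correlation matrix is not specified in the abstract, so the version here is an assumption, and all function names are hypothetical.

```python
# Minimal sketch of the three ILDE steps, with a simplified (assumed) inter-class
# correlation construction; the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def instance_label_distribution(logits):
    """logits: (B, K) mini-batch outputs of a network supervised by noisy labels."""
    # Step 1: noisy posterior probabilities.
    noisy_post = F.softmax(logits, dim=-1)                       # (B, K)

    # Step 2: inter-class correlation over the mini-batch -> row-normalized transition matrix.
    z = F.normalize(noisy_post - noisy_post.mean(0), dim=0)      # center and normalize per class
    corr = (z.t() @ z).clamp(min=0)                              # (K, K), keep positive correlation
    corr.fill_diagonal_(1.0)                                     # a class always maps to itself
    T = corr / corr.sum(dim=1, keepdim=True)                     # rows sum to 1

    # Step 3: label distribution = noisy posterior transformed by the transition matrix.
    ld = noisy_post @ T                                          # (B, K)
    return ld / ld.sum(dim=1, keepdim=True)

ld = instance_label_distribution(torch.randn(32, 10))            # used as soft supervision
```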
{"title":"Instance-dependent Label Distribution Estimation for Learning with Label Noise","authors":"Zehui Liao, Shishuai Hu, Yutong Xie, Yong Xia","doi":"10.1007/s11263-024-02299-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02299-x","url":null,"abstract":"<p>Noise transition matrix estimation is a promising approach for learning with label noise. It can infer clean posterior probabilities, known as Label Distribution (LD), based on noisy ones and reduce the impact of noisy labels. However, this estimation is challenging, since the ground truth labels are not always available. Most existing methods estimate a global noise transition matrix using either correctly labeled samples (anchor points) or detected reliable samples (pseudo anchor points). These methods heavily rely on the existence of anchor points or the quality of pseudo ones, and the global noise transition matrix can hardly provide accurate label transition information for each sample, since the label noise in real applications is mostly instance-dependent. To address these challenges, we propose an Instance-dependent Label Distribution Estimation (ILDE) method to learn from noisy labels for image classification. The method’s workflow has three major steps. First, we estimate each sample’s noisy posterior probability, supervised by noisy labels. Second, since mislabeling probability closely correlates with inter-class correlation, we compute the inter-class correlation matrix to estimate the noise transition matrix, bypassing the need for (pseudo) anchor points. Moreover, for a precise approximation of the instance-dependent noise transition matrix, we calculate the inter-class correlation matrix using only mini-batch samples rather than the entire training dataset. Third, we transform the noisy posterior probability into instance-dependent LD by multiplying it with the estimated noise transition matrix, using the resulting LD for enhanced supervision to prevent DCNNs from memorizing noisy labels. The proposed ILDE method has been evaluated against several state-of-the-art methods on two synthetic and three real-world noisy datasets. Our results indicate that the proposed ILDE method outperforms all competing methods, no matter whether the noise is synthetic or real noise.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142760581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReFusion: Learning Image Fusion from Reconstruction with Learnable Loss Via Meta-Learning
Pub Date : 2024-12-02 DOI: 10.1007/s11263-024-02256-8
Haowen Bai, Zixiang Zhao, Jiangshe Zhang, Yichen Wu, Lilun Deng, Yukun Cui, Baisong Jiang, Shuang Xu
Image fusion aims to combine information from multiple source images into a single one with more comprehensive informational content. Deep learning-based image fusion algorithms face significant challenges, including the lack of a definitive ground truth and a corresponding distance measurement. Additionally, current manually defined loss functions limit the model’s flexibility and generalizability across fusion tasks. To address these limitations, we propose ReFusion, a unified meta-learning-based image fusion framework that dynamically optimizes the fusion loss for various tasks through source image reconstruction. Compared to existing methods, ReFusion employs a parameterized loss function that allows the training framework to be dynamically adapted to the specific fusion scenario and task. ReFusion consists of three key components: a fusion module, a source reconstruction module, and a loss proposal module. We employ a meta-learning strategy to train the loss proposal module using the reconstruction loss. This strategy forces the fused image to be more conducive to reconstructing the source images, allowing the loss proposal module to generate an adaptive fusion loss that preserves the optimal information from the source images. The update of the fusion module relies on the learnable fusion loss proposed by the loss proposal module. The three modules are updated alternately, enhancing each other to optimize the fusion loss for different tasks and consistently achieve satisfactory results. Extensive experiments demonstrate that ReFusion is capable of adapting to various tasks, including infrared-visible, medical, multi-focus, and multi-exposure image fusion. The code is available at https://github.com/HaowenBai/ReFusion.
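The abstract describes the alternating, meta-learned update of the three modules but not their internals. The sketch below uses toy convolutional stand-ins and a simple per-pixel-weighted L1 fusion loss (both assumptions) to show one training iteration: a one-step differentiable inner update of the fusion module lets the reconstruction loss train the loss proposal module, after which the reconstruction and fusion modules are updated with their own losses.

```python
# Hypothetical, simplified alternating update; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

fusion = nn.Conv2d(2, 1, 3, padding=1)        # toy fusion module: F = f(A, B)
recon = nn.Conv2d(1, 2, 3, padding=1)         # toy source-reconstruction module: (A, B) ~ r(F)
loss_prop = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())  # per-pixel weight w

opt_fusion = torch.optim.Adam(fusion.parameters(), lr=1e-4)
opt_recon = torch.optim.Adam(recon.parameters(), lr=1e-4)
opt_loss = torch.optim.Adam(loss_prop.parameters(), lr=1e-4)

def fusion_loss(fused, a, b, w):
    """Parameterized fusion loss: per-pixel weighted distance to the two sources."""
    return (w * (fused - a).abs() + (1 - w) * (fused - b).abs()).mean()

A, B = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
src = torch.cat([A, B], dim=1)

# (1) Meta step: train the loss proposal module so that a fusion module updated with the
#     proposed loss yields fused images from which the sources can be reconstructed.
w = loss_prop(src)
inner = fusion_loss(fusion(src), A, B, w)
grads = torch.autograd.grad(inner, list(fusion.parameters()), create_graph=True)
fast = {n: p - 1e-2 * g for (n, p), g in zip(fusion.named_parameters(), grads)}
fused_fast = functional_call(fusion, fast, (src,))     # virtually updated fusion module
meta_loss = F.l1_loss(recon(fused_fast), src)          # reconstruction loss drives loss_prop
opt_loss.zero_grad(); meta_loss.backward(); opt_loss.step()

# (2) Update the reconstruction module with the reconstruction loss.
rec_loss = F.l1_loss(recon(fusion(src).detach()), src)
opt_recon.zero_grad(); rec_loss.backward(); opt_recon.step()

# (3) Update the fusion module with the (now fixed) learnable fusion loss.
w = loss_prop(src).detach()
f_loss = fusion_loss(fusion(src), A, B, w)
opt_fusion.zero_grad(); f_loss.backward(); opt_fusion.step()
```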
{"title":"ReFusion: Learning Image Fusion from Reconstruction with Learnable Loss Via Meta-Learning","authors":"Haowen Bai, Zixiang Zhao, Jiangshe Zhang, Yichen Wu, Lilun Deng, Yukun Cui, Baisong Jiang, Shuang Xu","doi":"10.1007/s11263-024-02256-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02256-8","url":null,"abstract":"<p>Image fusion aims to combine information from multiple source images into a single one with more comprehensive informational content. Deep learning-based image fusion algorithms face significant challenges, including the lack of a definitive ground truth and the corresponding distance measurement. Additionally, current manually defined loss functions limit the model’s flexibility and generalizability for various fusion tasks. To address these limitations, we propose <b>ReFusion</b>, a unified meta-learning based image fusion framework that dynamically optimizes the fusion loss for various tasks through source image reconstruction. Compared to existing methods, ReFusion employs a parameterized loss function, that allows the training framework to be dynamically adapted according to the specific fusion scenario and task. ReFusion consists of three key components: a fusion module, a source reconstruction module, and a loss proposal module. We employ a meta-learning strategy to train the loss proposal module using the reconstruction loss. This strategy forces the fused image to be more conducive to reconstruct source images, allowing the loss proposal module to generate a adaptive fusion loss that preserves the optimal information from the source images. The update of the fusion module relies on the learnable fusion loss proposed by the loss proposal module. The three modules update alternately, enhancing each other to optimize the fusion loss for different tasks and consistently achieve satisfactory results. Extensive experiments demonstrate that ReFusion is capable of adapting to various tasks, including infrared-visible, medical, multi-focus, and multi-exposure image fusion. The code is available at https://github.com/HaowenBai/ReFusion.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"12 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142758162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DiffLLE: Diffusion-based Domain Calibration for Weak Supervised Low-light Image Enhancement
Pub Date : 2024-11-27 DOI: 10.1007/s11263-024-02292-4
Shuzhou Yang, Xuanyu Zhang, Yinhuai Wang, Jiwen Yu, Yuhan Wang, Jian Zhang
Existing weak supervised low-light image enhancement methods lack sufficient effectiveness and generalization in practical applications. We attribute this to the absence of explicit supervision and the inherent gap between the real-world low-light domain and the training low-light domain. For example, low-light datasets are well designed, but real-world night scenes are plagued by sophisticated interference such as noise, artifacts, and extreme lighting conditions. In this paper, we develop Diffusion-based domain calibration to realize more robust and effective weak supervised Low-Light Enhancement, called DiffLLE. Since the diffusion model exhibits impressive denoising capability and has been trained on massive clean images, we adopt it to bridge the gap between the real low-light domain and the training degradation domain, while providing efficient priors of real-world content for weak supervised models. Specifically, we adopt a naive weak supervised enhancement algorithm to realize preliminary restoration and design two zero-shot plug-and-play modules based on the diffusion model to improve generalization and effectiveness. The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve, which makes the enhancement model perform robustly even under sophisticated real-world degradation. Due to the limited enhancement effect of the weak supervised model, we further develop the Fine-grained Target domain Distillation (FTD) module to find a more visually friendly solution space. It exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrinks the preliminary restored results from a coarse normal-light domain to a finer, high-quality clean domain, addressing the lack of strong explicit supervision for weak supervised methods. Benefiting from these, our approach even outperforms some supervised methods while using only a simple weak supervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE, especially in real-world dark scenes.
{"title":"DiffLLE: Diffusion-based Domain Calibration for Weak Supervised Low-light Image Enhancement","authors":"Shuzhou Yang, Xuanyu Zhang, Yinhuai Wang, Jiwen Yu, Yuhan Wang, Jian Zhang","doi":"10.1007/s11263-024-02292-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02292-4","url":null,"abstract":"<p>Existing weak supervised low-light image enhancement methods lack enough effectiveness and generalization in practical applications. We suppose this is because of the absence of explicit supervision and the inherent gap between real-world low-light domain and the training low-light domain. For example, low-light datasets are well-designed, but real-world night scenes are plagued with sophisticated interference such as noise, artifacts, and extreme lighting conditions. In this paper, we develop <b>Diff</b>usion-based domain calibration to realize more robust and effective weak supervised <b>L</b>ow-<b>L</b>ight <b>E</b>nhancement, called <b>DiffLLE</b>. Since the diffusion model performs impressive denoising capability and has been trained on massive clean images, we adopt it to bridge the gap between the real low-light domain and training degradation domain, while providing efficient priors of real-world content for weak supervised models. Specifically, we adopt a naive weak supervised enhancement algorithm to realize preliminary restoration and design two zero-shot plug-and-play modules based on diffusion model to improve generalization and effectiveness. The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve, which makes the enhancement model perform robustly even in sophisticated wild degradation. Due to the limited enhancement effect of the weak supervised model, we further develop the Fine-grained Target domain Distillation (FTD) module to find a more visual-friendly solution space. It exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrinks the preliminary restored results from a coarse normal-light domain to a finer high-quality clean field, addressing the lack of strong explicit supervision for weak supervised methods. Benefiting from these, our approach even outperforms some supervised methods by using only a simple weak supervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE, especially in real-world dark scenes.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"1 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142753768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Draw Sketch, Draw Flesh: Whole-Body Computed Tomography from Any X-Ray Views
Pub Date : 2024-11-27 DOI: 10.1007/s11263-024-02286-2
Yongsheng Pan, Yiwen Ye, Yanning Zhang, Yong Xia, Dinggang Shen
Stereoscopic observation is a common foundation of medical image analysis and is generally achieved by 3D medical imaging based on fixed scanners, such as CT and MRI, which are not as convenient as X-ray machines in some flexible scenarios. However, X-ray images can only provide 2D perspective observations and lack the third dimension. If 3D information could be deduced from X-ray images, it would broaden the applications of X-ray machines. Focusing on this objective, this paper is dedicated to the generation of pseudo 3D CT scans from non-parallel 2D perspective X-ray (PXR) views and proposes the Draw Sketch and Draw Flesh (DSDF) framework, which first roughly predicts the tissue distribution (Sketch) from PXR views and then renders the tissue details (Flesh) from the tissue distribution and PXR views. Different from previous studies that focus only on partial locations, e.g., the chest or neck, this study investigates the feasibility of head-to-leg reconstruction, i.e., a reconstruction generally applicable to any body part. Experiments on 559 whole-body samples from 4 cohorts suggest that our DSDF can reconstruct more reasonable pseudo CT images than state-of-the-art methods and achieves promising results in both visualization and various downstream tasks. The source code and well-trained models are available at https://github.com/YongshengPan/WholeBodyXraytoCT.
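As a purely structural illustration of the two-stage idea (not the authors' architecture), the sketch below broadcasts two PXR views along their viewing axes into a pseudo-volume, predicts a tissue-distribution volume (Sketch) with one network, and renders CT intensities (Flesh) with a second network. The frontal/lateral view geometry, the number of tissue classes, and all layer choices are hypothetical placeholders.

```python
# Two-stage "sketch then flesh" pipeline skeleton under simplifying assumptions.
import torch
import torch.nn as nn

def broadcast_views(frontal, lateral):
    """frontal, lateral: (B, 1, S, S) PXR views -> (B, 2, S, S, S) pseudo-volume (cubic)."""
    S = frontal.shape[-1]
    v1 = frontal.unsqueeze(2).expand(-1, -1, S, -1, -1)                       # repeat along depth
    v2 = lateral.permute(0, 1, 3, 2).unsqueeze(-1).expand(-1, -1, -1, -1, S)  # repeat along width
    return torch.cat([v1, v2], dim=1)

sketch_net = nn.Sequential(                         # Stage 1: soft tissue-distribution volume
    nn.Conv3d(2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 5, 3, padding=1), nn.Softmax(dim=1))   # e.g. 5 coarse tissue classes (assumed)
flesh_net = nn.Sequential(                          # Stage 2: CT intensities from sketch + views
    nn.Conv3d(5 + 2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 1, 3, padding=1))

frontal, lateral = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
views = broadcast_views(frontal, lateral)            # (1, 2, 64, 64, 64)
sketch = sketch_net(views)                           # predicted tissue distribution
pseudo_ct = flesh_net(torch.cat([sketch, views], dim=1))   # rendered pseudo CT volume
```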
{"title":"Draw Sketch, Draw Flesh: Whole-Body Computed Tomography from Any X-Ray Views","authors":"Yongsheng Pan, Yiwen Ye, Yanning Zhang, Yong Xia, Dinggang Shen","doi":"10.1007/s11263-024-02286-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02286-2","url":null,"abstract":"<p>Stereoscopic observation is a common foundation of medical image analysis and is generally achieved by 3D medical imaging based on settled scanners, such as CT and MRI, that are not as convenient as X-ray machines in some flexible scenarios. However, X-ray images can only provide perspective 2D observation and lack view in the third dimension. If 3D information can be deduced from X-ray images, it would broaden the application of X-ray machines. Focus on the above objective, this paper dedicates to the generation of pseudo 3D CT scans from non-parallel 2D perspective X-ray (PXR) views and proposes the <i>Draw Sketch and Draw Flesh</i> (DSDF) framework to first roughly predict the tissue distribution (Sketch) from PXR views and then render the tissue details (Flesh) from the tissue distribution and PXR views. Different from previous studies that focus only on partial locations, e.g., chest or neck, this study theoretically investigates the feasibility of head-to-leg reconstruction, i.e., generally applicable to any body parts. Experiments on 559 whole-body samples from 4 cohorts suggest that our DSDF can reconstruct more reasonable pseudo CT images than state-of-the-art methods and achieve promising results in both visualization and various downstream tasks. The source code and well-trained models are available a https://github.com/YongshengPan/WholeBodyXraytoCT.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"26 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142753766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ICEv2: Interpretability, Comprehensiveness, and Explainability in Vision Transformer
Pub Date : 2024-11-26 DOI: 10.1007/s11263-024-02290-6
Hoyoung Choi, Seungwan Jin, Kyungsik Han
Vision transformers use the [CLS] token to predict image classes. Their explainability visualization has been studied using relevant information from the [CLS] token or by focusing on attention scores during self-attention. However, such visualization is challenging because the interpretability of a vision transformer depends on skip connections and attention operators, non-linearities are unstable during the learning process, and self-attention scores only partially reflect relevance. We argue that the output patch embeddings in a vision transformer preserve the image information of each patch location, which can facilitate the prediction of an image class. In this paper, we propose ICEv2 (Interpretability, Comprehensiveness, and Explainability in Vision Transformer), an explainability visualization method that addresses the limitations of ICE (i.e., the high dependence of performance on hyperparameters and the inability to preserve the model’s properties) by minimizing the number of trained encoder layers, redesigning the MLP layer, and optimizing hyperparameters for various model sizes. Overall, ICEv2 shows higher efficiency, performance, robustness, and scalability than ICE. On the ImageNet-Segmentation dataset, ICEv2 outperformed all explainability visualization methods across all model sizes. On the Pascal VOC dataset, ICEv2 outperformed both self-supervised and supervised methods on Jaccard similarity. In unsupervised single-object discovery, where untrained classes are present in the images, ICEv2 effectively distinguished between foreground and background, showing performance comparable to the previous state-of-the-art. Lastly, ICEv2 can be trained with significantly lower computational complexity.
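To make the core argument concrete, the sketch below shows one simple way (an assumption, not the ICEv2 method itself) that output patch embeddings, rather than only the [CLS] token, can yield a per-patch relevance map: every patch embedding is scored with the classification head and the target-class scores are reshaped into a heatmap.

```python
# Illustrative patch-embedding relevance map; names and normalization are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_relevance_map(patch_tokens, head, target_class, grid_size):
    """patch_tokens: (B, N, C) output patch embeddings of a ViT (without [CLS]);
    head: the classification head applied per patch; returns (B, H, W) relevance."""
    B, N, C = patch_tokens.shape
    logits = head(patch_tokens)                       # (B, N, num_classes)
    rel = logits[..., target_class]                   # score of the explained class per patch
    rel = rel.reshape(B, grid_size, grid_size)
    rel = rel - rel.amin(dim=(1, 2), keepdim=True)    # normalize to [0, 1] for visualization
    return rel / rel.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)

tokens = torch.randn(1, 14 * 14, 768)                 # e.g. ViT-B/16 tokens for a 224x224 image
head = nn.Linear(768, 1000)
heatmap = patch_relevance_map(tokens, head, target_class=207, grid_size=14)
upsampled = F.interpolate(heatmap.unsqueeze(1), size=(224, 224), mode="bilinear")
```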
{"title":"ICEv2: Interpretability, Comprehensiveness, and Explainability in Vision Transformer","authors":"Hoyoung Choi, Seungwan Jin, Kyungsik Han","doi":"10.1007/s11263-024-02290-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02290-6","url":null,"abstract":"<p>Vision transformers use [CLS] token to predict image classes. Their explainability visualization has been studied using relevant information from the [CLS] token or focusing on attention scores during self-attention. However, such visualization is challenging because of the dependence of the interpretability of a vision transformer on skip connections and attention operators, the instability of non-linearities in the learning process, and the limited reflection of self-attention scores on relevance. We argue that the output patch embeddings in a vision transformer preserve the image information of each patch location, which can facilitate the prediction of an image class. In this paper, we propose ICEv2 (ICEv2: <span>({{{underline{varvec{I}}}}})</span>nterpretability, <span>({{{underline{varvec{C}}}}})</span>omprehensiveness, and <span>({{{underline{varvec{E}}}}})</span>xplainability in Vision Transformer), an explainability visualization method that addresses the limitations of ICE (i.e., high dependence of hyperparameters on performance and the inability to preserve the model’s properties) by minimizing the number of training encoder layers, redesigning the MLP layer, and optimizing hyperparameters along with various model size. Overall, ICEv2 shows higher efficiency, performance, robustness, and scalability than ICE. On the ImageNet-Segmentation dataset, ICEv2 outperformed all explainability visualization methods in all cases depending on the model size. On the Pascal VOC dataset, ICEv2 outperformed both self-supervised and supervised methods on Jaccard similarity. In the unsupervised single object discovery, where untrained classes are present in the images, ICEv2 effectively distinguished between foreground and background, showing performance comparable to the previous state-of-the-art. Lastly, ICEv2 can be trained with significantly lower training computational complexity.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"67 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142718529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Globally Correlation-Aware Hard Negative Generation
Pub Date : 2024-11-25 DOI: 10.1007/s11263-024-02288-0
Wenjie Peng, Hongxiang Huang, Tianshui Chen, Quhui Ke, Gang Dai, Shuangping Huang
Hard negative generation aims to generate informative negative samples that help to determine the decision boundaries and thus facilitate advancing deep metric learning. Current works select pair/triplet samples, learn their correlations, and fuse them to generate hard negatives. However, these works merely consider the local correlations of the selected samples, ignoring the global sample correlations that would provide richer information for generating more informative negatives. In this work, we propose a globally correlation-aware hard negative generation (GCA-HNG) framework, which first learns sample correlations from a global perspective and then exploits these correlations to guide the generation of hardness-adaptive and diverse negatives. Specifically, the approach begins by constructing a structured graph to model sample correlations, where each node represents a specific sample and each edge represents the correlation between the corresponding samples. Then, we introduce an iterative graph message propagation scheme to propagate node and edge messages through the whole graph and thus learn the sample correlations globally. Finally, with the guidance of the learned global correlations, we propose a channel-adaptive scheme to combine an anchor and multiple negatives for hard negative generation (HNG). Compared to current methods, GCA-HNG perceives sample correlations with numerous negatives from a global and comprehensive perspective and generates negatives with better hardness and diversity. Extensive experimental results demonstrate that the proposed GCA-HNG is superior to related methods on four image retrieval benchmark datasets.
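The sketch below is a rough interpretation under my own assumptions (not the paper's exact formulation): a mini-batch of embeddings forms a fully connected graph, a few rounds of message passing yield globally correlation-aware node states, and a channel-wise gate mixes each anchor with its negatives to synthesize a hard negative.

```python
# Hypothetical globally correlation-aware hard negative generation sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHNG(nn.Module):
    def __init__(self, dim, rounds=2):
        super().__init__()
        self.rounds = rounds
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())  # per-channel weights

    def forward(self, emb):
        """emb: (B, C) mini-batch embeddings; returns one generated negative per sample."""
        h = emb
        for _ in range(self.rounds):
            adj = F.softmax(h @ h.t(), dim=-1)        # edge weights = global correlations
            m = adj @ self.msg(h)                     # aggregate messages from all samples
            h = self.upd(m, h)                        # update node states
        # Channel-adaptive combination of each anchor with the mean of the other samples,
        # treated here as its negatives for simplicity.
        neg = (emb.sum(0, keepdim=True) - emb) / (emb.shape[0] - 1)
        w = self.gate(torch.cat([h, neg], dim=-1))    # gate conditioned on global node states
        return w * emb + (1 - w) * neg                # hardness-adaptive synthetic negative

hard_negs = GlobalHNG(dim=128)(torch.randn(32, 128))
```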
{"title":"Globally Correlation-Aware Hard Negative Generation","authors":"Wenjie Peng, Hongxiang Huang, Tianshui Chen, Quhui Ke, Gang Dai, Shuangping Huang","doi":"10.1007/s11263-024-02288-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02288-0","url":null,"abstract":"<p>Hard negative generation aims to generate informative negative samples that help to determine the decision boundaries and thus facilitate advancing deep metric learning. Current works select pair/triplet samples, learn their correlations, and fuse them to generate hard negatives. However, these works merely consider the local correlations of selected samples, ignoring global sample correlations that would provide more significant information to generate more informative negatives. In this work, we propose a globally correlation-aware hard negative generation (GCA-HNG) framework, which first learns sample correlations from a global perspective and exploits these correlations to guide generating hardness-adaptive and diverse negatives. Specifically, this approach begins by constructing a structured graph to model sample correlations, where each node represents a specific sample and each edge represents the correlations between corresponding samples. Then, we introduce an iterative graph message propagation to propagate the messages of node and edge through the whole graph and thus learn the sample correlations globally. Finally, with the guidance of the learned global correlations, we propose a channel-adaptive manner to combine an anchor and multiple negatives for HNG. Compared to current methods, GCA-HNG allows perceiving sample correlations with numerous negatives from a global and comprehensive perspective and generates the negatives with better hardness and diversity. Extensive experiment results demonstrate that the proposed GCA-HNG is superior to related methods on four image retrieval benchmark datasets.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"80 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142697109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IEBins: Iterative Elastic Bins for Monocular Depth Estimation and Completion
Pub Date : 2024-11-25 DOI: 10.1007/s11263-024-02293-3
Shuwei Shao, Zhongcai Pei, Weihai Chen, Peter C. Y. Chen, Zhengguo Li
Monocular depth estimation and completion are fundamental aspects of geometric computer vision, serving as essential techniques for various downstream applications. In recent developments, several methods have reformulated these two tasks as a classification-regression problem, deriving depth as a linear combination of the predicted probability distribution and the bin centers. In this paper, we introduce an innovative concept termed iterative elastic bins (IEBins) for classification-regression-based monocular depth estimation and completion. IEBins builds on the idea of iteratively dividing bins. In the initialization stage, a coarse and uniform discretization is applied to the entire depth range. Subsequent update stages then iteratively identify and uniformly discretize the target bin, leveraging it as the new depth range for further refinement. To mitigate the risk of error accumulation during iterations, we propose a novel elastic target bin to replace the original one. The width of this elastic bin is dynamically adapted according to the depth uncertainty. Furthermore, we develop dedicated frameworks to instantiate IEBins. Extensive experiments on the KITTI, NYU-Depth-v2, SUN RGB-D, ScanNet and DIODE datasets indicate that our method outperforms prior state-of-the-art monocular depth estimation and completion competitors.
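A minimal numerical sketch of the iterative elastic-bins idea follows. In practice the per-pixel bin probabilities come from a network head; here they are provided by a toy function, and the elastic-width rule (widening the target bin according to the predicted depth uncertainty) is a simplified assumption rather than the paper's exact formula.

```python
# Toy, single-pixel illustration of iterative elastic bins; names are hypothetical.
import torch

def iebins_depth(prob_fn, d_min=0.1, d_max=80.0, n_bins=16, n_iters=3):
    lo = torch.full((1,), d_min)
    hi = torch.full((1,), d_max)
    for _ in range(n_iters):
        centers = lo + (hi - lo) * (torch.arange(n_bins) + 0.5) / n_bins   # uniform bins
        probs = prob_fn(centers)                                           # (n_bins,) per pixel
        depth = (probs * centers).sum()                 # linear combination of probs and centers
        std = (probs * (centers - depth) ** 2).sum().sqrt()                # depth uncertainty
        target = probs.argmax()                                            # target bin
        width = (hi - lo) / n_bins
        elastic = torch.maximum(width, 2 * std)         # elastic bin: widen when uncertain
        lo = (centers[target] - elastic / 2).clamp(min=d_min)              # new depth range
        hi = (centers[target] + elastic / 2).clamp(max=d_max)
    return depth

depth = iebins_depth(lambda c: torch.softmax(-(c - 12.3).abs(), dim=0))    # toy probability head
```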
{"title":"IEBins: Iterative Elastic Bins for Monocular Depth Estimation and Completion","authors":"Shuwei Shao, Zhongcai Pei, Weihai Chen, Peter C. Y. Chen, Zhengguo Li","doi":"10.1007/s11263-024-02293-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02293-3","url":null,"abstract":"<p>Monocular depth estimation and completion are fundamental aspects of geometric computer vision, serving as essential techniques for various downstream applications. In recent developments, several methods have reformulated these two tasks as a <i>classification-regression</i> problem, deriving depth with a linear combination of predicted probabilistic distribution and bin centers. In this paper, we introduce an innovative concept termed <b>iterative elastic bins (IEBins)</b> for the classification-regression-based monocular depth estimation and completion. The IEBins involves the idea of iterative division of bins. In the initialization stage, a coarse and uniform discretization is applied to the entire depth range. Subsequent update stages then iteratively identify and uniformly discretize the target bin, by leveraging it as the new depth range for further refinement. To mitigate the risk of error accumulation during iterations, we propose a novel elastic target bin, replacing the original one. The width of this elastic bin is dynamically adapted according to the depth uncertainty. Furthermore, we develop dedicated frameworks to instantiate the IEBins. Extensive experiments on the KITTI, NYU-Depth-v2, SUN RGB-D, ScanNet and DIODE datasets indicate that our method outperforms prior state-of-the-art monocular depth estimation and completion competitors.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"43 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142712464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transformer for Object Re-identification: A Survey
Pub Date : 2024-11-23 DOI: 10.1007/s11263-024-02284-4
Mang Ye, Shuoyi Chen, Chenyue Li, Wei-Shi Zheng, David Crandall, Bo Du
Object re-identification (Re-ID), which aims to identify specific objects across different times and scenes, is a widely researched task in computer vision. For a prolonged period, this field has been predominantly driven by deep learning technology based on convolutional neural networks. In recent years, the emergence of Vision Transformers has spurred a growing number of studies delving deeper into Transformer-based Re-ID, continuously breaking performance records and bringing significant progress to the Re-ID field. Offering a powerful, flexible, and unified solution, Transformers cater to a wide array of Re-ID tasks with unparalleled efficacy. This paper provides a comprehensive review and in-depth analysis of Transformer-based Re-ID. We categorize existing works into image/video-based Re-ID, Re-ID with limited data/annotations, cross-modal Re-ID, and special Re-ID scenarios, and thoroughly elucidate the advantages demonstrated by the Transformer in addressing a multitude of challenges across these domains. Considering the trend toward unsupervised Re-ID, we propose a new Transformer baseline, UntransReID, achieving state-of-the-art performance on both single- and cross-modal tasks. For the under-explored animal Re-ID, we devise a standardized experimental benchmark and conduct extensive experiments to explore the applicability of Transformers to this task and facilitate future research. Finally, we discuss some important yet under-investigated open issues in the era of large foundation models; we believe this survey will serve as a new handbook for researchers in this field. A periodically updated website is available at https://github.com/mangye16/ReID-Survey.
{"title":"Transformer for Object Re-identification: A Survey","authors":"Mang Ye, Shuoyi Chen, Chenyue Li, Wei-Shi Zheng, David Crandall, Bo Du","doi":"10.1007/s11263-024-02284-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02284-4","url":null,"abstract":"<p>Object Re-identification (Re-ID) aims to identify specific objects across different times and scenes, which is a widely researched task in computer vision. For a prolonged period, this field has been predominantly driven by deep learning technology based on convolutional neural networks. In recent years, the emergence of Vision Transformers has spurred a growing number of studies delving deeper into Transformer-based Re-ID, continuously breaking performance records and witnessing significant progress in the Re-ID field. Offering a powerful, flexible, and unified solution, Transformers cater to a wide array of Re-ID tasks with unparalleled efficacy. This paper provides a comprehensive review and in-depth analysis of the Transformer-based Re-ID. In categorizing existing works into Image/Video-Based Re-ID, Re-ID with limited data/annotations, Cross-Modal Re-ID, and Special Re-ID Scenarios, we thoroughly elucidate the advantages demonstrated by the Transformer in addressing a multitude of challenges across these domains. Considering the trending unsupervised Re-ID, we propose a new Transformer baseline, UntransReID, achieving state-of-the-art performance on both single/cross modal tasks. For the under-explored animal Re-ID, we devise a standardized experimental benchmark and conduct extensive experiments to explore the applicability of Transformer for this task and facilitate future research. Finally, we discuss some important yet under-investigated open issues in the large foundation model era, we believe it will serve as a new handbook for researchers in this field. A periodically updated website will be available at https://github.com/mangye16/ReID-Survey.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"15 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142690532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}