Opti-CAM: Optimizing saliency maps for interpretability
Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104101
Methods based on class activation maps (CAM) provide a simple mechanism to interpret predictions of convolutional neural networks by using linear combinations of feature maps as saliency maps. By contrast, masking-based methods optimize a saliency map directly in the image space or learn it by training another network on additional data. In this work we introduce Opti-CAM, combining ideas from CAM-based and masking-based approaches. Our saliency map is a linear combination of feature maps, where weights are optimized per image such that the logit of the masked image for a given class is maximized. We also fix a fundamental flaw in two of the most common evaluation metrics of attribution methods. On several datasets, Opti-CAM largely outperforms other CAM-based approaches according to the most relevant classification metrics. We provide empirical evidence supporting that localization and classifier interpretability are not necessarily aligned.
{"title":"Opti-CAM: Optimizing saliency maps for interpretability","authors":"","doi":"10.1016/j.cviu.2024.104101","DOIUrl":"10.1016/j.cviu.2024.104101","url":null,"abstract":"<div><p>Methods based on <em>class activation maps</em> (CAM) provide a simple mechanism to interpret predictions of convolutional neural networks by using linear combinations of feature maps as saliency maps. By contrast, masking-based methods optimize a saliency map directly in the image space or learn it by training another network on additional data. In this work we introduce Opti-CAM, combining ideas from CAM-based and masking-based approaches. Our saliency map is a linear combination of feature maps, where weights are optimized per image such that the logit of the masked image for a given class is maximized. We also fix a fundamental flaw in two of the most common evaluation metrics of attribution methods. On several datasets, Opti-CAM largely outperforms other CAM-based approaches according to the most relevant classification metrics. We provide empirical evidence supporting that localization and classifier interpretability are not necessarily aligned.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001826/pdfft?md5=bb10084a23cbb5c9ee9c37c96c7ca368&pid=1-s2.0-S1077314224001826-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141984702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end pedestrian trajectory prediction via Efficient Multi-modal Predictors
Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104107
Pedestrian trajectory prediction plays a key role in understanding human behavior and guiding autonomous driving. It is a difficult task due to the multi-modal nature of human motion. Recent advances have mainly focused on modeling this multi-modality, either with implicit generative models or with explicit pre-defined anchors. However, the former is limited by the sampling problem, while the latter introduces a strong prior into the data, and both require extra tricks to achieve good performance. To address these issues, we propose a simple yet effective framework called Efficient Multi-modal Predictors (EMP), which casts off the generative paradigm and predicts multi-modal trajectories in an end-to-end style. It combines a set of parallel predictors with a model-error-based sparse selector. During training, the set of parallel multi-modal predictors converges into disjoint subsets, with each subset specializing in one mode, thus obtaining multi-modal predictions without human priors and mitigating the problems of both paradigms above. Experiments on the SDD/ETH-UCY/NBA datasets show that EMP achieves state-of-the-art performance with the highest inference speed. Additionally, we show that by replacing their multi-modal modules with EMP, state-of-the-art works outperform their own baselines, which further validates the versatility of EMP. Moreover, we formally prove that EMP can alleviate mode collapse and has a low test error bound.
{"title":"End-to-end pedestrian trajectory prediction via Efficient Multi-modal Predictors","authors":"","doi":"10.1016/j.cviu.2024.104107","DOIUrl":"10.1016/j.cviu.2024.104107","url":null,"abstract":"<div><p>Pedestrian trajectory prediction plays a key role in understanding human behavior and guiding autonomous driving. It is a difficult task due to the multi-modal nature of human motion. Recent advances have mainly focused on modeling this multi-modality, either by using implicit generative models or explicit pre-defined anchors. However, the former is limited by the sampling problem, while the latter introduces strong prior to the data, both of which require extra tricks to achieve better performance. To address these issues, we propose a simple yet effective framework called Efficient Multi-modal Predictors (EMP), which casts off the generative paradigm and predicts multi-modal trajectories in an end-to-end style. It is achieved by combining a set of parallel predictors with a model error based sparse selector. During training, the entire set of parallel multi-modal predictors will converge into disjoint subsets, with each subset specializing in one mode, thus obtaining multi-modal prediction with no human prior and reducing the problems of above two genres. Experiments on SDD/ETH-UCY/NBA datasets show that EMP achieves state-of-the-art performance with the highest inference speed. Additionally, we show that by replacing multi-modal modules with EMP, state-of-the-art works outperform their baselines, which further validate the versatility of EMP. Moreover, we formally prove that EMP can alleviate the problem of modal collapse and has a low test error bound.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MFCT: Multi-Frequency Cascade Transformers for no-reference SR-IQA
Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104104
Super-resolution image reconstruction techniques have advanced quickly, leading to a sizable number of super-resolution images produced by different techniques. Nevertheless, accurately assessing the quality of super-resolution images remains a formidable challenge. This paper introduces novel Multi-Frequency Cascade Transformers (MFCT) for evaluating super-resolution image quality (SR-IQA). First, we develop a Frequency-Divided Module (FDM) to decompose super-resolution images into three frequency bands. Subsequently, Cascade Transformer Blocks incorporating hierarchical self-attention mechanisms are employed to capture cross-window features for quality perception. Finally, the image quality scores from the different frequency bands are fused to derive the overall quality score. Experimental results show that, on the chosen SR-IQA databases, the proposed MFCT-based SR-IQA method consistently outperforms all compared Image Quality Assessment (IQA) models. Furthermore, thorough ablation studies demonstrate that the proposed approach exhibits impressive generalization ability compared with earlier rivals. The code will be available at https://github.com/kbzhang0505/MFCT.
{"title":"MFCT: Multi-Frequency Cascade Transformers for no-reference SR-IQA","authors":"","doi":"10.1016/j.cviu.2024.104104","DOIUrl":"10.1016/j.cviu.2024.104104","url":null,"abstract":"<div><p>Super-resolution image reconstruction techniques have advanced quickly, leading to the generation of a sizable number of super-resolution images using different super-resolution techniques. Nevertheless, accurately assessing the quality of super-resolution images remains a formidable challenge. This paper introduces a novel Multi-Frequency Cascade Transformers (MFCT) for evaluating super-resolution image quality (SR-IQA). In the first step, we develop a unique Frequency-Divided Module (FDM) to transform the super-resolution images into three different frequency bands. Subsequently, the Cascade Transformer Blocks (CAF) incorporating hierarchical self-attention mechanisms are employed to capture cross-window features for quality perception. Ultimately, the image quality scores from different frequency bands are fused to derive the overall image quality score. The experimental results show that, on the chosen SR-IQA databases, the proposed MFCT-based SR-IQA method can consistently outperforms all the compared Image Quality Assessment (IQA) models. Furthermore, a collection of thorough ablation studies demonstrates that, when compared to other earlier rivals, the newly proposed approach exhibits impressive generalization ability. The code will be available at <span><span>https://github.com/kbzhang0505/MFCT</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141990639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DHS-DETR: Efficient DETRs with dynamic head switching
Pub Date: 2024-08-06 | DOI: 10.1016/j.cviu.2024.104106
Detection Transformer (DETR) and its variants have emerged as a new paradigm for object detection, but their high computational cost hinders practical application. Investigating their essential components, we found that the transformer-based head usually accounts for a significant amount of the computation. By further comparing heavy and light transformer heads, we observed that both produce satisfactory results on easy images while differing noticeably on hard images. Inspired by these findings, we propose a dynamic head switching (DHS) strategy that dynamically selects the proper head for each image at inference, for a better balance of efficiency and accuracy. Specifically, our DETR model incorporates multiple heads of different computational complexity and a lightweight module that selects the proper head for a given image. This module is optimized to maximize detection accuracy while adhering to an overall computational budget. To minimize the potential accuracy drop when executing the lighter heads, we propose online head distillation (OHD), which improves the accuracy of the lighter heads with the help of the heavier head. Extensive experiments on the MS COCO dataset validate the effectiveness of the proposed method, which demonstrates a better accuracy–efficiency trade-off than a baseline using static heads.
View-aligned pixel-level feature aggregation for 3D shape classification
Pub Date: 2024-08-06 | DOI: 10.1016/j.cviu.2024.104098
Multi-view 3D shape classification, which identifies a 3D shape from its 2D views rendered from different viewpoints, has emerged as a promising approach to shape understanding. A key building block in these methods is cross-view feature aggregation. However, existing methods predominantly follow the "extract-then-aggregate" pipeline for view-level global feature aggregation, leaving cross-view pixel-level feature interaction under-explored. To tackle this issue, we develop a "fuse-while-extract" pipeline with a novel View-aligned Pixel-level Fusion (VPF) module that fuses cross-view pixel-level features originating from the same 3D part. We first reconstruct the 3D coordinate of each feature from the rasterization results, then match and fuse the features via spatial neighbor searching. Incorporating the proposed VPF module into a ResNet18 backbone, we build a novel view-aligned multi-view network that alternates feature extraction and cross-view fusion. Extensive experiments demonstrate the effectiveness of the VPF module and the excellent performance of the proposed network.
{"title":"View-aligned pixel-level feature aggregation for 3D shape classification","authors":"","doi":"10.1016/j.cviu.2024.104098","DOIUrl":"10.1016/j.cviu.2024.104098","url":null,"abstract":"<div><p>Multi-view 3D shape classification, which identifies a 3D shape based on its 2D views rendered from different viewpoints, has emerged as a promising method of shape understanding. A key building block in these methods is cross-view feature aggregation. However, existing methods dominantly follow the “extract-then-aggregate” pipeline for view-level global feature aggregation, leaving cross-view pixel-level feature interaction under-explored. To tackle this issue, we develop a “fuse-while-extract” pipeline, with a novel View-aligned Pixel-level Fusion (VPF) module to fuse cross-view pixel-level features originating from the same 3D part. We first reconstruct the 3D coordinate of each feature via the rasterization results, then match and fuse the features via spatial neighbor searching. Incorporating the proposed VPF module with ResNet18 backbone, we build a novel view-aligned multi-view network, which conducts feature extraction and cross-view fusion alternatively. Extensive experiments have demonstrated the effectiveness of the VPF module as well as the excellent performance of the proposed network.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-dimensional attention-aided transposed ConvBiLSTM network for hyperspectral image super-resolution
Pub Date: 2024-08-05 | DOI: 10.1016/j.cviu.2024.104096
Hyperspectral (HS) images suffer from low spatial resolution compared with conventional optical images, which limits their further application in remote sensing. HS image super-resolution (SR) techniques are therefore broadly employed to recover finer spatial structures while preserving the spectra of ground covers. In this paper, a novel multi-dimensional attention-aided transposed convolutional long short-term memory (LSTM) network is proposed for the single HS image super-resolution task. The proposed network employs a convolutional bi-directional LSTM for local and non-local spatial–spectral feature exploration, and transposed convolution for image amplification and reconstruction. Moreover, a multi-dimensional attention module is proposed to capture salient features along the spectral, channel, and spatial dimensions simultaneously, further improving the learning ability of the network. Experiments on four commonly used HS images demonstrate the effectiveness of this approach compared with several state-of-the-art deep learning-based SR methods.
Cascaded UNet for progressive noise residual prediction for structure-preserving video denoising
Pub Date: 2024-08-05 | DOI: 10.1016/j.cviu.2024.104103
High-quality video services have become so prominent that by 2030 approximately 80% of internet traffic is estimated to consist of video. In contrast, video denoising remains a relatively unexplored and intricate field, presenting more substantial challenges than image denoising. Many published deep learning video denoising algorithms rely on simple, efficient single encoder–decoder networks, but these have inherent limitations in preserving intricate image details and in managing the propagation of noise information for noise residue modelling. In response to these challenges, the proposed work introduces cascaded UNets for progressive noise residual prediction in video denoising. This multi-stage encoder–decoder architecture is designed to accurately predict noise residual maps, thereby preserving locally fine details within video content, as reflected by SSIM. The proposed network is trained end-to-end from scratch without explicit motion compensation, to reduce complexity. In terms of the more rigorous SSIM metric, the proposed network outperforms all compared video denoising methods while maintaining a comparable PSNR.
{"title":"Cascaded UNet for progressive noise residual prediction for structure-preserving video denoising","authors":"","doi":"10.1016/j.cviu.2024.104103","DOIUrl":"10.1016/j.cviu.2024.104103","url":null,"abstract":"<div><p>The prominence of high-quality video services has become so substantial that by 2030, it is estimated that approximately 80% of internet traffic will consist of videos. On the contrary, video denoising remains a relatively unexplored and intricate field, presenting more substantial challenges compared to image denoising. Many published deep learning video denoising algorithms typically rely on simple, efficient single encoder–decoder networks, but they have inherent limitations in preserving intricate image details and effectively managing noise information propagation for noise residue modelling. In response to these challenges, the proposed work introduces an innovative approach; in terms of utilization of cascaded UNets for progressive noise residual prediction in video denoising. This multi-stage encoder–decoder architecture is meticulously designed to accurately predict noise residual maps, thereby preserving the locally fine details within video content as represented by SSIM. The proposed network has undergone extensive end-to-end training from scratch without explicit motion compensation to reduce complexity. In terms of the more rigorous SSIM metric, the proposed network outperformed all video denoising methods while maintaining a comparable PSNR.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint pyramidal perceptual attention and hierarchical consistency constraint for gaze estimation
Pub Date: 2024-08-02 | DOI: 10.1016/j.cviu.2024.104105
Eye gaze provides valuable cues about human intent, making gaze estimation a hot topic. Extracting multi-scale information has recently proven effective for gaze estimation in complex scenarios. However, existing multi-scale gaze estimation methods tend to use information only from single-level feature maps, and information across different scales may lack relevance. To address these issues, we propose a novel joint pyramidal perceptual attention and hierarchical consistency constraint (PaCo) for gaze estimation. PaCo consists of two main components: a pyramidal perceptual attention module (PPAM) and a hierarchical consistency constraint (HCC). Specifically, PPAM first extracts multi-scale spatial features using a pyramid structure, then aggregates information from coarse to fine granularity. In this way, PPAM enables the model to focus on both the eye region and the facial region at multiple scales simultaneously. HCC then enforces consistency between low-level and high-level features, enhancing gaze semantic consistency across feature levels. Combining PPAM and HCC, PaCo learns more discriminative features in complex situations. Extensive experiments show that PaCo achieves significant performance improvements on challenging datasets such as Gaze360, MPIIFaceGaze, and RT-GENE, reducing errors to 10.27°, 3.23°, and 6.46°, respectively.
{"title":"Joint pyramidal perceptual attention and hierarchical consistency constraint for gaze estimation","authors":"","doi":"10.1016/j.cviu.2024.104105","DOIUrl":"10.1016/j.cviu.2024.104105","url":null,"abstract":"<div><p>Eye gaze provides valuable cues about human intent, making gaze estimation a hot topic. Extracting multi-scale information has recently proven effective for gaze estimation in complex scenarios. However, existing methods for estimating gaze based on multi-scale features tend to focus only on information from single-level feature maps. Furthermore, information across different scales may also lack relevance. To address these issues, we propose a novel joint pyramidal perceptual attention and hierarchical consistency constraint (PaCo) for gaze estimation. The proposed PaCo consists of two main components: pyramidal perceptual attention module (PPAM) and hierarchical consistency constraint (HCC). Specifically, PPAM first extracts multi-scale spatial features using a pyramid structure, and then aggregates information from coarse granularity to fine granularity. In this way, PPAM enables the model to simultaneously focus on both the eye region and facial region at multiple scales. Then, HCC makes constrains consistency on low-level and high-level features, aiming to enhance the gaze semantic consistency between different feature levels. With the combination of PPAM and HCC, PaCo can learn more discriminative features in complex situations. Extensive experimental results show that PaCo achieves significant performance improvements on challenging datasets such as Gaze360, MPIIFaceGaze, and RT-GENE,reducing errors to 10.27<span><math><mo>°</mo></math></span>, 3.23<span><math><mo>°</mo></math></span>, 6.46<span><math><mo>°</mo></math></span>, respectively.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UC-former: A multi-scale image deraining network using enhanced transformer
Pub Date: 2024-07-31 | DOI: 10.1016/j.cviu.2024.104097
While convolutional neural networks (CNNs) have achieved remarkable performance in single image deraining, the task remains very challenging due to CNNs' limited receptive field and the lack of realism in output images. In this paper, we present UC-former, an effective and efficient U-shaped Transformer-based architecture for image deraining. UC-former has two core designs that avoid heavy self-attention computation and inefficient communication between encoder and decoder. First, we propose a novel Transformer block that computes self-attention between channels. It significantly reduces the computational complexity on high-resolution rain maps while capturing global context. Second, we propose a multi-scale feature fusion module between the encoder and decoder that combines low-level local features and high-level non-local features. In addition, we employ depth-wise convolution and the H-Swish non-linear activation function in the Transformer blocks to enhance the authenticity of rain removal. Extensive experiments indicate that our method outperforms state-of-the-art deraining approaches on synthetic and real-world rainy datasets.
Invisible gas detection: An RGB-thermal cross attention network and a new benchmark
Pub Date: 2024-07-31 | DOI: 10.1016/j.cviu.2024.104099
The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward way to identify gas leakage areas. However, developing high-quality algorithms has been challenging due to the low texture of thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream architecture to integrate texture information from RGB images and gas-area information from thermal images. Additionally, to facilitate research on invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database comprising about 1.3K well-annotated RGB-thermal images across eight collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods and surpassing single-stream SOTA models in accuracy, Intersection over Union (IoU), and F2 score by 4.86%, 5.65%, and 4.88%, respectively. The code and data can be found at https://github.com/logic112358/RT-CAN.
{"title":"Invisible gas detection: An RGB-thermal cross attention network and a new benchmark","authors":"","doi":"10.1016/j.cviu.2024.104099","DOIUrl":"10.1016/j.cviu.2024.104099","url":null,"abstract":"<div><p>The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the <strong>R</strong>GB-<strong>T</strong>hermal <strong>C</strong>ross <strong>A</strong>ttention <strong>N</strong>etwork (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data can be found at <span><span>https://github.com/logic112358/RT-CAN</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}