
Latest articles from Computer Vision and Image Understanding

Opti-CAM: Optimizing saliency maps for interpretability
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104101

Methods based on class activation maps (CAM) provide a simple mechanism to interpret predictions of convolutional neural networks by using linear combinations of feature maps as saliency maps. By contrast, masking-based methods optimize a saliency map directly in the image space or learn it by training another network on additional data. In this work we introduce Opti-CAM, combining ideas from CAM-based and masking-based approaches. Our saliency map is a linear combination of feature maps, where the weights are optimized per image such that the logit of the masked image for a given class is maximized. We also fix a fundamental flaw in two of the most common evaluation metrics of attribution methods. On several datasets, Opti-CAM largely outperforms other CAM-based approaches according to the most relevant classification metrics. We provide empirical evidence that localization and classifier interpretability are not necessarily aligned.
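
The following is a minimal sketch of the idea described in the abstract (not the authors' code): for a single image, optimize a weight vector over the feature maps of one convolutional layer so that the classifier's logit for the target class, computed on the masked image, is maximized. The backbone, hooked layer, saliency normalization, and optimization hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def opti_cam_sketch(image, target_class, steps=50, lr=0.1):
    """image: (1, 3, H, W) normalized tensor; returns an (H, W) saliency map."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    for p in model.parameters():
        p.requires_grad_(False)

    feats = {}
    def hook(module, inputs, output):
        feats["A"] = output                     # (1, C, h, w) feature maps of the hooked layer
    handle = model.layer4.register_forward_hook(hook)

    with torch.no_grad():
        model(image)
    A = feats["A"]
    w = torch.zeros(1, A.shape[1], 1, 1, requires_grad=True)   # one weight per feature map
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        weights = torch.softmax(w.flatten(), 0).view_as(w)
        sal = (weights * A).sum(1, keepdim=True)
        sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)   # assumed [0, 1] normalization
        mask = F.interpolate(sal, size=image.shape[-2:], mode="bilinear", align_corners=False)
        logit = model(image * mask)[0, target_class]   # logit of the masked image
        loss = -logit                                  # maximize the class logit
        opt.zero_grad()
        loss.backward()
        opt.step()

    handle.remove()
    return mask.detach()[0, 0]
```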

Citations: 0
End-to-end pedestrian trajectory prediction via Efficient Multi-modal Predictors
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104107

Pedestrian trajectory prediction plays a key role in understanding human behavior and guiding autonomous driving. It is a difficult task due to the multi-modal nature of human motion. Recent advances have mainly focused on modeling this multi-modality, either with implicit generative models or with explicit pre-defined anchors. However, the former is limited by the sampling problem, while the latter imposes a strong prior on the data, and both require extra tricks to achieve good performance. To address these issues, we propose a simple yet effective framework called Efficient Multi-modal Predictors (EMP), which casts off the generative paradigm and predicts multi-modal trajectories in an end-to-end manner. It combines a set of parallel predictors with a model-error-based sparse selector. During training, the set of parallel multi-modal predictors converges into disjoint subsets, with each subset specializing in one mode, thus producing multi-modal predictions without any human prior and avoiding the problems of the above two paradigms. Experiments on the SDD/ETH-UCY/NBA datasets show that EMP achieves state-of-the-art performance with the highest inference speed. Additionally, we show that replacing the multi-modal modules of state-of-the-art methods with EMP lets them outperform their baselines, which further validates the versatility of EMP. Moreover, we formally prove that EMP can alleviate the problem of modal collapse and has a low test error bound.
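
The abstract does not spell out the selector; the toy sketch below uses a winner-take-all rule as one plausible reading of a "model-error-based sparse selector": K parallel heads each predict a full trajectory, and only the head with the lowest error on a sample receives gradient. The GRU encoder, head design, and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

K, OBS, PRED = 6, 8, 12          # modes, observed steps, predicted steps

class ParallelPredictors(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(2, hidden, batch_first=True)
        # K independent heads, each regressing one full future trajectory (one mode)
        self.heads = nn.ModuleList(nn.Linear(hidden, PRED * 2) for _ in range(K))

    def forward(self, obs):                      # obs: (B, OBS, 2)
        _, h = self.encoder(obs)                 # h: (1, B, hidden)
        h = h.squeeze(0)
        return torch.stack([head(h).view(-1, PRED, 2) for head in self.heads],
                           dim=1)                # (B, K, PRED, 2)

model = ParallelPredictors()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

obs = torch.randn(32, OBS, 2)                    # dummy past trajectories
gt = torch.randn(32, PRED, 2)                    # dummy future trajectories

preds = model(obs)                               # (B, K, PRED, 2)
errors = ((preds - gt.unsqueeze(1)) ** 2).mean(dim=(2, 3))     # (B, K) per-head error
winner = errors.argmin(dim=1)                    # sparse selection: best head per sample
loss = errors.gather(1, winner.unsqueeze(1)).mean()            # only winning heads get gradient
opt.zero_grad()
loss.backward()
opt.step()
```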

Citations: 0
MFCT: Multi-Frequency Cascade Transformers for no-reference SR-IQA
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-08 | DOI: 10.1016/j.cviu.2024.104104

Super-resolution image reconstruction techniques have advanced quickly, producing a sizable number of super-resolution images generated with different techniques. Nevertheless, accurately assessing the quality of super-resolution images remains a formidable challenge. This paper introduces the Multi-Frequency Cascade Transformers (MFCT), a novel model for evaluating super-resolution image quality (SR-IQA). First, we develop a Frequency-Divided Module (FDM) to decompose the super-resolution image into three different frequency bands. Subsequently, Cascade Transformer Blocks (CAF) incorporating hierarchical self-attention mechanisms are employed to capture cross-window features for quality perception. Finally, the image quality scores from the different frequency bands are fused to derive the overall quality score. Experimental results show that, on the chosen SR-IQA databases, the proposed MFCT-based SR-IQA method consistently outperforms all compared Image Quality Assessment (IQA) models. Furthermore, thorough ablation studies demonstrate that the proposed approach exhibits impressive generalization ability compared with earlier competitors. The code will be available at https://github.com/kbzhang0505/MFCT.
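
Below is a hedged sketch of a frequency-divided front end in the spirit of the FDM: the image is split into low/mid/high frequency bands using Gaussian blurs at two scales. The abstract does not give the actual decomposition or kernel sizes, so these choices are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def frequency_divide(img, sigma_low=5.0, sigma_mid=1.5):
    """img: (B, C, H, W) in [0, 1]; returns (low, mid, high) bands of the same shape."""
    low = TF.gaussian_blur(img, kernel_size=21, sigma=sigma_low)    # coarse structure
    smooth = TF.gaussian_blur(img, kernel_size=9, sigma=sigma_mid)  # mildly smoothed
    mid = smooth - low                                              # mid frequencies
    high = img - smooth                                             # fine detail / artifacts
    return low, mid, high

# Each band would then feed its own transformer branch; per-band quality
# scores are fused (e.g. averaged or weighted) into the final score.
bands = frequency_divide(torch.rand(2, 3, 224, 224))
print([b.shape for b in bands])
```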

Citations: 0
DHS-DETR: Efficient DETRs with dynamic head switching
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-06 | DOI: 10.1016/j.cviu.2024.104106

Detection Transformer (DETR) and its variants have emerged as a new paradigm for object detection, but their high computational cost hinders practical applications. By investigating their essential components, we found that the transformer-based head usually accounts for a significant amount of the computation. By further comparing heavy and light transformer heads, we observed that both produce satisfactory results on easy images while differing noticeably on hard images. Inspired by these findings, we propose a dynamic head switching (DHS) strategy that dynamically selects the proper head for each image at inference, for a better balance of efficiency and accuracy. Specifically, our DETR model incorporates multiple heads with different computational complexity and a lightweight module that selects the proper head for a given image. This module is optimized to maximize detection accuracy while adhering to an overall computational budget. To minimize the potential accuracy drop when executing the lighter heads, we propose online head distillation (OHD), which improves the accuracy of the lighter heads with the help of the heavier head. Extensive experiments on the MS COCO dataset validate the effectiveness of the proposed method, which demonstrates a better accuracy–efficiency trade-off than a baseline using static heads.
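
A minimal sketch of the switching idea at inference: a lightweight selector looks at backbone features and routes each image to a light or a heavy head. The selector design, the toy heads, and the routing rule are placeholders for illustration, not the DHS-DETR implementation.

```python
import torch
import torch.nn as nn

class DynamicHeadSwitch(nn.Module):
    def __init__(self, heads, feat_dim=256):
        super().__init__()
        # heads[0] is assumed cheapest, heads[-1] most accurate/expensive
        self.heads = nn.ModuleList(heads)
        self.selector = nn.Sequential(              # tiny routing module
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, len(heads)))

    @torch.no_grad()
    def forward(self, feat):                        # feat: (1, C, H, W), one image
        choice = self.selector(feat).argmax(dim=1).item()
        return self.heads[choice](feat), choice

# Toy heads standing in for light/heavy transformer decoders.
light = nn.Conv2d(256, 91, 1)
heavy = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(256, 91, 1))
model = DynamicHeadSwitch(heads=[light, heavy]).eval()
out, which = model(torch.randn(1, 256, 32, 32))
print(out.shape, "selected head:", which)
```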

Citations: 0
View-aligned pixel-level feature aggregation for 3D shape classification
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-06 | DOI: 10.1016/j.cviu.2024.104098

Multi-view 3D shape classification, which identifies a 3D shape from its 2D views rendered from different viewpoints, has emerged as a promising approach to shape understanding. A key building block in these methods is cross-view feature aggregation. However, existing methods predominantly follow the “extract-then-aggregate” pipeline for view-level global feature aggregation, leaving cross-view pixel-level feature interaction under-explored. To tackle this issue, we develop a “fuse-while-extract” pipeline with a novel View-aligned Pixel-level Fusion (VPF) module that fuses cross-view pixel-level features originating from the same 3D part. We first reconstruct the 3D coordinate of each feature from the rasterization results, then match and fuse the features via spatial neighbor search. Incorporating the proposed VPF module into a ResNet18 backbone, we build a novel view-aligned multi-view network that performs feature extraction and cross-view fusion alternately. Extensive experiments demonstrate the effectiveness of the VPF module as well as the excellent performance of the proposed network.
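
A simplified sketch of the matching-and-fusing step: given per-view pixel features and the 3D coordinate each pixel was rendered from (assumed to come from the rasterizer), each pixel in view A is averaged with its nearest neighbor in view B when the two 3D points are close. The radius threshold and the averaging rule are illustrative assumptions, not the paper's exact VPF module.

```python
import torch

def fuse_two_views(feat_a, xyz_a, feat_b, xyz_b, radius=0.05):
    """feat_*: (N, C) pixel features; xyz_*: (N, 3) their reconstructed 3D coordinates."""
    dists = torch.cdist(xyz_a, xyz_b)              # (N, N) pairwise 3D distances
    nn_dist, nn_idx = dists.min(dim=1)             # nearest pixel of view B for each pixel of A
    matched = nn_dist < radius                     # only fuse true 3D correspondences
    fused = feat_a.clone()
    fused[matched] = 0.5 * (feat_a[matched] + feat_b[nn_idx[matched]])
    return fused

feat_a, feat_b = torch.randn(1024, 64), torch.randn(1024, 64)
xyz_a, xyz_b = torch.rand(1024, 3), torch.rand(1024, 3)
print(fuse_two_views(feat_a, xyz_a, feat_b, xyz_b).shape)   # torch.Size([1024, 64])
```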

Citations: 0
Multi-dimensional attention-aided transposed ConvBiLSTM network for hyperspectral image super-resolution
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-05 | DOI: 10.1016/j.cviu.2024.104096

Compared with conventional optical images, hyperspectral (HS) images suffer from low spatial resolution, which has limited their further application in remote sensing. HS image super-resolution (SR) techniques are therefore widely employed to recover finer spatial structures while preserving the spectra of ground covers. In this paper, a novel multi-dimensional attention-aided transposed convolutional bi-directional long short-term memory (ConvBiLSTM) network is proposed for the single HS image super-resolution task. The proposed network employs the convolutional bi-directional LSTM for local and non-local spatial–spectral feature exploration, and transposed convolution for image upscaling and reconstruction. Moreover, a multi-dimensional attention module is proposed to capture salient features along the spectral, channel, and spatial dimensions simultaneously, further improving the learning ability of the network. Experiments on four commonly used HS images demonstrate the effectiveness of this approach compared with several state-of-the-art deep learning-based SR methods.
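
A hedged sketch of a multi-dimensional attention gate for a hyperspectral feature cube laid out as (batch, channels, bands, height, width): three squeeze-and-excitation style gates attend over the channel, spectral, and spatial dimensions and are applied in sequence. The tensor layout and gate designs are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiDimAttention(nn.Module):
    def __init__(self, channels, bands):
        super().__init__()
        self.channel_fc = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.spectral_fc = nn.Sequential(nn.Linear(bands, bands), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):                                     # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        ch = self.channel_fc(x.mean(dim=(2, 3, 4)))           # (B, C) channel gate
        x = x * ch.view(b, c, 1, 1, 1)
        sp = self.spectral_fc(x.mean(dim=(1, 3, 4)))          # (B, D) spectral-band gate
        x = x * sp.view(b, 1, d, 1, 1)
        sa = self.spatial_conv(x.mean(dim=(1, 2)).unsqueeze(1))   # (B, 1, H, W) spatial gate
        return x * sa.unsqueeze(2)                            # broadcast over C and D

att = MultiDimAttention(channels=32, bands=31)
print(att(torch.randn(2, 32, 31, 48, 48)).shape)              # torch.Size([2, 32, 31, 48, 48])
```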

Citations: 0
Cascaded UNet for progressive noise residual prediction for structure-preserving video denoising
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-05 | DOI: 10.1016/j.cviu.2024.104103

High-quality video services have become so prominent that, by 2030, an estimated 80% of internet traffic is expected to consist of video. By contrast, video denoising remains a relatively unexplored and intricate field, posing greater challenges than image denoising. Many published deep learning video denoising algorithms rely on simple, efficient single encoder–decoder networks, but these have inherent limitations in preserving intricate image details and in managing the propagation of noise information for noise residue modelling. In response to these challenges, this work introduces cascaded UNets for progressive noise residual prediction in video denoising. The multi-stage encoder–decoder architecture is designed to accurately predict noise residual maps, thereby preserving locally fine details of the video content, as reflected by SSIM. The network is trained end-to-end from scratch without explicit motion compensation, to reduce complexity. On the more rigorous SSIM metric, the proposed network outperforms all compared video denoising methods while maintaining a comparable PSNR.
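
A compact sketch of progressive noise residual prediction with cascaded UNets: each stage predicts a residual that refines the running denoised estimate. The tiny two-level UNet and the number of stages are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Conv2d(base * 2, ch, 3, padding=1)       # skip-connected decoder

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        return self.dec(torch.cat([self.up(e2), e1], dim=1))   # predicted noise residual

class CascadedDenoiser(nn.Module):
    def __init__(self, stages=3):
        super().__init__()
        self.stages = nn.ModuleList(TinyUNet() for _ in range(stages))

    def forward(self, noisy):
        est, residuals = noisy, []
        for unet in self.stages:           # each stage refines the running estimate
            r = unet(est)
            est = est - r                  # subtract the predicted noise residual
            residuals.append(r)
        return est, residuals

model = CascadedDenoiser()
noisy, clean = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
est, _ = model(noisy)
loss = F.mse_loss(est, clean)              # supervise the final (and optionally each) stage
print(est.shape, loss.item())
```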

Citations: 0
Joint pyramidal perceptual attention and hierarchical consistency constraint for gaze estimation
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-02 | DOI: 10.1016/j.cviu.2024.104105

Eye gaze provides valuable cues about human intent, making gaze estimation a topic of great interest. Extracting multi-scale information has recently proven effective for gaze estimation in complex scenarios. However, existing multi-scale gaze estimation methods tend to focus only on information from single-level feature maps, and information across different scales may lack relevance. To address these issues, we propose a novel joint pyramidal perceptual attention and hierarchical consistency constraint (PaCo) for gaze estimation. PaCo consists of two main components: a pyramidal perceptual attention module (PPAM) and a hierarchical consistency constraint (HCC). Specifically, PPAM first extracts multi-scale spatial features using a pyramid structure and then aggregates information from coarse to fine granularity, enabling the model to focus on both the eye region and the facial region at multiple scales. HCC then constrains the consistency between low-level and high-level features, aiming to enhance gaze semantic consistency across feature levels. With the combination of PPAM and HCC, PaCo learns more discriminative features in complex situations. Extensive experimental results show that PaCo achieves significant performance improvements on challenging datasets such as Gaze360, MPIIFaceGaze, and RT-GENE, reducing errors to 10.27°, 3.23°, and 6.46°, respectively.
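
A hedged sketch of one way to impose a hierarchical consistency constraint: low-level and high-level feature maps are globally pooled, projected to a shared embedding, and pulled together with a cosine-similarity loss. The projection heads and the cosine form of the loss are assumptions for illustration, not the paper's exact HCC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalConsistency(nn.Module):
    def __init__(self, low_ch, high_ch, embed=128):
        super().__init__()
        self.proj_low = nn.Linear(low_ch, embed)
        self.proj_high = nn.Linear(high_ch, embed)

    def forward(self, feat_low, feat_high):          # (B, C_low, H, W), (B, C_high, h, w)
        z_low = self.proj_low(feat_low.mean(dim=(2, 3)))      # global pooling + projection
        z_high = self.proj_high(feat_high.mean(dim=(2, 3)))
        # 1 - cosine similarity: zero when the two levels agree on gaze semantics
        return (1 - F.cosine_similarity(z_low, z_high, dim=1)).mean()

hcc = HierarchicalConsistency(low_ch=64, high_ch=256)
loss_hcc = hcc(torch.randn(4, 64, 56, 56), torch.randn(4, 256, 14, 14))
# total_loss = gaze_regression_loss + lambda_hcc * loss_hcc   (lambda_hcc is a hypothetical weight)
print(loss_hcc.item())
```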

Citations: 0
UC-former: A multi-scale image deraining network using enhanced transformer
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-31 | DOI: 10.1016/j.cviu.2024.104097

While convolutional neural networks (CNNs) have achieved remarkable performance on single-image deraining, the task remains very challenging due to the CNN's limited receptive field and the lack of realism in the output image. In this paper, we present UC-former, an effective and efficient transformer-based U-shaped architecture for image deraining. UC-former has two core designs that avoid heavy self-attention computation and inefficient communication between encoder and decoder. First, we propose a novel cross-channel Transformer block, which computes self-attention between channels. It significantly reduces the computational complexity on high-resolution rain maps while capturing global context. Second, we propose a multi-scale feature fusion module between the encoder and decoder to combine low-level local features and high-level non-local features. In addition, we employ depth-wise convolution and the H-Swish non-linear activation function in the Transformer blocks to enhance the authenticity of rain removal. Extensive experiments indicate that our method outperforms state-of-the-art deraining approaches on synthetic and real-world rainy datasets.
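
A minimal sketch of channel-wise self-attention as described in the abstract: attention is computed over the channel dimension, giving a C-by-C attention map instead of a spatial one, so the cost grows linearly with spatial resolution. The normalization, the learnable scale, and the 1x1 projections follow common practice and are assumptions, not UC-former's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)          # each (B, C, H, W)
        q = F.normalize(q.flatten(2), dim=-1)          # (B, C, H*W)
        k = F.normalize(k.flatten(2), dim=-1)
        attn = (q @ k.transpose(1, 2)) * self.scale    # (B, C, C): channel-to-channel attention
        attn = attn.softmax(dim=-1)
        out = attn @ v.flatten(2)                      # (B, C, H*W)
        return self.out(out.view(b, c, h, w)) + x      # residual connection

block = ChannelSelfAttention(48)
print(block(torch.randn(1, 48, 128, 128)).shape)       # torch.Size([1, 48, 128, 128])
```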

Citations: 0
Invisible gas detection: An RGB-thermal cross attention network and a new benchmark
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-31 | DOI: 10.1016/j.cviu.2024.104099

The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identifying gas leakage areas. However, developing high-quality algorithms has been challenging due to the low texture of thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas-area information from thermal images. Additionally, to facilitate research on invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database containing about 1.3K well-annotated RGB-thermal images across eight different collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods and surpassing single-stream SOTA models in accuracy, Intersection over Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data can be found at https://github.com/logic112358/RT-CAN.
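
A hedged sketch of RGB-thermal cross attention at the feature level: thermal features act as queries and attend to RGB features (keys/values), so texture cues from the RGB stream are injected into the thermal stream. The single nn.MultiheadAttention layer, the fusion direction, and the dimensions are assumptions for illustration, not RT-CAN's exact module.

```python
import torch
import torch.nn as nn

class RGBThermalCrossAttention(nn.Module):
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, thermal, rgb):               # both: (B, C, H, W)
        b, c, h, w = thermal.shape
        t = thermal.flatten(2).transpose(1, 2)     # (B, H*W, C) thermal tokens (queries)
        r = rgb.flatten(2).transpose(1, 2)         # (B, H*W, C) RGB tokens (keys/values)
        fused, _ = self.attn(query=t, key=r, value=r)
        fused = self.norm(fused + t)               # residual on the thermal stream
        return fused.transpose(1, 2).view(b, c, h, w)

fuse = RGBThermalCrossAttention()
out = fuse(torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20))
print(out.shape)                                   # torch.Size([1, 256, 20, 20])
```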

Citations: 0