Pub Date: 2026-01-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104563
Zhi Chen, Zhen Yu
Transformer-based trackers have achieved impressive performance due to their powerful global modeling capability. However, most existing methods employ vanilla attention modules, which treat template and search regions homogeneously and overlook the distinct characteristics of different frequency features—high-frequency components capture local details critical for target identification, while low-frequency components provide global structural context. To bridge this gap, we propose a novel Transformer architecture with High-low (Hi–Lo) frequency attention for visual object tracking. Specifically, a high-frequency attention module is applied to the template region to preserve fine-grained target details. Conversely, a low-frequency attention module processes the search region to efficiently capture global dependencies with reduced computational cost. Furthermore, we introduce a Global–Local Dual Interaction (GLDI) module to establish reciprocal feature enhancement between the template and search feature maps, effectively integrating multi-frequency information. Extensive experiments on six challenging benchmarks (LaSOT, GOT-10k, TrackingNet, UAV123, OTB100, and NFS) demonstrate that our method, named HiLoTT, achieves state-of-the-art performance while maintaining a real-time speed of 45 frames per second.
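To make the Hi–Lo split concrete, here is a minimal numpy sketch of the general idea (not the authors' code; all function names, window sizes, and shapes are illustrative assumptions): high-frequency attention operates within small local windows to preserve detail, while low-frequency attention lets full-resolution queries attend to average-pooled keys/values, reducing the quadratic cost on the larger search region.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hi_freq_attention(x, window=4):
    """Local window self-attention over a (N, C) template feature map."""
    n, c = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):          # attend only inside each window
        w = x[start:start + window]
        attn = softmax(w @ w.T / np.sqrt(c))
        out[start:start + window] = attn @ w
    return out

def lo_freq_attention(x, pool=4):
    """Global attention with average-pooled keys/values on a (N, C) search map."""
    n, c = x.shape
    kv = x.reshape(n // pool, pool, c).mean(axis=1)  # downsampled K/V: fewer tokens
    attn = softmax(x @ kv.T / np.sqrt(c))            # queries stay full resolution
    return attn @ kv

rng = np.random.default_rng(0)
template = rng.normal(size=(16, 8))   # e.g. 4x4 template tokens, 8 channels
search = rng.normal(size=(64, 8))     # e.g. 8x8 search tokens
print(hi_freq_attention(template).shape, lo_freq_attention(search).shape)
```

With `pool=4`, the low-frequency branch attends to 16 pooled tokens instead of 64, which is where the computational saving described above comes from.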
Title: Transformer tracking with high-low frequency attention. Computer Vision and Image Understanding, vol. 263, Article 104563.
Pub Date: 2026-01-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104587
Jiafeng Li, Jiajun Sun, Ziqing Li, Jing Zhang, Li Zhuo
The identification of driver behavior plays a vital role in the autonomous driving systems of intelligent vehicles. However, the complexity of real-world driving scenarios presents significant challenges. Several existing approaches struggle to effectively exploit multimodal feature-level fusion and suffer from suboptimal temporal modeling, resulting in unsatisfactory performance. We introduce a new multimodal framework that combines RGB frames with skeletal data at the feature level, incorporating a frame-adaptive convolution mechanism to improve temporal modeling. Specifically, we first propose the local spatial attention enhancement module (LSAEM). This module refines RGB features using local spatial attention from skeletal features, prioritizing critical local regions and mitigating the negative effects of complex backgrounds in the RGB modality. Next, we introduce the heatmap enhancement module (HEM), which enriches skeletal features with contextual scene information from RGB heatmaps, thus addressing the lack of local scene context in skeletal data. Finally, we propose a frame-adaptive convolution mechanism that dynamically adjusts convolutional weights per frame, emphasizing key temporal frames and further strengthening the model’s temporal modeling capabilities. Extensive experiments on the Drive&Act dataset validate the efficacy of the presented approach, showing remarkable enhancements in recognition accuracy as compared to existing SOTA methods.
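The frame-adaptive convolution idea can be sketched as follows (an illustrative interpretation, not the authors' implementation; the gating function and shapes are assumptions): a lightweight gate predicts a per-frame scalar that rescales a shared temporal kernel, so informative frames contribute more to the temporal aggregation.

```python
import numpy as np

def frame_adaptive_conv(features, kernel, gate_w):
    """features: (T, C) per-frame features; kernel: (K, C) shared temporal kernel;
    gate_w: (C,) gate weights producing one scalar gate per frame."""
    T, C = features.shape
    K = kernel.shape[0]
    gates = 1.0 / (1.0 + np.exp(-(features @ gate_w)))  # sigmoid gate, shape (T,)
    pad = K // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros((T, C))
    for t in range(T):
        # the shared kernel is rescaled by this frame's gate before applying
        w_t = kernel * gates[t]
        out[t] = (padded[t:t + K] * w_t).sum(axis=0)
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))        # 8 frames, 4 channels
k = rng.normal(size=(3, 4))        # temporal kernel of size 3
y = frame_adaptive_conv(x, k, rng.normal(size=4))
print(y.shape)  # (8, 4)
```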
Title: Multimodal driver behavior recognition based on frame-adaptive convolution and feature fusion. Computer Vision and Image Understanding, vol. 263, Article 104587.
Pub Date: 2026-01-01 | Epub Date: 2025-12-12 | DOI: 10.1016/j.cviu.2025.104601
Ping Li, Tao Wang, Zeyu Pan
Video captioning generates a descriptive sentence for a video. Existing methods rely on plentiful annotated captions to train the model, but collecting so many captions is usually very expensive. This raises the challenge of how to generate video captions from unpaired videos and sentences, i.e., zero-shot video captioning. While some progress has been made in zero-shot image captioning using Large Language Models (LLMs), such methods still fail to consider temporal relations in the video domain. Directly adapting LLM-based image methods to video may therefore easily produce incorrect verbs and nouns in the generated sentences. To address this problem, we propose the Temporal Prompt guided Visual–text–object Alignment (TPVA) approach for zero-shot video captioning. It consists of a temporal prompt guidance module and a visual–text–object alignment module. The former employs a pre-trained action recognition model to yield the action class as the key word of the temporal prompt, which guides the LLM to generate a text phrase containing the verb identifying the action. The latter implements both visual–text alignment and text–object alignment by computing their respective similarity scores, which allows the model to generate words that better reveal the video semantics. Experimental results on several benchmarks demonstrate the superiority of the proposed method in zero-shot video captioning. Code is available at https://github.com/mlvccn/TPVA_VidCap_ZeroShot.
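The similarity-score alignment can be illustrated with a minimal sketch (the exact scoring scheme and embedding dimensions here are assumptions, not taken from the paper): candidate words are ranked by cosine similarity between their text embeddings and the video's visual embedding.

```python
import numpy as np

def cosine_scores(visual, text_bank):
    """visual: (D,) video embedding; text_bank: (N, D) candidate word embeddings.
    Returns one cosine similarity per candidate."""
    v = visual / np.linalg.norm(visual)
    t = text_bank / np.linalg.norm(text_bank, axis=1, keepdims=True)
    return t @ v

rng = np.random.default_rng(2)
video_emb = rng.normal(size=16)            # hypothetical visual embedding
candidates = rng.normal(size=(5, 16))      # hypothetical candidate word embeddings
scores = cosine_scores(video_emb, candidates)
best = int(np.argmax(scores))              # index of the best-aligned candidate
print(scores.shape, best)
```

In the paper's setting, the analogous scores for visual–text and text–object pairs would steer the LLM toward words consistent with the video content.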
Title: Temporal prompt guided visual–text–object alignment for zero-shot video captioning. Computer Vision and Image Understanding, vol. 263, Article 104601.
Pub Date: 2026-01-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104575
Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård
3D reconstruction is now a key capability in computer vision. With advances in NeRFs and Gaussian Splatting, there is a growing need to capture data properly for these algorithms and to use them in real-world scenarios. Most publicly available datasets suitable for Gaussian Splatting do not support proper statistical analysis of reducing the number of cameras, or of the effect of uniformly placed versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction, so designing a proper data capture system with an appropriate number of cameras is crucial. In this paper, the UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on the effect that decreasing the number of viewpoints has on Gaussian Splatting. It is found that once the number of cameras exceeds 100, the train and test metrics saturate, and additional cameras have no significant impact on reconstruction quality.
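A saturation analysis of this kind can be sketched numerically (the curve below is synthetic and the threshold is an arbitrary assumption, chosen only to illustrate the procedure): compute the marginal quality gain between successive camera counts and report the count after which gains become negligible.

```python
import numpy as np

def saturation_point(counts, psnr, min_gain=0.1):
    """Return the first camera count after which the PSNR gain per step
    drops below min_gain dB."""
    gains = np.diff(psnr)
    for i, g in enumerate(gains):
        if g < min_gain:
            return counts[i]
    return counts[-1]

counts = np.array([25, 50, 75, 100, 125, 150])
psnr = np.array([24.0, 27.5, 29.0, 29.8, 29.85, 29.88])  # made-up quality curve
print(saturation_point(counts, psnr))  # 100 with this toy data
```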
Title: Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective. Computer Vision and Image Understanding, vol. 263, Article 104575.
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104574
Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin
Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts into fusion results when there are slight shifts or deformations between source images. Although joint training schemes for registration and fusion improve fusion results through the feedback of fusion on registration, they still face the challenges of unstable registration accuracy and artifacts caused by local non-rigid distortions. To this end, we propose a new misaligned infrared and visible image fusion method, named CLAFusion. It introduces a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from the same scene and to increase the differences between images from different scenes, improving the stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration, realizing precise feature alignment and suppressing misaligned redundant features, thereby alleviating artifacts in fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.
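The contrastive objective underlying a module like CLMFE can be sketched with an InfoNCE-style loss (the paper's exact loss is not specified here; the temperature and feature shapes are assumptions): paired infrared/visible features from the same scene are pulled together, while features from different scenes are pushed apart.

```python
import numpy as np

def info_nce(ir_feats, vis_feats, temperature=0.1):
    """ir_feats, vis_feats: (N, D) L2-normalized features; row i of each
    comes from the same scene. Lower loss = better cross-modal matching."""
    sim = ir_feats @ vis_feats.T / temperature        # (N, N) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)             # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # diagonal = matched pairs

rng = np.random.default_rng(3)
f = rng.normal(size=(4, 8))
f /= np.linalg.norm(f, axis=1, keepdims=True)
noisy = f + 0.05 * rng.normal(size=f.shape)           # simulated "other modality"
noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
# correctly paired features give a much lower loss than shuffled pairs
print(info_nce(f, noisy) < info_nce(f, f[::-1]))
```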
Title: CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention. Computer Vision and Image Understanding, vol. 263, Article 104574.
Pub Date: 2026-01-01 | Epub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104611
Yu Zhu, Liqiang Song, Junli Zhao, Guodong Wang, Hui Li, Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning (DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to enhance diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
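The cross-attention interaction between modalities can be shown in a minimal sketch (shapes and the CT/MRI naming below are illustrative assumptions, not SCAFNet's actual design): tokens from one modality act as queries against the keys/values of the other, so each modality's features are enriched with complementary information.

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    """q_feats: (Nq, D) tokens of modality A; kv_feats: (Nk, D) of modality B.
    Returns A's tokens updated with information attended from B."""
    d = q_feats.shape[1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over B's tokens
    return attn @ kv_feats

rng = np.random.default_rng(4)
ct_tokens = rng.normal(size=(6, 8))    # hypothetical tokens from modality A
mri_tokens = rng.normal(size=(10, 8))  # hypothetical tokens from modality B
print(cross_attention(ct_tokens, mri_tokens).shape)  # (6, 8)
```

Self-attention is the special case where queries and keys/values come from the same modality, which is how the intra-modal branch described above differs from this inter-modal one.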
Title: SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention. Computer Vision and Image Understanding, vol. 263, Article 104611.
Pub Date: 2026-01-01 | Epub Date: 2025-11-27 | DOI: 10.1016/j.cviu.2025.104586
Dong Sui, Nanting Song, Xiao Tian, Han Zhou, Yacong Li, Maozu Guo, Kuanquan Wang, Gongning Luo
Diffusion Probabilistic Models (DPMs) are effective in medical image translation (MIT), but they tend to lose high-frequency details during the noise-addition process, making these details challenging to recover during denoising. This hinders the model’s ability to accurately preserve anatomical details in MIT tasks, which may ultimately affect the accuracy of diagnostic outcomes. To address this issue, we propose a diffusion model (GL2T-Diff) based on convolutional channel and Laplacian frequency attention mechanisms, designed to enhance MIT tasks by effectively preserving critical image features. We introduce two novel modules: the Global Channel Correlation Attention Module (GC2A Module) and the Laplacian Frequency Attention Module (LFA Module). The GC2A Module enhances the model’s ability to capture global dependencies between channels, while the LFA Module effectively retains high-frequency components, which are crucial for preserving anatomical structures. To leverage the complementary strengths of the GC2A and LFA Modules, we propose the Laplacian Convolutional Attention with Phase-Amplitude Fusion (FusLCA), which facilitates effective integration of spatial- and frequency-domain features. Experimental results show that GL2T-Diff outperforms state-of-the-art (SOTA) methods, including those based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and other DPMs, across the BraTS-2021/2024, IXI, and Pelvic datasets. The code is available at https://github.com/puzzlesong8277/GL2T-Diff.
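The high-frequency extraction that a Laplacian-based module builds on can be illustrated with a plain 3x3 discrete Laplacian (this is a textbook filter, not the LFA Module itself): it responds strongly at edges and fine detail while vanishing in flat regions, which is exactly the content the abstract says diffusion models tend to lose.

```python
import numpy as np

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def laplacian_filter(img):
    """Apply a 3x3 Laplacian to a 2D image, 'same' size via zero padding."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + 3, j:j + 3] * LAPLACIAN).sum()
    return out

flat = np.ones((8, 8))                       # flat region: no high-freq content
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0   # a vertical step edge
# interior response: 0.0 on the flat image, 1.0 along the step edge
print(np.abs(laplacian_filter(flat)[1:-1, 1:-1]).max(),
      np.abs(laplacian_filter(edge)[1:-1, 1:-1]).max())
```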
Title: GL2T-Diff: Medical image translation via spatial-frequency fusion diffusion models. Computer Vision and Image Understanding, vol. 263, Article 104586.
Pub Date: 2026-01-01 | Epub Date: 2025-11-26 | DOI: 10.1016/j.cviu.2025.104578
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
Facial analysis plays a vital role in assistive technologies aimed at improving human–computer interaction, emotional well-being, and non-verbal communication monitoring. For more fine-grained tasks, however, standard sensors may fall short due to their latency, making it impossible to record and detect the micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper, we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered a major cause of the gap between the maturity of RGB and neuromorphic vision models. Gathering data is indeed harder in the event domain, since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of both RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and also contains streams collected for a variety of possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization allows effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. This makes our model suitable for real-world assistive scenarios, including privacy-preserving wearable systems and responsive social interaction monitoring.
Our proposed model outperforms baseline methods by capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.
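Shifted Patch Tokenization, as commonly described in the Vision Transformer literature, can be sketched as follows (this is a generic sketch, not the authors' code; the shift pattern and patch size are assumptions): diagonally shifted copies of the input are concatenated channel-wise before patch splitting, so each patch token sees a wider spatial context.

```python
import numpy as np

def shifted_patch_tokenize(img, patch=4):
    """img: (H, W, C). Concatenates the image with four diagonally shifted
    copies, then splits into non-overlapping patches.
    Returns (num_patches, patch*patch*5C) tokens."""
    h, w, c = img.shape
    s = patch // 2
    shifts = [(s, s), (s, -s), (-s, s), (-s, -s)]
    stack = [img] + [np.roll(img, sh, axis=(0, 1)) for sh in shifts]
    x = np.concatenate(stack, axis=2)              # (H, W, 5C)
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(x[i:i + patch, j:j + patch].reshape(-1))
    return np.array(tokens)

img = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)   # toy 8x8, 2-channel input
tok = shifted_patch_tokenize(img)
print(tok.shape)  # (4, 160): 4 patches, each 4*4*(5*2) values
```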
Title: Spatio-temporal transformers for action unit classification with event cameras. Computer Vision and Image Understanding, vol. 263, Article 104578.
Pub Date: 2026-01-01
Epub Date: 2025-11-26
DOI: 10.1016/j.cviu.2025.104572
Hongkun Zhang, Yan Wu, Zhengbin Zhang
Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volumes. Extensive experiments demonstrate the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art AP scores of 83.57% and 77.78%, respectively, while maintaining an approximately 20% relative improvement under severe bandwidth constraints (2^14 B). Our qualitative experiments illustrate the working mechanism of What2Keep. Code will be available at https://github.com/CHAMELENON/What2Keep.
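The abstract does not detail how features are chosen under a bandwidth budget, and the paper's consensus mechanism is certainly more involved than a simple ranking. As a hedged sketch of the general idea, bandwidth-adaptive selection of intermediate features can be illustrated as keeping only the top-scoring spatial cells of a feature map that fit the current budget; the scoring rule and function name below are assumptions for illustration only.

```python
import numpy as np

def select_for_budget(feat, budget_ratio):
    """Bandwidth-aware feature selection sketch (not What2Keep's exact
    consensus mechanism): rank the spatial cells of an intermediate feature
    map by activation magnitude and keep only the fraction that fits the
    current communication budget, zeroing the rest before transmission.

    feat: (C, H, W) feature map; budget_ratio in (0, 1].
    Returns the sparsified map and the boolean keep-mask.
    """
    C, H, W = feat.shape
    scores = np.linalg.norm(feat.reshape(C, -1), axis=0)   # one score per cell
    k = max(1, int(round(budget_ratio * H * W)))           # cells the budget allows
    keep = np.zeros(H * W, dtype=bool)
    keep[np.argsort(scores)[::-1][:k]] = True              # keep top-k cells
    return feat * keep.reshape(1, H, W), keep.reshape(H, W)
```

A receiver-side fusion module would then aggregate whatever sparse maps arrive, which is why robustness to varying communication volume matters.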
{"title":"What2Keep: A communication-efficient collaborative perception framework for 3D detection via keeping valuable information","authors":"Hongkun Zhang, Yan Wu, Zhengbin Zhang","doi":"10.1016/j.cviu.2025.104572","DOIUrl":"10.1016/j.cviu.2025.104572","url":null,"abstract":"<div><div>Collaborative perception has attracted significant attention in autonomous driving, as the ability to share information among Connected Autonomous Vehicles (CAVs) substantially enhances perception performance. However, collaborative perception faces critical challenges, among which limited communication bandwidth remains a fundamental bottleneck due to inherent constraints in current communication technologies. Bandwidth limitations can severely degrade transmitted information, leading to a sharp decline in perception performance. To address this issue, we propose What To Keep (What2Keep), a collaborative perception framework that dynamically adapts to communication bandwidth fluctuations. Our method aims to establish a consensus between vehicles, prioritizing the transmission of intermediate features that are most critical to the ego vehicle. The proposed framework offers two key advantages: (1) the consensus-based feature selection mechanism effectively incorporates different collaborative patterns as prior knowledge to help vehicles preserve the most valuable features, improving communication efficiency and enhancing model robustness against communication degradation; and (2) What2Keep employs a cross-vehicle fusion strategy that effectively aggregates cooperative perception information while exhibiting robustness against varying communication volumes. 
Extensive experiments demonstrate the superior performance of our method on the OPV2V and V2XSet benchmarks, achieving state-of-the-art AP scores of 83.57% and 77.78%, respectively, while maintaining approximately 20% relative improvement under severe bandwidth constraints (<span><math><mrow><msup><mrow><mn>2</mn></mrow><mrow><mn>14</mn></mrow></msup><mtext>B</mtext></mrow></math></span>). Our qualitative experiments illustrate the working mechanism of What2Keep. Code will be available at <span><span>https://github.com/CHAMELENON/What2Keep</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104572"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks, complementing attention with Mamba's superior long-sequence modeling capabilities and enabling a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
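The abstract describes the FAM as dynamically adapting feature fusion to the decoding context but gives no formulation. A common way to realize such behavior is a learned sigmoid gate conditioned on the decoder state; the sketch below shows that generic pattern under stated assumptions (the function name, the two-stream split into "local" attention features and "global" Mamba features, and the gate parameterization are all illustrative, not the paper's definition).

```python
import numpy as np

def fam_fuse(h, v_local, v_global, Wg, bg):
    """Context-gated fusion sketch in the spirit of a Feature Aggregation
    Module: the decoder state `h` together with the two feature streams
    produces a per-dimension gate deciding how much local (attention-derived)
    versus global (state-space-derived) context enters the fused vector.
    h, v_local, v_global: (d,) vectors; Wg: (d, 3d); bg: (d,).
    """
    z = np.concatenate([h, v_local, v_global])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z + bg)))   # sigmoid gate in (0, 1)
    return g * v_local + (1.0 - g) * v_global
```

With zero-initialized gate parameters the two streams are averaged; training then lets the gate lean toward whichever stream the current decoding step needs.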
{"title":"A dynamic hybrid network with attention and mamba for image captioning","authors":"Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li","doi":"10.1016/j.cviu.2025.104617","DOIUrl":"10.1016/j.cviu.2025.104617","url":null,"abstract":"<div><div>Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks, complementing attention with Mamba's superior long-sequence modeling capabilities and enabling a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. 
The implementation code is accessible via <span><span>https://github.com/simple-boy/DH-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104617"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}