Visual Language Models (VLMs) have demonstrated superior context understanding and generalization across a wide range of tasks compared to models tailored to specific tasks. However, because of their complexity and the limited information available about their training processes, estimating their performance on a specific task often requires exhaustive testing, which is costly and may not cover edge cases. To leverage the zero-shot capabilities of VLMs in safety-critical applications such as Driver Monitoring Systems, it is crucial to characterize their knowledge and abilities so that consistent performance can be ensured. This research proposes a methodology to explore and gain a deeper understanding of how these models function in driver gaze estimation. It involves a detailed task decomposition; identification of the data, knowledge, and abilities required (e.g., understanding gaze concepts); and exploration through targeted prompting strategies. Applying this methodology to several VLMs (Idefics2, Qwen2-VL, Moondream, GPT-4o) revealed significant limitations, including sensitivity to prompt phrasing, vocabulary mismatches, reliance on image-relative spatial frames, and difficulty inferring non-visible elements. These findings highlight specific areas for improvement and guided the development of more effective prompting and fine-tuning strategies, resulting in performance comparable to traditional CNN-based approaches. The methodology is also useful for initial model filtering, for selecting the best model among alternatives, and for understanding a model's limitations and expected behaviors, thereby increasing reliability.
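As an illustration of the kind of targeted, zero-shot prompting the abstract refers to, the following is a minimal sketch (not taken from the paper) that queries Qwen2-VL, one of the evaluated models, for the driver's gaze zone through the Hugging Face transformers API. The prompt wording, the gaze-zone vocabulary, and the image path are illustrative assumptions, not the authors' actual protocol.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load Qwen2-VL-Instruct (requires a recent transformers release and accelerate).
model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical in-cabin frame of the driver.
image = Image.open("driver_frame.jpg")

# Targeted prompt: fixes the spatial frame of reference and constrains the
# answer to a closed gaze-zone vocabulary, two of the failure modes the
# abstract mentions (image-relative frames and vocabulary mismatches).
prompt = (
    "You are analyzing an in-cabin image of a car driver. "
    "From the driver's point of view, where is the driver looking? "
    "Answer with exactly one of: road ahead, rear-view mirror, left mirror, "
    "right mirror, instrument cluster, center console, passenger."
)

messages = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens (skip the echoed prompt).
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Variations of the prompt (rephrasing, changing the viewpoint from "driver" to "camera", or leaving the answer open-ended) are the kind of probes that expose the sensitivity to phrasing and spatial framing reported in the evaluation.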
