
Latest publications from Computer Vision and Image Understanding

Bidirectional brain image translation using transfer learning from generic pre-trained models
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-31 · DOI: 10.1016/j.cviu.2024.104100

Brain imaging plays a crucial role in the diagnosis and treatment of various neurological disorders, providing valuable insights into the structure and function of the brain. Techniques such as magnetic resonance imaging (MRI) and computed tomography (CT) enable non-invasive visualization of the brain, aiding in the understanding of brain anatomy, abnormalities, and functional connectivity. However, cost and radiation dose may limit the acquisition of specific image modalities, so medical image synthesis can be used to generate the required medical images without additional acquisition. CycleGAN and other GANs are valuable tools for generating synthetic images across various fields. In the medical domain, where obtaining labeled medical images is labor-intensive and expensive, addressing data scarcity is a major challenge. Recent studies propose using transfer learning to overcome this issue. This involves adapting pre-trained CycleGAN models, initially trained on non-medical data, to generate realistic medical images. In this work, transfer learning was applied to the task of MR-to-CT image translation and vice versa using 18 pre-trained non-medical models, and the models were fine-tuned to achieve the best results. The models' performance was evaluated using four widely used image quality metrics: Peak Signal-to-Noise Ratio, Structural Similarity Index, Universal Quality Index, and Visual Information Fidelity. Quantitative evaluation and qualitative perceptual analysis by radiologists demonstrate the potential of transfer learning in medical imaging and the effectiveness of the generic pre-trained model. The results provide compelling evidence of the model's exceptional performance, which can be attributed to the high quality of the training images and their similarity to actual human brain images. These results underscore the significance of carefully selecting appropriate and representative training images to optimize performance in brain image analysis tasks.
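For readers who want to reproduce the evaluation protocol, the sketch below computes two of the four reported metrics (PSNR and SSIM) with scikit-image; the slice pair, value range, and preprocessing are illustrative assumptions, and the UQI and VIF metrics would need separate implementations.

```python
# Minimal sketch: compare a synthesized CT slice against a reference slice
# using two of the metrics named in the abstract (PSNR, SSIM). Assumes both
# slices are co-registered 2D arrays scaled to [0, 1]; inputs are placeholders.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(reference: np.ndarray, synthesized: np.ndarray) -> dict:
    psnr = peak_signal_noise_ratio(reference, synthesized, data_range=1.0)
    ssim = structural_similarity(reference, synthesized, data_range=1.0)
    return {"PSNR": psnr, "SSIM": ssim}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.random((256, 256))  # placeholder for a real, registered CT slice
    synth = np.clip(ref + 0.05 * rng.standard_normal(ref.shape), 0.0, 1.0)
    print(evaluate_pair(ref, synth))
```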

Citations: 0
Image semantic segmentation of indoor scenes: A survey
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-31 · DOI: 10.1016/j.cviu.2024.104102

This survey provides a comprehensive evaluation of various deep learning-based segmentation architectures. It covers a wide range of models, from traditional ones like FCN and PSPNet to more modern approaches like SegFormer and FAN. In addition to assessing the methods in terms of segmentation accuracy, we propose to also evaluate the methods in terms of temporal consistency and corruption vulnerability. Most of the existing surveys on semantic segmentation focus on outdoor datasets. In contrast, this survey focuses on indoor scenarios to enhance the applicability of segmentation methods in this specific domain. Furthermore, our evaluation consists of a performance analysis of the methods in prevalent real-world segmentation scenarios that pose particular challenges. These complex situations involve scenes impacted by diverse forms of noise, blur corruptions, camera movements, optical aberrations, among other factors. By jointly exploring the segmentation accuracy, temporal consistency, and corruption vulnerability in challenging real-world situations, our survey offers insights that go beyond existing surveys, facilitating the understanding and development of better image segmentation methods for indoor scenes.
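As one concrete way to quantify the corruption vulnerability mentioned above, the sketch below (an assumption for illustration, not a metric taken from the survey) measures the relative mIoU drop between clean and corrupted versions of the same scenes.

```python
# Sketch: per-class IoU, mIoU, and a relative corruption-degradation score.
# Label maps are integer arrays of the same shape; class ids and values are
# illustrative only.
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def corruption_degradation(miou_clean: float, miou_corrupted: float) -> float:
    # Relative drop in mIoU when the same model is fed corrupted inputs.
    return (miou_clean - miou_corrupted) / miou_clean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 5, (64, 64))
    pred_clean = gt.copy()
    noise = rng.integers(0, 5, (64, 64))
    pred_corrupted = np.where(rng.random((64, 64)) < 0.2, noise, gt)
    m_clean, m_corr = miou(pred_clean, gt, 5), miou(pred_corrupted, gt, 5)
    print(m_clean, m_corr, corruption_degradation(m_clean, m_corr))
```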

Citations: 0
Neural image re-exposure
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-28 · DOI: 10.1016/j.cviu.2024.104094

Images and videos often suffer from issues such as motion blur, video discontinuity, or rolling shutter artifacts. Prior studies typically focus on designing specific algorithms to address individual issues. In this paper, we highlight that these issues, albeit differently manifested, fundamentally stem from sub-optimal exposure processes. With this insight, we propose a paradigm termed re-exposure, which resolves the aforementioned issues by performing exposure simulation. Following this paradigm, we design a new architecture, which constructs visual content representation from images and event camera data, and performs exposure simulation in a controllable manner. Experiments demonstrate that, using only a single model, the proposed architecture can effectively address multiple visual issues, including motion blur, video discontinuity, and rolling shutter artifacts, even when these issues co-occur.
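To make the exposure-simulation idea concrete, the toy sketch below integrates a stack of sharp frames over a virtual shutter window; this is only an illustrative approximation of the exposure process, not the paper's learned architecture or its event-camera pipeline.

```python
# Toy exposure simulation: average the frames that fall inside a virtual
# shutter interval. Frame stack shape is (T, H, W); values are illustrative.
import numpy as np

def simulate_exposure(frames: np.ndarray, open_idx: int, close_idx: int) -> np.ndarray:
    """Integrate frames[open_idx:close_idx] as one virtual exposure."""
    window = frames[open_idx:close_idx]
    return window.mean(axis=0)  # a longer window yields stronger motion blur

if __name__ == "__main__":
    frames = np.random.rand(16, 8, 8)
    short_exposure = simulate_exposure(frames, 7, 9)   # nearly sharp
    long_exposure = simulate_exposure(frames, 0, 16)   # heavily blurred
    print(short_exposure.shape, long_exposure.shape)
```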

Citations: 0
Uni MS-PS: A multi-scale encoder-decoder transformer for universal photometric stereo
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-27 · DOI: 10.1016/j.cviu.2024.104093

Photometric Stereo (PS) addresses the challenge of reconstructing a three-dimensional (3D) representation of an object by estimating the 3D normals at all points on the object's surface. This is achieved through the analysis of at least three photographs, all taken from the same viewpoint but under distinct lighting conditions. This paper introduces a novel approach for Universal PS, i.e., when both the active lighting conditions and the ambient illumination are unknown. Our method employs a multi-scale encoder–decoder architecture based on Transformers that can accommodate images of any resolution as well as a varying number of input images. We are able to scale up to very high resolution images, such as 6000 by 8000 pixels, without losing performance while maintaining a decent memory footprint. Moreover, experiments on publicly available datasets establish that our proposed architecture improves the accuracy of the estimated normal field by a significant factor compared to state-of-the-art methods. Code and dataset available at: https://clement-hardy.github.io/Uni-MS-PS/index.html.
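For readers unfamiliar with the task, the classical calibrated Lambertian baseline below recovers per-pixel normals from k >= 3 images with known light directions by least squares; it is the textbook formulation, not the Uni MS-PS transformer.

```python
# Classical calibrated photometric stereo (Lambertian, least squares).
# images: (k, H, W) grayscale observations; lights: (k, 3) unit light directions.
import numpy as np

def lambertian_ps(images: np.ndarray, lights: np.ndarray) -> np.ndarray:
    k, h, w = images.shape
    obs = images.reshape(k, -1)                        # (k, H*W)
    g, *_ = np.linalg.lstsq(lights, obs, rcond=None)   # solve lights @ g = obs
    albedo = np.linalg.norm(g, axis=0) + 1e-8
    return (g / albedo).T.reshape(h, w, 3)             # unit normal per pixel

if __name__ == "__main__":
    lights = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
    lights /= np.linalg.norm(lights, axis=1, keepdims=True)
    true_normal = np.array([0.0, 0.0, 1.0])
    shading = np.clip(lights @ true_normal, 0, None).reshape(3, 1, 1)
    images = shading * np.ones((3, 4, 4))              # flat, fronto-parallel patch
    print(lambertian_ps(images, lights)[0, 0])         # approx. [0, 0, 1]
```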

Citations: 0
Image-to-image translation based face photo de-meshing using GANs
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-25 · DOI: 10.1016/j.cviu.2024.104080

Most of the existing face photo de-meshing methods have achieved promising results; however, they suffer from certain quality problems, such as blurry inpainted regions and visible, unpleasant boundaries. Such artifacts make the generated face photos look unreal. Therefore, we propose an effective image-to-image translation framework called Face De-meshing Using Generative Adversarial Networks (De-mesh GANs). The De-mesh GANs is a two-stage model: (i) the binary mask generating module, a three-convolution-layer encoder–decoder network that automatically generates a binary mask for the meshed region, and (ii) the face photo de-meshing module, a GANs-based network that eliminates the mesh mask and synthesizes the meshed area. A combination of carefully chosen losses (reconstruction loss, adversarial loss, and perceptual loss) is used to ensure the better quality of the de-meshed face photos. To facilitate the training of the proposed model, we have designed a dataset of clean/corrupted photo pairs using the CelebA dataset. Qualitative and quantitative evaluations of the De-mesh GANs on real-world corrupted face photo images show better performance than previously proposed face photo de-meshing models. Furthermore, we also offer an ablation study to assess the performance of the additional network, i.e., the perceptual network.
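The abstract lists three loss terms for the de-meshing generator. The fragment below is a hedged sketch of how such a combined objective is commonly assembled in PyTorch; the weights, the VGG-based feature extractor, and the discriminator interface are assumptions, not the paper's exact configuration.

```python
# Sketch of a combined generator objective with reconstruction, adversarial,
# and perceptual terms. `generator`, `discriminator`, and `vgg_features` are
# assumed callables/modules; the loss weights are illustrative.
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, vgg_features,
                   meshed, clean, w_rec=1.0, w_adv=0.01, w_perc=0.1):
    fake = generator(meshed)
    rec = F.l1_loss(fake, clean)                                # reconstruction
    logits = discriminator(fake)
    adv = F.binary_cross_entropy_with_logits(                   # fool the critic
        logits, torch.ones_like(logits))
    perc = F.l1_loss(vgg_features(fake), vgg_features(clean))   # perceptual
    return w_rec * rec + w_adv * adv + w_perc * perc
```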

Citations: 0
Challenges and solutions for vision-based hand gesture interpretation: A review
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-25 · DOI: 10.1016/j.cviu.2024.104095

Hand gesture is one of the most efficient and natural interfaces in current human–computer interaction (HCI) systems. Despite the great progress achieved in hand gesture-based HCI, perceiving or tracking the hand pose from images remains challenging. In the past decade, several challenges have been identified and explored, such as the incomplete data issue, the requirement for large-scale annotated datasets, and 3D hand pose estimation from monocular RGB images; however, there is a lack of surveys providing a comprehensive collection and analysis of these challenges and their corresponding solutions. To this end, this paper devotes effort to the general challenges of hand gesture interpretation techniques in HCI systems based on visual sensors and elaborates on the corresponding solutions in the current state of the art, which can provide a systematic reminder of the practical problems of hand gesture interpretation. Moreover, this paper provides informative cues for recent datasets to further point out the inherent differences and connections among them, such as the annotation of objects and the number of hands, which are important for conducting research yet were ignored by previous reviews. In retrospect of recent developments, this paper also conjectures what future work will concentrate on, from the perspectives of both hand gesture interpretation and dataset construction.

Citations: 0
Object discriminability re-extraction for distractor-aware visual object tracking
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-24 · DOI: 10.1016/j.cviu.2024.104075

The similar distractor problem is one of the most difficult challenges for Siamese-based trackers. Since they formulate the visual tracking task as a similarity matching problem, these trackers suffer from an essential weakness: they are sensitive to intra-class and inter-class instances with similar, easily confused appearances. To solve this problem, we propose an object discriminability re-extraction network (ODR-Net) for distractor-aware visual object tracking. The network first mines similar distractors from existing tracking information with a distractor capture module, and then re-extracts discriminative features to redetect the target from distractors with a discriminative feature re-extraction module. It solves the distractor problem in the decoding phase of a tracker and can be considered a general block that can be applied to existing Siamese trackers to tackle the similar distractor problem. To demonstrate the effectiveness of the proposed method, extensive experiments and comparisons with state-of-the-art trackers are conducted on a variety of large-scale benchmark datasets, including GOT-10k, LaSOT, OTB-2015, TrackingNet, VOT2020, VOT2021, and VOT2022. Without bells and whistles, our ODR-Net achieves leading performance at real-time speed.

Citations: 0
Enhanced local distribution learning for real image super-resolution
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-20 · DOI: 10.1016/j.cviu.2024.104092

Previous work has shown that CNN-based local distribution learning can efficiently reconstruct high-resolution images, but with limited performance improvement on complex degraded images. In this paper, we propose an enhanced local distribution learning framework, called ELDRN, which successfully generalizes local distribution learning to realistic images whose degradation process is complex and unknowable. The core components of our ELDRN are the parallel attention block and dilated neighborhood sampling. The former mines discriminative features at both spatial and channel levels, that is, the parameters for constructing local distributions, thus improving the robustness of the distributions to real degradation patterns. To deal with the fact that the reference range of the target sub-pixel is not exactly equal to its neighborhood, we explicitly increase the sampling density, i.e., fuse more sampled pixels to produce the target sub-pixel. Experiments conducted on the RealSR dataset illustrate that our ELDRN outperforms recent learning-based SISR methods and reconstructs visually pleasing high-quality images.
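The parallel attention block described above operates at both spatial and channel levels. The snippet below is a generic sketch of parallel channel and spatial attention branches (in the spirit of CBAM-style modules); layer sizes and the fusion rule are illustrative choices, not the exact ELDRN design.

```python
# Generic parallel channel + spatial attention sketch (not the exact ELDRN block).
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_branch = nn.Sequential(            # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_branch = nn.Sequential(            # spatial attention
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches see the same input in parallel and re-weight it.
        return x * self.channel_branch(x) + x * self.spatial_branch(x)

if __name__ == "__main__":
    block = ParallelAttention(32)
    print(block(torch.randn(1, 32, 24, 24)).shape)      # torch.Size([1, 32, 24, 24])
```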

Citations: 0
Learning depth-aware decomposition for single image dehazing
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-20 · DOI: 10.1016/j.cviu.2024.104069

Image dehazing under deficient data is an ill-posed and challenging problem. Most existing methods tackle this task by developing either CycleGAN-based hazy-to-clean translation or physics-based haze decomposition. However, geometric structure is often not effectively incorporated in their straightforward hazy-clean projection framework, which might incur inaccurate estimation in distant areas. In this paper, we rethink the image dehazing task and propose a depth-aware perception framework, DehazeDP, for robust haze decomposition on deficient data. Our DehazeDP is inspired by the Diffusion Probabilistic Model to form an end-to-end training pipeline that seamlessly combines hazy image generation with haze disentanglement. Specifically, in the forward phase, haze is added to a clean image step-by-step according to the depth distribution. Then, in the reverse phase, a unified U-Net is used to predict the haze and recover the clean image progressively. Extensive experiments on public datasets demonstrate that the proposed DehazeDP performs favorably against state-of-the-art approaches. We release the code and models at https://github.com/stallak/DehazeDP.
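The depth-dependent forward phase can be related to the standard atmospheric scattering model, sketched below; this is the common formulation used across the dehazing literature and is offered only as an illustration, not as the exact noise schedule of DehazeDP.

```python
# Atmospheric scattering model: I = J * t + A * (1 - t), with transmission
# t = exp(-beta * depth). The scattering coefficient, airlight, and inputs
# below are illustrative placeholders.
import numpy as np

def add_haze(clean: np.ndarray, depth: np.ndarray,
             beta: float = 1.0, airlight: float = 0.9) -> np.ndarray:
    t = np.exp(-beta * depth)[..., None]        # per-pixel transmission map
    return clean * t + airlight * (1.0 - t)     # farther pixels get more haze

if __name__ == "__main__":
    clean = np.random.rand(32, 32, 3)           # placeholder clean image
    depth = np.linspace(0.1, 3.0, 32 * 32).reshape(32, 32)
    hazy = add_haze(clean, depth)
    print(hazy.shape, float(hazy.min()), float(hazy.max()))
```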

Citations: 0
UAHOI: Uncertainty-aware robust interaction learning for HOI detection
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-20 · DOI: 10.1016/j.cviu.2024.104091

This paper focuses on Human–Object Interaction (HOI) detection, addressing the challenge of identifying and understanding the interactions between humans and objects within a given image or video frame. Spearheaded by the Detection Transformer (DETR), recent developments have led to significant improvements by replacing traditional region proposals with a set of learnable queries. However, despite the powerful representation capabilities provided by Transformers, existing HOI detection methods still yield low confidence levels when dealing with complex interactions and are prone to overlooking interactive actions. To address these issues, we propose a novel approach, UAHOI, Uncertainty-aware Robust Human–Object Interaction Learning, which explicitly estimates prediction uncertainty during the training process to refine both detection and interaction predictions. Our model not only predicts the HOI triplets but also quantifies the uncertainty of these predictions. Specifically, we model this uncertainty through the variance of predictions and incorporate it into the optimization objective, allowing the model to adaptively adjust its confidence threshold based on prediction variance. This integration helps mitigate the adverse effects of incorrect or ambiguous predictions that are common in traditional methods without any hand-designed components, serving as an automatic confidence threshold. Our method is flexible with respect to existing HOI detection methods and demonstrates improved accuracy. We evaluate UAHOI on two standard benchmarks in the field: V-COCO and HICO-DET, which represent challenging scenarios for HOI detection. Through extensive experiments, we demonstrate that UAHOI achieves significant improvements over existing state-of-the-art methods, enhancing both the accuracy and robustness of HOI detection.
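The abstract states that prediction variance is folded into the optimization objective. One common recipe for doing so, offered here purely as a hedged illustration (a heteroscedastic-uncertainty weighting in the style of Kendall and Gal, not necessarily UAHOI's actual loss), is sketched below.

```python
# Sketch: weight a per-sample loss by predicted (log-)variance so uncertain
# predictions are down-weighted, with a regularizer that penalizes claiming
# unbounded uncertainty. Purely illustrative.
import torch

def uncertainty_weighted_loss(per_sample_loss: torch.Tensor,
                              log_variance: torch.Tensor) -> torch.Tensor:
    # per_sample_loss and log_variance both have shape (N,).
    return (torch.exp(-log_variance) * per_sample_loss + log_variance).mean()

if __name__ == "__main__":
    loss = torch.tensor([0.2, 1.5, 0.7])
    log_var = torch.tensor([0.0, 1.0, -0.5])
    print(uncertainty_weighted_loss(loss, log_var))
```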

Citations: 0