
Journal of Visual Communication and Image Representation: Latest Publications

Exploring the transformer-based and diffusion-based models for single image deblurring
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104735
Seunghwan Park, Chaehun Shin, Jaihyun Lew, Sungroh Yoon
Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocus and motion. Since a blurry image could originate from many different sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advances in deep learning have shifted the focus towards directly predicting sharp images with architectures such as convolutional neural networks (CNNs), generative adversarial networks (GANs), and recurrent neural networks (RNNs). Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.
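As context for the MAP discussion above, the sketch below illustrates the classical blur formation model, y = k ⊛ x + n, and a MAP-style energy (data fidelity plus simple priors). It is a minimal illustration under an assumed kernel size, noise level, and prior, not code from the paper.

```python
# Minimal sketch of the classical blur model behind MAP-based deblurring:
#   y = k * x + n        (blurry = kernel convolved with sharp + noise)
# MAP estimation seeks  argmax_{x,k} p(y | x, k) p(x) p(k),
# i.e. a data-fidelity term plus priors on the sharp image and the kernel.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

x = rng.random((64, 64))                 # stand-in "sharp" image
k = np.ones((5, 5)) / 25.0               # assumed 5x5 box blur kernel
n = 0.01 * rng.standard_normal((64, 64)) # assumed Gaussian noise level

y = convolve2d(x, k, mode="same", boundary="symm") + n  # blurry observation

def map_energy(x_hat, y, k, lam=0.01):
    """Data fidelity + a simple gradient (total-variation-like) prior."""
    residual = convolve2d(x_hat, k, mode="same", boundary="symm") - y
    data_term = np.sum(residual ** 2)
    prior = np.sum(np.abs(np.diff(x_hat, axis=0))) + np.sum(np.abs(np.diff(x_hat, axis=1)))
    return data_term + lam * prior

print("energy of the blurry image itself:", map_energy(y, y, k))
print("energy of the true sharp image:   ", map_energy(x, y, k))
```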
Citations: 0
Unified global–local feature modeling via reverse patch scaling for image manipulation localization
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104731
Jingying Cai, Hang Cheng, Jiabin Chen, Haichou Wang, Meiqing Wang
Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.
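To make the reverse patch scaling strategy concrete, here is a small hypothetical sketch: deeper feature maps are assigned smaller patch sizes with larger overlaps, and the patches are extracted densely with torch.nn.functional.unfold. The specific patch sizes, strides, and feature shapes are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch of a reverse patch scaling schedule: deeper layers use
# smaller patches with more overlap, so local modeling densifies as semantic
# abstraction increases. Patch sizes / strides are illustrative assumptions.
import torch
import torch.nn.functional as F

# (patch_size, stride) per layer, shallow -> deep; stride < patch_size => overlap
schedule = [(16, 16), (8, 6), (4, 2)]

feats = [
    torch.randn(1, 32, 64, 64),   # shallow feature map
    torch.randn(1, 64, 32, 32),   # middle feature map
    torch.randn(1, 128, 16, 16),  # deep feature map
]

for (patch, stride), f in zip(schedule, feats):
    # unfold -> (B, C * patch * patch, num_patches)
    patches = F.unfold(f, kernel_size=patch, stride=stride)
    print(f"feature {tuple(f.shape)}: patch={patch}, stride={stride}, "
          f"patches={patches.shape[-1]}")
```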
Citations: 0
Global–local dual-branch network with local feature enhancement for visual tracking
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104725
Yuanyun Wang, Lingtao Zhou, Zhuo An, Lei Sun, Min Hu, Jun Wang
Vision Transformers (ViTs) have been widely applied due to their excellent performance. Compared with CNN models, ViT models are more difficult to train and require more training samples because they cannot effectively utilize high-frequency local information. In this paper, we propose an efficient tracking framework based on global and local feature extraction together with a local feature enhancement module. To capture the high-frequency local information neglected by typical ViT-based trackers, we design an effective local branch architecture. This local branch aggregates local information using shared weights and enhances the local features with optimized context-aware weights. Integrating the attention mechanism across the global and local branches enables the tracker to perceive high-frequency local information and low-frequency global information simultaneously. Experimental comparisons show that the tracker achieves superior results, demonstrating its generalization ability and effectiveness. Code will be available at https://github.com/WangJun-CV/GLDTrack.
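The following sketch illustrates one plausible form of such a local branch: a shared-weight depthwise convolution aggregates high-frequency local information, and context-aware channel weights derived from global pooling rescale it before it is added back to the ViT feature map. This is an assumption-laden illustration, not the paper's implementation.

```python
# Hypothetical sketch of a local enhancement branch for a ViT-style tracker:
# depthwise convolution (weights shared across spatial positions) aggregates
# local information, and context-aware channel weights from global pooling
# rescale it before residual fusion. Not the paper's implementation.
import torch
import torch.nn as nn

class LocalEnhanceBranch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.local_agg = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise aggregation
        self.context = nn.Sequential(          # context-aware channel weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W) feature map
        local = self.local_agg(x)
        return x + local * self.context(x)     # enhanced local features

tokens = torch.randn(2, 196, 256)              # ViT tokens: (B, N, C), 14x14 grid
fmap = tokens.transpose(1, 2).reshape(2, 256, 14, 14)
enhanced = LocalEnhanceBranch(256)(fmap)
print(enhanced.shape)                          # torch.Size([2, 256, 14, 14])
```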
Citations: 0
Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104729
Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun
Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for applications in 3D human modeling. However, precise recovery of hands is highly challenging due to the relatively small spatial proportion of hands, their high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joints and depth cues from local hand regions. This module provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner. Compared with the traditional self-attention mechanism, this module significantly reduces the computational burden. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms SOTA methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight design make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.
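To illustrate the flavor of a lightweight dual-branch pooling attention, the sketch below mixes a global-average-pooled channel attention with a local-window-pooled spatial/channel attention through a learned gate. The module structure and dimensions are assumptions for illustration, not the actual ADPA design.

```python
# Hypothetical sketch of a lightweight dual-branch pooling attention in the
# spirit of ADPA: one branch summarizes global context with global average
# pooling, the other captures local fine-grained structure with small-window
# pooling; a learned gate adaptively mixes them. Not the paper's exact module.
import torch
import torch.nn as nn

class DualBranchPoolingAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.global_fc = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.local_pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.local_fc = nn.Conv2d(dim, dim, 1)
        self.gate = nn.Parameter(torch.tensor(0.5))   # adaptive mixing weight

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        g = self.global_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # global channel attention
        l = torch.sigmoid(self.local_fc(self.local_pool(x)))     # local attention map
        return x * (self.gate * g + (1 - self.gate) * l)

x = torch.randn(2, 64, 32, 32)
print(DualBranchPoolingAttention(64)(x).shape)         # torch.Size([2, 64, 32, 32])
```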
Citations: 0
Global–local co-regularization network for facial action unit detection
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-21 | DOI: 10.1016/j.jvcir.2026.104728
Yumei Tan, Haiying Xia, Shuxiang Song
Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To address this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists of a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while extracting region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to pre-defined facial regions, and the IRM module then extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.
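The co-regularization idea can be summarized as a consistency term between the two branches' AU predictions added to their supervised losses, as in the hedged sketch below; the binary cross-entropy supervision, MSE consistency term, and weighting lambda are illustrative assumptions rather than the paper's exact objective.

```python
# Minimal sketch of a global-local co-regularization objective for multi-label
# AU detection: each branch has its own supervised (BCE) loss, plus a
# consistency term pulling their predictions together. The loss weighting
# lambda is an illustrative assumption, not the paper's value.
import torch
import torch.nn.functional as F

def co_regularized_loss(global_logits, local_logits, labels, lam=0.1):
    sup_global = F.binary_cross_entropy_with_logits(global_logits, labels)
    sup_local = F.binary_cross_entropy_with_logits(local_logits, labels)
    consistency = F.mse_loss(torch.sigmoid(global_logits),
                             torch.sigmoid(local_logits))
    return sup_global + sup_local + lam * consistency

num_aus = 12
labels = torch.randint(0, 2, (8, num_aus)).float()   # 8 faces, 12 AUs
g = torch.randn(8, num_aus)                          # global-branch logits
l = torch.randn(8, num_aus)                          # local-branch logits
print(co_regularized_loss(g, l, labels).item())
```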
Citations: 0
eGoRG: GPU-accelerated depth estimation for immersive video applications based on graph cuts
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-19 | DOI: 10.1016/j.jvcir.2026.104727
Jaime Sancho, Manuel Villa, Miguel Chavarrias, Rubén Salvador, Eduardo Juarez, César Sanz
Immersive video is gaining relevance across various fields, but its integration into real applications remains limited due to the technical challenges of depth estimation. Generating accurate depth maps is essential for 3D rendering, yet high-quality algorithms can require hundreds of seconds to produce a single frame. While real-time depth estimation solutions exist — particularly monocular deep learning-based methods and active sensors such as time-of-flight or plenoptic cameras — their depth accuracy and multiview consistency are often insufficient for depth image-based rendering (DIBR) and immersive video applications. This highlights the persistent challenge of jointly achieving real-time performance and high-quality, correlated depth across views. This paper introduces eGoRG, a GPU-accelerated depth estimation algorithm based on MPEG DERS, which employs graph cuts to achieve high-quality results. eGoRG contributes a novel GPU-based graph cuts stage, integrating block-based push-relabel acceleration and a simplified alpha expansion method. These optimizations deliver quality comparable to leading graph-cut approaches while greatly improving speed. Evaluation on an MPEG multiview dataset and a static NeRF dataset demonstrates the algorithm’s effectiveness across different scenarios.
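As background for the graph-cut stage, the sketch below writes out the discrete labeling energy such pipelines minimize (per-pixel data cost over candidate depth labels plus a Potts smoothness term) and an alpha-expansion sweep. Note that solve_expansion_move is a hypothetical greedy placeholder for the binary min-cut (push-relabel) step described in the paper, and all costs are synthetic.

```python
# Sketch of the discrete energy typically minimized by graph-cut depth
# estimation (as in DERS-style pipelines): a per-pixel data cost over candidate
# depth labels plus a pairwise smoothness term. The alpha-expansion loop below
# only evaluates the energy; `solve_expansion_move` is a hypothetical stand-in
# for the (GPU push-relabel) min-cut step described in the paper.
import numpy as np

H, W, L = 24, 32, 8                      # image size and number of depth labels
rng = np.random.default_rng(0)
data_cost = rng.random((H, W, L))        # e.g. photo-consistency cost per label
SMOOTH = 0.2                             # Potts smoothness weight (assumed)

def energy(labels):
    e = data_cost[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    e += SMOOTH * np.count_nonzero(labels[:, 1:] != labels[:, :-1])   # horizontal
    e += SMOOTH * np.count_nonzero(labels[1:, :] != labels[:-1, :])   # vertical
    return e

def solve_expansion_move(labels, alpha):
    """Hypothetical placeholder: a real implementation solves a binary min-cut
    deciding, per pixel, whether to keep its label or switch to `alpha`."""
    proposal = labels.copy()
    switch = data_cost[..., alpha] < data_cost[np.arange(H)[:, None],
                                               np.arange(W)[None, :], labels]
    proposal[switch] = alpha              # greedy stand-in for the graph cut
    return proposal

labels = np.argmin(data_cost, axis=2)     # initial labeling
for alpha in range(L):                    # one sweep of alpha expansions
    candidate = solve_expansion_move(labels, alpha)
    if energy(candidate) < energy(labels):
        labels = candidate
print("final energy:", energy(labels))
```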
Citations: 0
MTPA: A multi-aspects perception assisted AIGV quality assessment model
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104721
Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang
With the development of Artificial Intelligence (AI) generation technology, AI-generated video (AIGV) has attracted much attention. Compared with visual perception in traditional video, AIGV poses unique challenges, such as visual consistency and text-to-video alignment. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: a text–video alignment score, a visual spatial perceptual score, and a visual temporal perceptual score. Specifically, a pre-trained vision–language module is adopted to assess text-to-video alignment quality, and a semantic-aware module is applied to capture visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial and visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model achieves state-of-the-art results on three public AIGV datasets, proving its effectiveness.
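A hedged sketch of the three-aspect fusion idea follows: a text–video alignment score from cosine similarity between text and averaged frame embeddings, plus spatial and temporal perceptual scores, regressed to one quality value. The feature dimensions, heads, and linear fusion are illustrative assumptions, not the MTPA architecture.

```python
# Hypothetical sketch of the three-aspect fusion idea: a text-video alignment
# score (cosine similarity between text and averaged frame embeddings), a
# spatial perceptual score, and a temporal perceptual score are concatenated
# and regressed to a single quality value. Dimensions and the fusion head are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeAspectFusion(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.spatial_head = nn.Linear(feat_dim, 1)
        self.temporal_head = nn.Linear(feat_dim, 1)
        self.regressor = nn.Linear(3, 1)            # fuse the three aspect scores

    def forward(self, text_emb, frame_embs, spatial_feat, temporal_feat):
        # text_emb: (B, D); frame_embs: (B, T, D)
        align = F.cosine_similarity(text_emb, frame_embs.mean(dim=1), dim=-1)  # (B,)
        spatial = self.spatial_head(spatial_feat).squeeze(-1)                  # (B,)
        temporal = self.temporal_head(temporal_feat).squeeze(-1)               # (B,)
        scores = torch.stack([align, spatial, temporal], dim=-1)               # (B, 3)
        return self.regressor(scores).squeeze(-1)                              # (B,) quality

model = ThreeAspectFusion()
q = model(torch.randn(4, 512), torch.randn(4, 16, 512),
          torch.randn(4, 512), torch.randn(4, 512))
print(q.shape)   # torch.Size([4])
```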
Citations: 0
ATR-Net: Attention-based temporal-refinement network for efficient facial emotion recognition in human–robot interaction
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104720
Sougatamoy Biswas, Harshavardhan Reddy Gajarla, Anup Nandy, Asim Kumar Naskar
Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.
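The sketch below gives one possible reading of a pyramid channel-gated attention followed by a spatial gate, applied to an intermediate EfficientNet-B0-sized feature map; the pooling scales, gating layout, and channel count are assumptions for illustration and do not reproduce ATR-Net's module.

```python
# Hypothetical sketch of a channel-gated attention driven by a pyramid of
# pooling scales, followed by a simple spatial gate. Scales and dimensions are
# illustrative assumptions, not ATR-Net's exact design.
import torch
import torch.nn as nn

class PyramidChannelGate(nn.Module):
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in scales])
        self.fc = nn.Sequential(
            nn.Linear(dim * sum(s * s for s in scales), dim), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(dim, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ctx = torch.cat([p(x).flatten(1) for p in self.pools], dim=1)  # pyramid context
        x = x * self.fc(ctx).view(b, c, 1, 1)               # channel gating
        return x * self.spatial(x)                          # spatial gating

feat = torch.randn(2, 112, 14, 14)   # assumed intermediate EfficientNet-B0-sized map
print(PyramidChannelGate(112)(feat).shape)                  # torch.Size([2, 112, 14, 14])
```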
Citations: 0
Human-in-the-loop dual-branch architecture for image super-resolution
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104726
Suraj Neelakantan, Martin Längkvist, Amy Loutfi
Single-image super-resolution aims to recover high-frequency detail from a single low-resolution image, but practical applications often require balancing distortion against perceptual quality. Existing methods typically produce a single fixed reconstruction and offer limited test-time control over this trade-off. This paper presents DR-SCAN, a dual-branch deep residual network for single-image super-resolution in which, during test-time inference, weights can be assigned to either branch to dynamically steer their respective contributions to the reconstructed output. An interactive interface enables users to re-weight the shallow and deep branches at inference or run a one-click LPIPS search to navigate the distortion–perception trade-off without retraining the model. Ablation experiments confirm that both the second branch and the channel–spatial attention used within the residual blocks are essential for better reconstruction, while the interactive tuning routine demonstrates the practical value of post-hoc branch fusion.
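The test-time control can be pictured as a convex combination of the two branch outputs plus a grid search over the fusion weight, as in the sketch below. The perceptual_metric argument is a placeholder (in practice an LPIPS implementation would be plugged in, and a reference image is assumed to be available for the search); this is not DR-SCAN's code.

```python
# Minimal sketch of test-time branch re-weighting: the final SR output is a
# convex combination of the shallow- and deep-branch reconstructions, and a
# small grid search picks the weight that minimizes a perceptual metric
# against a reference. `perceptual_metric` is a placeholder callable.
import torch

def fuse(shallow_out, deep_out, w):
    """Weighted fusion of the two branch outputs, w in [0, 1]."""
    return w * deep_out + (1.0 - w) * shallow_out

def search_weight(shallow_out, deep_out, reference, perceptual_metric, steps=11):
    """Grid-search the fusion weight that minimizes the given metric."""
    best_w, best_score = 0.0, float("inf")
    for w in torch.linspace(0.0, 1.0, steps):
        score = perceptual_metric(fuse(shallow_out, deep_out, w), reference)
        if score < best_score:
            best_w, best_score = float(w), float(score)
    return best_w, best_score

# Toy usage with an L1 stand-in for the perceptual metric.
shallow = torch.rand(1, 3, 64, 64)
deep = torch.rand(1, 3, 64, 64)
ref = 0.3 * shallow + 0.7 * deep
w, score = search_weight(shallow, deep, ref,
                         lambda a, b: torch.mean(torch.abs(a - b)))
print(f"best weight ~ {w:.1f}, metric {score:.4f}")   # expect weight near 0.7
```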
Citations: 0
Pedestrian trajectory prediction using multi-cue transformer
IF 3.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-15 | DOI: 10.1016/j.jvcir.2026.104723
Yanlong Tian, Rui Zhai, Xiaoting Fan, Qi Xue, Zhong Zhang, Xinshan Zhu
Pedestrian trajectory prediction is a challenging problem because future trajectories are influenced by the surrounding environment and constrained by common-sense rules. Existing trajectory prediction methods typically consider only one kind of cue, i.e., a social-aware, environment-aware, or goal-conditioned cue, to model interactions with the trajectory information, which results in insufficient interaction modeling. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) for pedestrian trajectory prediction, in which we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between pedestrians' trajectory information and the three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, in order to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss), which gradually increases the weight of the coordinate difference between the predicted goal and the ground-truth goal as the time step increases. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.
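A hedged sketch of a gradual goal-guided loss is shown below: the usual per-step displacement error plus a goal-distance term whose weight grows linearly with the prediction step. The linear schedule and the use of the last ground-truth point as the goal are assumptions, not the exact GGLoss.

```python
# Illustrative sketch of a gradual goal-guided loss: the usual per-step
# displacement loss plus a goal term whose weight grows as the time step
# increases, so late predictions are pulled more strongly toward the goal.
# The linear weight schedule is an assumption, not the paper's exact GGLoss.
import torch

def gradual_goal_guided_loss(pred, gt, goal_weight_max=1.0):
    """pred, gt: (B, T, 2) future trajectories; the goal is the last gt point."""
    B, T, _ = pred.shape
    step_err = torch.norm(pred - gt, dim=-1)                   # (B, T) per-step L2
    goal = gt[:, -1:, :]                                        # (B, 1, 2) ground-truth goal
    goal_err = torch.norm(pred - goal, dim=-1)                  # (B, T) distance to goal
    w = goal_weight_max * torch.arange(1, T + 1,
                                       device=pred.device,
                                       dtype=pred.dtype) / T    # weight grows with t
    return (step_err + w * goal_err).mean()

pred = torch.randn(8, 12, 2)     # 8 pedestrians, 12 predicted steps, (x, y)
gt = torch.randn(8, 12, 2)
print(gradual_goal_guided_loss(pred, gt).item())
```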
Citations: 0