
Latest Publications from the Journal of Visual Communication and Image Representation

Multimodal prompt-guided vision transformer for precise image manipulation localization
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104736
Yafang Xiao , Wei Jiang , Shihua Zhou , Bin Wang , Pengfei Wang , Pan Zheng
With the rise of generative AI and advanced image editing technologies, image manipulation localization has become more challenging. Existing methods often struggle with limited semantic understanding and insufficient spatial detail capture, especially in complex scenarios. To address these issues, we propose a novel multimodal text-guided framework for image manipulation localization. By fusing textual prompts with image features, our approach enhances the model’s ability to identify manipulated regions. We introduce a Multimodal Interaction Prompt Module (MIPM) that uses cross-modal attention mechanisms to align visual and textual information. Guided by multimodal prompts, our Vision Transformer-based model accurately localizes forged areas in images. Extensive experiments on public datasets, including CASIAv1 and Columbia, show that our method outperforms existing approaches. Specifically, on the CASIAv1 dataset, our approach achieves an F1 score of 0.734, surpassing the second-best method by 1.3%. These results demonstrate the effectiveness of our multimodal fusion strategy. The code is available at https://github.com/Makabaka613/MPG-ViT.
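For readers who want to prototype the idea, the text-guided fusion described above can be sketched as image tokens attending to prompt tokens; the class name, module layout, and dimensions below are illustrative placeholders, not the paper's actual MIPM.

```python
# Illustrative cross-modal prompt fusion (placeholder names and sizes; not the
# paper's MIPM implementation): image patch tokens attend to text prompt tokens.
import torch
import torch.nn as nn

class CrossModalPromptFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens:  (B, N_img, dim) ViT patch embeddings
        # text_tokens: (B, N_txt, dim) encoded textual prompt
        fused, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + fused)  # residual fusion of the two modalities

if __name__ == "__main__":
    module = CrossModalPromptFusion()
    img = torch.randn(2, 196, 256)   # 14 x 14 patches
    txt = torch.randn(2, 16, 256)    # 16 prompt tokens
    print(module(img, txt).shape)    # torch.Size([2, 196, 256])
```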
Citations: 0
SFNet: Hierarchical perception and adaptive test-time training for AI-generated military image detection
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104733
Minyang Li , Wenpeng Mu , Yifan Yuan , Shengyan Li , Qiang Xu
Existing general-purpose forgery detection techniques fall short in military scenarios because they lack military-specific priors about how real assets are designed, manufactured, and deployed. Authentic military platforms obey strict engineering and design standards, resulting in highly regular structural layouts and characteristic material textures, whereas AI-generated forgeries often exhibit subtle violations of these constraints. To address this critical gap, we introduce SentinelFakeNet (SFNet), a novel framework specifically designed for detecting AI-generated military images. SFNet features the Military Hierarchical Perception (MHP) Module, which extracts military-relevant hierarchical representations via Cross-Level Feature Fusion (CLFF) — a mechanism that intricately combines features from varying depths of the backbone. Furthermore, to ensure robustness and adaptability to diverse generative models, we propose the Military Adaptive Test-Time Training (MATTT) strategy, which incorporates Local Consistency Verification (LCV) and Multi-Scale Signature Analysis (MSSA) as specially designed tasks. To facilitate research in this domain, we also introduce MilForgery, the first large-scale military image forensic dataset comprising 800,000 authentic and synthetically generated military-related images. Extensive experiments demonstrate that our method achieves 95.80% average accuracy, representing state-of-the-art performance. Moreover, it exhibits superior generalization capabilities on public AIGC detection benchmarks, outperforming the leading baselines by +8.47% and +6.49% on GenImage and ForenSynths in average accuracy, respectively. Our code will be available on the author’s homepage.
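The cross-level fusion idea (combining features from several backbone depths) can be illustrated by a generic PyTorch sketch; the projection layers, channel counts, and fusion by concatenation are assumptions for illustration, not SFNet's actual CLFF design.

```python
# Illustrative cross-level feature fusion: project each backbone stage to a
# common width, upsample deeper maps to the shallow resolution, and mix them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256), out_ch=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.mix = nn.Conv2d(out_ch * len(channels), out_ch, 3, padding=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps, ordered shallow -> deep
        target = feats[0].shape[-2:]
        aligned = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        ]
        return self.mix(torch.cat(aligned, dim=1))

if __name__ == "__main__":
    feats = [torch.randn(1, 64, 56, 56),
             torch.randn(1, 128, 28, 28),
             torch.randn(1, 256, 14, 14)]
    print(CrossLevelFusion()(feats).shape)  # torch.Size([1, 128, 56, 56])
```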
Citations: 0
Exploring the transformer-based and diffusion-based models for single image deblurring
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104735
Seunghwan Park , Chaehun Shin , Jaihyun Lew , Sungroh Yoon
Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocus, motion, and others. Since a blurry image could originate from various sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advancements in deep learning have shifted the focus towards directly predicting sharp images using techniques such as convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs), and others. Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.
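For reference, the MAP-based blind deblurring mentioned above is usually formulated, in paper-agnostic notation, as a blur model plus a regularized inverse problem:

```latex
% Generic blind-deblurring model and MAP objective (notation is not tied to
% any single paper surveyed above).
\begin{align}
  y &= k * x + n, \\
  (\hat{x}, \hat{k}) &= \arg\min_{x,\,k} \; \lVert y - k * x \rVert_2^2
      + \lambda\,\rho(x) + \gamma\,\phi(k).
\end{align}
```

Here y is the blurry observation, x the latent sharp image, k the blur kernel, * denotes convolution, n is noise, and rho, phi are priors on the image and the kernel, respectively.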
Citations: 0
Unified global–local feature modeling via reverse patch scaling for image manipulation localization
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104731
Jingying Cai , Hang Cheng , Jiabin Chen , Haichou Wang , Meiqing Wang
Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.
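The reverse patch scaling idea can be sketched with a layer-dependent unfold schedule, where deeper feature maps are tiled with smaller, more overlapping patches; the specific patch sizes and strides below are assumptions for illustration, not the paper's configuration.

```python
# Illustrative reverse patch scaling: deeper feature maps are unfolded with
# smaller patches and larger overlap (stride < patch size).
import torch
import torch.nn as nn

# (patch_size, stride) from shallow to deep; overlap grows with depth.
SCHEDULE = [(8, 8), (6, 4), (4, 2)]

def extract_patches(feat, patch, stride):
    # feat: (B, C, H, W) -> (B, num_patches, C * patch * patch)
    return nn.Unfold(kernel_size=patch, stride=stride)(feat).transpose(1, 2)

if __name__ == "__main__":
    feats = [torch.randn(1, 64, 64, 64),    # shallow
             torch.randn(1, 128, 32, 32),
             torch.randn(1, 256, 16, 16)]   # deep
    for f, (p, s) in zip(feats, SCHEDULE):
        patches = extract_patches(f, p, s)
        print(tuple(f.shape[-2:]), "->", tuple(patches.shape))  # denser coverage at depth
```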
Citations: 0
Global–local dual-branch network with local feature enhancement for visual tracking
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104725
Yuanyun Wang, Lingtao Zhou, Zhuo An, Lei Sun, Min Hu, Jun Wang
Vision Transformers (ViTs) have been widely applied due to their excellent performance. Compared with CNN models, however, ViT models are more difficult to train and require more training samples because they cannot effectively utilize high-frequency local information. In this paper, we propose an efficient tracking framework based on global and local feature extraction together with an enhancement module. To address the high-frequency local information neglected by typical ViT-based trackers, we design an effective local branch that captures this information: it aggregates local information using shared weights and enhances the local features with optimized context-aware weights. Integrating the attention mechanism into both the global and local branches enables the tracker to perceive high-frequency local information and low-frequency global information simultaneously. Experimental comparisons show that the tracker achieves superior results, demonstrating its generalization ability and effectiveness. Code will be available at https://github.com/WangJun-CV/GLDTrack.
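As a rough illustration of a local branch of this kind, the sketch below aggregates local information with shared (depthwise convolution) weights and re-weights it with context-aware channel gates; this is a generic stand-in, not the tracker's actual architecture.

```python
# Generic local-enhancement branch: shared-weight local aggregation
# (depthwise conv) gated by context-aware channel weights.
import torch
import torch.nn as nn

class LocalEnhance(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # one shared filter per channel
        self.gate = nn.Sequential(                                  # context-aware weights
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x):
        # x: (B, dim, H, W) feature map, e.g. from the search region
        return x + self.gate(x) * self.local(x)

if __name__ == "__main__":
    x = torch.randn(2, 256, 16, 16)
    print(LocalEnhance()(x).shape)  # torch.Size([2, 256, 16, 16])
```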
Citations: 0
Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104729
Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun
Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for applications in 3D human modeling. However, precise recovery of hands is highly challenging due to the relatively small spatial proportion of hands, high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joints and depth cues from local hand regions. This module provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner. Compared with the traditional self-attention mechanism, this module significantly reduces the computational burden. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms SOTA methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight design make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.
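As a lightweight stand-in for the dual-branch pooling idea, the sketch below gates channels with average- and max-pooled descriptors that share one small MLP (CBAM-style); it is an illustration, not the paper's ADPA module.

```python
# CBAM-style dual-branch pooling attention: two pooled descriptors share an MLP
# and produce a channel-wise gate, avoiding full self-attention cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchPoolingAttention(nn.Module):
    def __init__(self, dim=256, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1))

    def forward(self, x):
        # x: (B, dim, H, W)
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * torch.sigmoid(avg + mx)   # channel-wise gating

if __name__ == "__main__":
    x = torch.randn(2, 256, 14, 14)
    print(DualBranchPoolingAttention()(x).shape)  # torch.Size([2, 256, 14, 14])
```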
Citations: 0
Global–local co-regularization network for facial action unit detection
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-21 | DOI: 10.1016/j.jvcir.2026.104728
Yumei Tan , Haiying Xia , Shuxiang Song
Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To address this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists of a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while mining region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to the pre-defined facial regions, and the IRM module then extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.
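The co-regularization constraint can be illustrated as an extra consistency term added to the per-branch detection losses; the MSE form and the weighting factor below are assumptions for illustration, not the paper's exact objective.

```python
# Sketch of a co-regularized loss: each branch predicts AU probabilities, and
# an additional term penalizes disagreement between the two branches.
import torch
import torch.nn.functional as F

def co_regularized_loss(global_logits, local_logits, labels, alpha=0.5):
    # global_logits, local_logits: (B, num_AUs); labels: (B, num_AUs) in {0, 1}
    bce_g = F.binary_cross_entropy_with_logits(global_logits, labels)
    bce_l = F.binary_cross_entropy_with_logits(local_logits, labels)
    consistency = F.mse_loss(torch.sigmoid(global_logits), torch.sigmoid(local_logits))
    return bce_g + bce_l + alpha * consistency

if __name__ == "__main__":
    g, l = torch.randn(4, 12), torch.randn(4, 12)
    y = torch.randint(0, 2, (4, 12)).float()
    print(co_regularized_loss(g, l, y).item())
```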
Citations: 0
eGoRG: GPU-accelerated depth estimation for immersive video applications based on graph cuts
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-19 | DOI: 10.1016/j.jvcir.2026.104727
Jaime Sancho , Manuel Villa , Miguel Chavarrias , Rubén Salvador , Eduardo Juarez , César Sanz
Immersive video is gaining relevance across various fields, but its integration into real applications remains limited due to the technical challenges of depth estimation. Generating accurate depth maps is essential for 3D rendering, yet high-quality algorithms can require hundreds of seconds to produce a single frame. While real-time depth estimation solutions exist — particularly monocular deep learning-based methods and active sensors such as time-of-flight or plenoptic cameras — their depth accuracy and multiview consistency are often insufficient for depth image-based rendering (DIBR) and immersive video applications. This highlights the persistent challenge of jointly achieving real-time performance and high-quality, correlated depth across views. This paper introduces eGoRG, a GPU-accelerated depth estimation algorithm based on MPEG DERS, which employs graph cuts to achieve high-quality results. eGoRG contributes a novel GPU-based graph cuts stage, integrating block-based push-relabel acceleration and a simplified alpha expansion method. These optimizations deliver quality comparable to leading graph-cut approaches while greatly improving speed. Evaluation on an MPEG multiview dataset and a static NeRF dataset demonstrates the algorithm’s effectiveness across different scenarios.
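For reference, graph-cut depth estimation of this kind typically minimizes a Markov random field energy over per-pixel depth labels; the notation below is generic, and eGoRG's exact data and smoothness terms may differ.

```latex
% Generic MRF energy minimized by graph-cut depth estimation
% (DERS-style pipelines); terms are illustrative, not eGoRG-specific.
\begin{equation}
  E(d) \;=\; \sum_{p \in \mathcal{P}} D_p\!\bigl(d_p\bigr)
        \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V\!\bigl(d_p, d_q\bigr).
\end{equation}
```

Here D_p scores the photo-consistency of assigning depth label d_p to pixel p across views, and V penalizes label differences between neighboring pixels; alpha expansion approximately minimizes E(d) through a sequence of binary min-cut problems, which is where the push-relabel stage is accelerated on the GPU.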
Citations: 0
MTPA: A multi-aspects perception assisted AIGV quality assessment model
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104721
Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang
With the development of Artificial Intelligence (AI) generation technology, AI-generated video (AIGV) has attracted much attention. Compared to visual perception in traditional video, AIGV poses unique challenges, such as visual consistency, text-to-video alignment, etc. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: text–video alignment score, visual spatial perceptual score, and visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to assess text-to-video alignment quality, and a semantic-aware module is applied to capture the visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial and visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model achieves state-of-the-art results on three public AIGV datasets, demonstrating its effectiveness.
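A minimal way to score text-video alignment is to average the cosine similarity between per-frame embeddings and the prompt embedding from a pretrained vision-language encoder; the function name and feature dimensions below are illustrative assumptions, not the paper's module.

```python
# Sketch of a text-video alignment score: normalized per-frame visual features
# are compared with the normalized prompt feature and averaged over frames.
import torch
import torch.nn.functional as F

def alignment_score(frame_embs, text_emb):
    # frame_embs: (T, D), one embedding per sampled frame; text_emb: (D,)
    frame_embs = F.normalize(frame_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (frame_embs @ text_emb).mean()  # mean per-frame cosine similarity

if __name__ == "__main__":
    frames = torch.randn(16, 512)  # e.g., CLIP-style 512-d frame features
    prompt = torch.randn(512)
    print(alignment_score(frames, prompt).item())
```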
Citations: 0
ATR-Net: Attention-based temporal-refinement network for efficient facial emotion recognition in human–robot interaction
IF 3.1 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104720
Sougatamoy Biswas , Harshavardhan Reddy Gajarla , Anup Nandy , Asim Kumar Naskar
Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.
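As a rough sketch of an attention-augmented convolution, the layer below concatenates a standard convolution branch with a self-attention branch over spatial positions; the channel split and its placement on an EfficientNet-B0 stage are illustrative assumptions, not ATR-Net's exact AAC layer.

```python
# Attention-augmented convolution sketch: a conv branch and a spatial
# self-attention branch are computed in parallel and concatenated on channels.
import torch
import torch.nn as nn

class AttnAugmentedConv(nn.Module):
    def __init__(self, in_ch, out_ch, attn_ch=32, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - attn_ch, 3, padding=1)
        self.to_tokens = nn.Conv2d(in_ch, attn_ch, 1)
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)                                # (B, out_ch - attn_ch, H, W)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_ch)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        return torch.cat([conv_out, attn_out], dim=1)          # (B, out_ch, H, W)

if __name__ == "__main__":
    layer = AttnAugmentedConv(in_ch=112, out_ch=128)  # e.g., on a mid-level EfficientNet-B0 stage
    x = torch.randn(2, 112, 14, 14)
    print(layer(x).shape)  # torch.Size([2, 128, 14, 14])
```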
Citations: 0