Multimodal prompt-guided vision transformer for precise image manipulation localization
Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104736
Yafang Xiao, Wei Jiang, Shihua Zhou, Bin Wang, Pengfei Wang, Pan Zheng
With the rise of generative AI and advanced image editing technologies, image manipulation localization has become more challenging. Existing methods often struggle with limited semantic understanding and insufficient spatial detail capture, especially in complex scenarios. To address these issues, we propose a novel multimodal text-guided framework for image manipulation localization. By fusing textual prompts with image features, our approach enhances the model’s ability to identify manipulated regions. We introduce a Multimodal Interaction Prompt Module (MIPM) that uses cross-modal attention mechanisms to align visual and textual information. Guided by multimodal prompts, our Vision Transformer-based model accurately localizes forged areas in images. Extensive experiments on public datasets, including CASIAv1 and Columbia, show that our method outperforms existing approaches. Specifically, on the CASIAv1 dataset, our approach achieves an F1 score of 0.734, surpassing the second-best method by 1.3%. These results demonstrate the effectiveness of our multimodal fusion strategy. The code is available at https://github.com/Makabaka613/MPG-ViT.
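To make the cross-modal fusion idea concrete, here is a minimal sketch of a prompt-guided cross-attention block in which visual patch tokens attend to encoded text tokens; the class name, dimensions, and residual fusion layout are illustrative assumptions, not the paper's MIPM implementation.

```python
import torch
import torch.nn as nn

class CrossModalPromptFusion(nn.Module):
    """Illustrative cross-modal attention block: image patch tokens attend to
    text prompt tokens, and the attended text context is fused back into the
    visual features. Dimensions and layout are assumptions, not the paper's."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, vis_tokens, text_tokens):
        # vis_tokens: (B, N_patches, dim), text_tokens: (B, N_words, dim)
        ctx, _ = self.attn(query=vis_tokens, key=text_tokens, value=text_tokens)
        fused = self.proj(torch.cat([vis_tokens, ctx], dim=-1))
        return self.norm(vis_tokens + fused)  # residual fusion of text-guided context

if __name__ == "__main__":
    fusion = CrossModalPromptFusion()
    v = torch.randn(2, 196, 256)   # ViT patch tokens
    t = torch.randn(2, 16, 256)    # encoded text prompt
    print(fusion(v, t).shape)      # torch.Size([2, 196, 256])
```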
{"title":"Multimodal prompt-guided vision transformer for precise image manipulation localization","authors":"Yafang Xiao , Wei Jiang , Shihua Zhou , Bin Wang , Pengfei Wang , Pan Zheng","doi":"10.1016/j.jvcir.2026.104736","DOIUrl":"10.1016/j.jvcir.2026.104736","url":null,"abstract":"<div><div>With the rise of generative AI and advanced image editing technologies, image manipulation localization has become more challenging. Existing methods often struggle with limited semantic understanding and insufficient spatial detail capture, especially in complex scenarios. To address these issues, we propose a novel multimodal text-guided framework for image manipulation localization. By fusing textual prompts with image features, our approach enhances the model’s ability to identify manipulated regions. We introduce a Multimodal Interaction Prompt Module (MIPM) that uses cross-modal attention mechanisms to align visual and textual information. Guided by multimodal prompts, our Vision Transformer-based model accurately localizes forged areas in images. Extensive experiments on public datasets, including CASIAv1 and Columbia, show that our method outperforms existing approaches. Specifically, on the CASIAv1 dataset, our approach achieves an F1 score of 0.734, surpassing the second-best method by 1.3%. These results demonstrate the effectiveness of our multimodal fusion strategy. The code is available at <span><span>https://github.com/Makabaka613/MPG-ViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104736"},"PeriodicalIF":3.1,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SFNet: Hierarchical perception and adaptive test-time training for AI-generated military image detection
Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104733
Minyang Li, Wenpeng Mu, Yifan Yuan, Shengyan Li, Qiang Xu
Existing general-purpose forgery detection techniques fall short in military scenarios because they lack military-specific priors about how real assets are designed, manufactured, and deployed. Authentic military platforms obey strict engineering and design standards, resulting in highly regular structural layouts and characteristic material textures, whereas AI-generated forgeries often exhibit subtle violations of these constraints. To address this critical gap, we introduce SentinelFakeNet (SFNet), a novel framework specifically designed for detecting AI-generated military images. SFNet features the Military Hierarchical Perception (MHP) Module, which extracts military-relevant hierarchical representations via Cross-Level Feature Fusion (CLFF) — a mechanism that intricately combines features from varying depths of the backbone. Furthermore, to ensure robustness and adaptability to diverse generative models, we propose the Military Adaptive Test-Time Training (MATTT) strategy, which incorporates Local Consistency Verification (LCV) and Multi-Scale Signature Analysis (MSSA) as specially designed tasks. To facilitate research in this domain, we also introduce MilForgery, the first large-scale military image forensic dataset comprising 800,000 authentic and synthetically generated military-related images. Extensive experiments demonstrate that our method achieves 95.80% average accuracy, representing state-of-the-art performance. Moreover, it exhibits superior generalization capabilities on public AIGC detection benchmarks, outperforming the leading baselines by +8.47% and +6.49% on GenImage and ForenSynths in average accuracy, respectively. Our code will be available on the author’s homepage.
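As a rough illustration of cross-level feature fusion, the sketch below projects feature maps from several backbone depths to a common width and resolution before merging them; the channel sizes and merge operator are placeholders, not SFNet's CLFF design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFeatureFusion(nn.Module):
    """Illustrative cross-level fusion: project feature maps taken from several
    backbone depths to a shared channel width, resize them to the shallowest
    map's resolution, and merge them with a 1x1 convolution. Channel sizes are
    placeholders, not SFNet's actual configuration."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.merge = nn.Conv2d(out_channels * len(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps, ordered shallow -> deep
        target = feats[0].shape[-2:]
        resized = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
                   for p, f in zip(self.proj, feats)]
        return self.merge(torch.cat(resized, dim=1))

if __name__ == "__main__":
    f = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
    print(CrossLevelFeatureFusion()(f).shape)   # torch.Size([1, 256, 64, 64])
```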
{"title":"SFNet: Hierarchical perception and adaptive test-time training for AI-generated military image detection","authors":"Minyang Li , Wenpeng Mu , Yifan Yuan , Shengyan Li , Qiang Xu","doi":"10.1016/j.jvcir.2026.104733","DOIUrl":"10.1016/j.jvcir.2026.104733","url":null,"abstract":"<div><div>Existing general-purpose forgery detection techniques fall short in military scenarios because they lack military-specific priors about how real assets are designed, manufactured, and deployed. Authentic military platforms obey strict engineering and design standards, resulting in highly regular structural layouts and characteristic material textures, whereas AI-generated forgeries often exhibit subtle violations of these constraints. To address this critical gap, we introduce SentinelFakeNet (SFNet), a novel framework specifically designed for detecting AI-generated military images. SFNet features the Military Hierarchical Perception (MHP) Module, which extracts military-relevant hierarchical representations via Cross-Level Feature Fusion (CLFF) — a mechanism that intricately combines features from varying depths of the backbone. Furthermore, to ensure robustness and adaptability to diverse generative models, we propose the Military Adaptive Test-Time Training (MATTT) strategy, which incorporates Local Consistency Verification (LCV) and Multi-Scale Signature Analysis (MSSA) as specially designed tasks. To facilitate research in this domain, we also introduce MilForgery, the first large-scale military image forensic dataset comprising 800,000 authentic and synthetically generated military-related images. Extensive experiments demonstrate that our method achieves 95.80% average accuracy, representing state-of-the-art performance. Moreover, it exhibits superior generalization capabilities on public AIGC detection benchmarks, outperforming the leading baselines by +8.47% and +6.49% on GenImage and ForenSynths in average accuracy, respectively. Our code will be available on the author’s homepage.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104733"},"PeriodicalIF":3.1,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the transformer-based and diffusion-based models for single image deblurring
Pub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104735
Seunghwan Park, Chaehun Shin, Jaihyun Lew, Sungroh Yoon
Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocus, motion, and others. Since a blurry image can originate from many different sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advances in deep learning have shifted the focus towards directly predicting sharp images with architectures such as convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs), and others. Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.
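For reference, the classical MAP formulation mentioned above can be written as follows (textbook notation; the argmin form assumes Gaussian noise and exponential-family priors, and is not specific to this survey):

```latex
% Classical blind-deblurring model and MAP objective (textbook notation;
% not a formulation specific to this survey):
\[
y = k \ast x + n,
\qquad
(\hat{x}, \hat{k})
  = \arg\max_{x,k}\; p(x, k \mid y)
  = \arg\min_{x,k}\; \frac{1}{2\sigma^{2}}\,\lVert y - k \ast x \rVert_2^{2}
    + \lambda_x\,\phi(x) + \lambda_k\,\psi(k)
\]
% The argmin form assumes Gaussian noise of variance \sigma^2 and priors
% p(x) \propto e^{-\lambda_x \phi(x)}, p(k) \propto e^{-\lambda_k \psi(k)},
% i.e. a data-fidelity term plus regularizers on the sharp image and kernel.
```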
{"title":"Exploring the transformer-based and diffusion-based models for single image deblurring","authors":"Seunghwan Park , Chaehun Shin , Jaihyun Lew , Sungroh Yoon","doi":"10.1016/j.jvcir.2026.104735","DOIUrl":"10.1016/j.jvcir.2026.104735","url":null,"abstract":"<div><div>Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocusing, motions, and others. Since a blurry image could be originated from various sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advancements in deep learning have shifted the focus towards directly predicting sharp images by leveraging deep learning techniques such as convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs), and others. Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104735"},"PeriodicalIF":3.1,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unified global–local feature modeling via reverse patch scaling for image manipulation localization
Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104731
Jingying Cai, Hang Cheng, Jiabin Chen, Haichou Wang, Meiqing Wang
Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.
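The reverse patch scaling idea can be sketched as per-layer patch extraction in which deeper maps receive smaller patches with larger overlap; the concrete patch sizes and strides below are assumptions for illustration only, not the paper's settings.

```python
import torch
import torch.nn as nn

def reverse_patch_scaling(feats, patch_sizes=(8, 4, 2), strides=(8, 2, 1)):
    """Illustrative reverse patch scaling: deeper feature maps get smaller
    patches with larger overlap (stride < patch size), so local modeling
    becomes denser as semantic abstraction increases."""
    patches = []
    for f, p, s in zip(feats, patch_sizes, strides):
        unfold = nn.Unfold(kernel_size=p, stride=s)   # extracts overlapping patches
        tokens = unfold(f).transpose(1, 2)            # (B, num_patches, C * p * p)
        patches.append(tokens)
    return patches

if __name__ == "__main__":
    # three hierarchical maps, ordered shallow -> deep
    feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16)]
    for t in reverse_patch_scaling(feats):
        print(t.shape)
```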
{"title":"Unified global–local feature modeling via reverse patch scaling for image manipulation localization","authors":"Jingying Cai , Hang Cheng , Jiabin Chen , Haichou Wang , Meiqing Wang","doi":"10.1016/j.jvcir.2026.104731","DOIUrl":"10.1016/j.jvcir.2026.104731","url":null,"abstract":"<div><div>Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104731"},"PeriodicalIF":3.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global–local dual-branch network with local feature enhancement for visual tracking
Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104725
Yuanyun Wang, Lingtao Zhou, Zhuo An, Lei Sun, Min Hu, Jun Wang
Vision Transformers (ViT) have been widely applied due to their excellent performance. Compared with CNN models, however, ViT models are more difficult to train and require more training samples because they cannot effectively exploit high-frequency local information. In this paper, we propose an efficient tracking framework based on global and local feature extraction with an enhancement module. To capture the high-frequency local information neglected by general ViT-based trackers, we design an effective local branch that aggregates local information using shared weights and enhances the local features with optimized context-aware weights. Integrating the attention mechanism across the global and local branches enables the tracker to perceive both high-frequency local information and low-frequency global information simultaneously. Experimental comparisons show that the tracker achieves superior results, demonstrating its generalization ability and effectiveness. Code will be available at https://github.com/WangJun-CV/GLDTrack.
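A minimal sketch of the general local-branch idea, assuming a depthwise convolution (weights shared across spatial positions) for local aggregation and a pooled, context-aware gate for enhancement; names and layer choices are illustrative, not GLDTrack's actual architecture.

```python
import torch
import torch.nn as nn

class LocalEnhancementBranch(nn.Module):
    """Illustrative local branch: a depthwise convolution (weights shared
    across positions) aggregates high-frequency local information, and a
    context-aware gate re-weights the result before the residual add."""
    def __init__(self, dim=256):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(dim, dim, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, x):
        local = self.local(x)             # local aggregation with shared weights
        return x + self.gate(x) * local   # context-aware enhancement

if __name__ == "__main__":
    print(LocalEnhancementBranch()(torch.randn(1, 256, 16, 16)).shape)
```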
{"title":"Global–local dual-branch network with local feature enhancement for visual tracking","authors":"Yuanyun Wang, Lingtao Zhou, Zhuo An, Lei Sun, Min Hu, Jun Wang","doi":"10.1016/j.jvcir.2026.104725","DOIUrl":"10.1016/j.jvcir.2026.104725","url":null,"abstract":"<div><div>Vision Transformers (ViT) have been widely applied due to their excellent performance. Compared with CNN models, ViT models are more difficult to train and require more training samples because they cannot effectively utilize high-frequency local information. In this paper we propose an efficient tracking framework based on global and local feature extraction, and an enhancement module. To address the high-frequency local information neglected by general ViT-based trackers, we design an effective local branch architecture to capture the information. For local feature extraction and enhancement, we design a local branch, which aggregates local information by using shared weights; it utilizes the optimized context-aware weights to enhance the local features. The integration of the attention mechanism in the global and local branches enables the tracker to perceive both high-frequency local information and low-frequency global information simultaneously. Experimental comparisons show that the tracker achieves superior results and proves the generalization ability and effectiveness. Code will be available at <span><span>https://github.com/WangJun-CV/GLDTrack</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104725"},"PeriodicalIF":3.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization
Pub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104729
Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun
Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for applications in 3D human modeling. However, precise recovery of hands is highly challenging due to the relatively small spatial proportion of hands, high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joints and depth cues from local hand regions. This module provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner. Compared with the traditional self-attention mechanism, this module significantly reduces the computational burden. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms state-of-the-art methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight design make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.
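To illustrate why pooling-based attention is cheaper than standard self-attention, the sketch below average-pools the keys and values before attending, reducing the token-pair count from N^2 to roughly N * N / r^2; it is a generic sketch of the cost-reduction idea, not the paper's ADPA module.

```python
import torch
import torch.nn as nn

class PooledAttention(nn.Module):
    """Illustrative pooling-based attention: keys and values are spatially
    average-pooled before attention, so the cost scales with N * (N / r^2)
    rather than N^2 as in standard self-attention."""
    def __init__(self, dim=256, num_heads=8, pool_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        kv = self.pool(x).flatten(2).transpose(1, 2)     # (B, H*W / r^2, C) pooled keys/values
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    print(PooledAttention()(torch.randn(1, 256, 32, 32)).shape)
```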
{"title":"Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization","authors":"Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun","doi":"10.1016/j.jvcir.2026.104729","DOIUrl":"10.1016/j.jvcir.2026.104729","url":null,"abstract":"<div><div>Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for applications in 3D human modeling. However, precise recovery of hands is highly challenging due to the relatively small spatial proportion of hands, high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joints and depth cues from local hand regions. This module provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner. Compared with the traditional self-attention mechanism, this module significantly reduces the computational burden. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms SOTA methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104729"},"PeriodicalIF":3.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global–local co-regularization network for facial action unit detection
Pub Date: 2026-01-21 | DOI: 10.1016/j.jvcir.2026.104728
Yumei Tan, Haiying Xia, Shuxiang Song
Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To address this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists of a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while excavating region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to pre-defined facial regions, and the IRM module then extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.
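A co-regularization objective of this kind can be sketched as supervised losses on both branches plus a consistency term that aligns their predictions; the loss choices and weighting below are assumptions, not Co-GLN's exact formulation.

```python
import torch
import torch.nn.functional as F

def co_regularization_loss(global_logits, local_logits, labels, lam=0.5):
    """Illustrative co-regularization objective: each branch is supervised with
    the multi-label AU annotations, and a consistency term pulls the two
    branches' predicted probabilities together. MSE consistency and the
    weighting lam are assumptions for illustration."""
    sup = (F.binary_cross_entropy_with_logits(global_logits, labels)
           + F.binary_cross_entropy_with_logits(local_logits, labels))
    consistency = F.mse_loss(torch.sigmoid(global_logits), torch.sigmoid(local_logits))
    return sup + lam * consistency

if __name__ == "__main__":
    g, l = torch.randn(4, 12), torch.randn(4, 12)        # logits for 12 action units
    y = torch.randint(0, 2, (4, 12)).float()             # binary AU labels
    print(co_regularization_loss(g, l, y).item())
```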
{"title":"Global–local co-regularization network for facial action unit detection","authors":"Yumei Tan , Haiying Xia , Shuxiang Song","doi":"10.1016/j.jvcir.2026.104728","DOIUrl":"10.1016/j.jvcir.2026.104728","url":null,"abstract":"<div><div>Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To solve this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while excavating region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to the pre-defined facial regions, then IRM module extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104728"},"PeriodicalIF":3.1,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eGoRG: GPU-accelerated depth estimation for immersive video applications based on graph cuts
Pub Date: 2026-01-19 | DOI: 10.1016/j.jvcir.2026.104727
Jaime Sancho, Manuel Villa, Miguel Chavarrias, Rubén Salvador, Eduardo Juarez, César Sanz
Immersive video is gaining relevance across various fields, but its integration into real applications remains limited due to the technical challenges of depth estimation. Generating accurate depth maps is essential for 3D rendering, yet high-quality algorithms can require hundreds of seconds to produce a single frame. While real-time depth estimation solutions exist — particularly monocular deep learning-based methods and active sensors such as time-of-flight or plenoptic cameras — their depth accuracy and multiview consistency are often insufficient for depth image-based rendering (DIBR) and immersive video applications. This highlights the persistent challenge of jointly achieving real-time performance and high-quality, correlated depth across views. This paper introduces eGoRG, a GPU-accelerated depth estimation algorithm based on MPEG DERS, which employs graph cuts to achieve high-quality results. eGoRG contributes a novel GPU-based graph cuts stage, integrating block-based push-relabel acceleration and a simplified alpha expansion method. These optimizations deliver quality comparable to leading graph-cut approaches while greatly improving speed. Evaluation on an MPEG multiview dataset and a static NeRF dataset demonstrates the algorithm’s effectiveness across different scenarios.
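For context, graph-cut depth estimation of this kind minimizes a pairwise MRF energy of the following standard form (the specific data and smoothness costs used by DERS and eGoRG may differ):

```latex
% Generic pairwise MRF energy minimized by graph-cut depth estimation
% (standard form; the exact cost terms in DERS/eGoRG may differ):
\[
E(d) = \sum_{p \in \mathcal{P}} D_p(d_p)
     + \lambda \sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)
\]
% D_p: photo-consistency cost of assigning depth label d_p to pixel p;
% V: smoothness penalty between neighboring pixels. Alpha expansion reduces
% this multi-label minimization to a sequence of binary min-cut/max-flow
% problems, which is the stage eGoRG accelerates with block-based push-relabel.
```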
{"title":"eGoRG: GPU-accelerated depth estimation for immersive video applications based on graph cuts","authors":"Jaime Sancho , Manuel Villa , Miguel Chavarrias , Rubén Salvador , Eduardo Juarez , César Sanz","doi":"10.1016/j.jvcir.2026.104727","DOIUrl":"10.1016/j.jvcir.2026.104727","url":null,"abstract":"<div><div>Immersive video is gaining relevance across various fields, but its integration into real applications remains limited due to the technical challenges of depth estimation. Generating accurate depth maps is essential for 3D rendering, yet high-quality algorithms can require hundreds of seconds to produce a single frame. While real-time depth estimation solutions exist — particularly monocular deep learning-based methods and active sensors such as time-of-flight or plenoptic cameras — their depth accuracy and multiview consistency are often insufficient for depth image-based rendering (DIBR) and immersive video applications. This highlights the persistent challenge of jointly achieving real-time performance and high-quality, correlated depth across views. This paper introduces eGoRG, a GPU-accelerated depth estimation algorithm based on MPEG DERS, which employs graph cuts to achieve high-quality results. eGoRG contributes a novel GPU-based graph cuts stage, integrating block-based push-relabel acceleration and a simplified alpha expansion method. These optimizations deliver quality comparable to leading graph-cut approaches while greatly improving speed. Evaluation on an MPEG multiview dataset and a static NeRF dataset demonstrates the algorithm’s effectiveness across different scenarios.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104727"},"PeriodicalIF":3.1,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MTPA: A multi-aspects perception assisted AIGV quality assessment model
Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104721
Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang
With the development of Artificial Intelligence (AI) generation technology, AI-generated video (AIGV) has attracted much attention. Compared with quality perception in traditional video, AIGV poses unique challenges such as visual consistency and text-to-video alignment. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model that evaluates AIGV quality comprehensively from three aspects: a text–video alignment score, a visual spatial perceptual score, and a visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to assess text-to-video alignment quality, a semantic-aware module is applied to capture visual spatial perceptual features, and an effective visual temporal feature extraction module captures multi-scale temporal features. Finally, the text–video alignment features, visual spatial and temporal perceptual features, and multi-scale visual fusion features are integrated to produce a comprehensive quality evaluation. Our model achieves state-of-the-art results on three public AIGV datasets, demonstrating its effectiveness.
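The text-video alignment aspect can be illustrated by a simple cosine-similarity score between a prompt embedding and per-frame embeddings from a vision-language model; the function below assumes precomputed embeddings and is a sketch of the scoring idea, not MTPA's actual module.

```python
import torch
import torch.nn.functional as F

def text_video_alignment_score(text_emb, frame_embs):
    """Illustrative text-video alignment score: cosine similarity between a
    prompt embedding and each frame embedding from a pre-trained
    vision-language model, averaged over frames. Embeddings are assumed to be
    precomputed by upstream encoders."""
    text_emb = F.normalize(text_emb, dim=-1)     # (D,)
    frame_embs = F.normalize(frame_embs, dim=-1) # (T, D)
    per_frame = frame_embs @ text_emb            # (T,) per-frame cosine similarities
    return per_frame.mean()

if __name__ == "__main__":
    score = text_video_alignment_score(torch.randn(512), torch.randn(16, 512))
    print(float(score))
```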
{"title":"MTPA: A multi-aspects perception assisted AIGV quality assessment model","authors":"Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang","doi":"10.1016/j.jvcir.2026.104721","DOIUrl":"10.1016/j.jvcir.2026.104721","url":null,"abstract":"<div><div>With the development of Artificial Intelligence (AI) generated technology, AI generated video (AIGV) has aroused much attention. Compared to the visual perceptual in traditional video, AIGV has its unique challenges, such as visual consistency, text-to-video alignment, etc. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: text–video alignment score, visual spatial perceptual score, and visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to study the text-to-video alignment quality, and the semantic-aware module is applied to capture the visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial, visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model holds state-of-the-art results on three public AIGV datasets, proving its effectiveness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104721"},"PeriodicalIF":3.1,"publicationDate":"2026-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ATR-Net: Attention-based temporal-refinement network for efficient facial emotion recognition in human–robot interaction
Pub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104720
Sougatamoy Biswas, Harshavardhan Reddy Gajarla, Anup Nandy, Asim Kumar Naskar
Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.
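The attention augmented convolution mentioned above can be sketched as a layer whose output channels combine a standard convolution with self-attention over spatial positions; the channel split and head count below are placeholders, not ATR-Net's configuration.

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv(nn.Module):
    """Illustrative attention-augmented convolution in the spirit of the AAC
    layer mentioned above: part of the output channels come from a standard
    convolution and part from self-attention over flattened spatial positions."""
    def __init__(self, in_ch=32, out_ch=64, attn_ch=16, num_heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - attn_ch, kernel_size=3, padding=1)
        self.to_attn = nn.Conv2d(in_ch, attn_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_ch, num_heads, batch_first=True)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)                                   # (B, out_ch - attn_ch, H, W)
        tokens = self.to_attn(x).flatten(2).transpose(1, 2)       # (B, H*W, attn_ch)
        attn_out, _ = self.attn(tokens, tokens, tokens)           # self-attention over positions
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)  # (B, attn_ch, H, W)
        return torch.cat([conv_out, attn_out], dim=1)             # (B, out_ch, H, W)

if __name__ == "__main__":
    print(AttentionAugmentedConv()(torch.randn(1, 32, 28, 28)).shape)
```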
{"title":"ATR-Net: Attention-based temporal-refinement network for efficient facial emotion recognition in human–robot interaction","authors":"Sougatamoy Biswas , Harshavardhan Reddy Gajarla , Anup Nandy , Asim Kumar Naskar","doi":"10.1016/j.jvcir.2026.104720","DOIUrl":"10.1016/j.jvcir.2026.104720","url":null,"abstract":"<div><div>Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104720"},"PeriodicalIF":3.1,"publicationDate":"2026-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}