Pub Date: 2026-03-01 | Epub Date: 2026-01-14 | DOI: 10.1016/j.jvcir.2026.104719
Multi-scale Spatial Frequency Interaction Variance Perception Model for Deepfake Face Detection
Yihang Wang, Shouxin Liu, Xudong Chen, Seok Tae Kim, Xiaowei Li
The negative effects of deepfake technology have attracted increasing attention and become a prominent social issue. Existing detection approaches typically refine conventional network architectures to uncover subtle manipulation traces, yet most focus exclusively on either spatial- or frequency-domain cues, overlooking their interaction. To address this limitation, we present a Multi-Scale Spatial-Frequency Variance-sensing (MSFV) model that combines spatial and frequency information through iterative, variance-guided self-attention. By integrating the two domains, MSFV strengthens detection capability and improves the identification of subtle manipulations in deepfake images. A dedicated high-frequency separation module further enhances the extraction of forgery indicators from the high-frequency components of manipulated images. Extensive experiments show that MSFV achieves classification accuracies of 98.95% on the DFDC dataset and 97.92% on the FaceForensics++ dataset, confirming its strong detection capability, generalization, and robustness compared with existing methods.
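For intuition on the frequency-domain side of this abstract, the sketch below isolates the high-frequency component of an image with a simple FFT radial mask; the cutoff radius and masking scheme are illustrative assumptions, not the paper's high-frequency separation module.

```python
import torch

def high_frequency_component(img: torch.Tensor, cutoff: float = 0.1) -> torch.Tensor:
    """Keep spectral components farther than `cutoff` (relative radius) from the
    spectrum centre, where manipulation artifacts tend to concentrate."""
    _, _, h, w = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, h), torch.linspace(-0.5, 0.5, w), indexing="ij"
    )
    radius = torch.sqrt(xx ** 2 + yy ** 2).to(img.device)
    high_pass = (radius > cutoff).float()              # 1 outside the low-frequency disc
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * high_pass, dim=(-2, -1)))
    return filtered.real                               # spatial high-frequency residual

# Example: extract a high-frequency map of face crops before feeding them to a detector.
faces = torch.rand(4, 3, 256, 256)
hf = high_frequency_component(faces)                   # same shape as the input
```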
{"title":"Multi-scale Spatial Frequency Interaction Variance Perception Model for Deepfake Face Detection","authors":"Yihang Wang , Shouxin Liu , Xudong Chen , Seok Tae Kim , Xiaowei Li","doi":"10.1016/j.jvcir.2026.104719","DOIUrl":"10.1016/j.jvcir.2026.104719","url":null,"abstract":"<div><div>The negative effects of deepfake technology have attracted increasing attention and become a prominent social issue. Existing detection approaches typically refine conventional network architectures to uncover subtle manipulation traces, yet most focus exclusively on either spatial- or frequency-domain cues, overlooking their interaction. To address the limitations in existing deepfake detection methods, we present an innovative Multi-Scale Spatial-Frequency Variance-sensing (MSFV) model. This model effectively combines spatial and frequency information by utilizing iterative, variance-guided self-attention mechanisms. By integrating these two domains, the MSFV model enhances detection capabilities and improves the identification of subtle manipulations present in deepfake images. A dedicated high-frequency separation module further enhances the extraction of forgery indicators from the high-frequency components of manipulated images. Extensive experiments demonstrate that MSFV achieves classification accuracies of 98.95 % on the DFDC dataset and 97.92 % on the FaceForensics++ dataset, confirming its strong detection capability, generalization, and robustness compared with existing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104719"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-01-30 | DOI: 10.1016/j.jvcir.2026.104737
TSNUNet: Two-Stage Nested U-Network for salient object detection
Luna Sun, Zhenxue Chen, Xinming Zhu, Yu Bi, Chengyun Liu, Q.M. Jonathan Wu
Although significant progress has been made in salient object detection, particularly with the advent of transformers, existing models still struggle with the integrity and accuracy of their predictions. To address these limitations, we propose the Two-Stage Nested U-Network (TSNUNet), which incorporates three innovative modules. First, the Pixel Shuffle Channel Convert Module (PSCCM) captures the latent cross-channel information distribution in high-level features, aligns multi-level features, and ensures accurate and complete initial predictions. Second, the Two-Stage Strategy, with its hybrid connections and a Two-Stage Fusion Module (TSFM), enables interactive learning across stages and propagates precise location cues, further boosting prediction accuracy. Third, we design the Nested U/Trans-U Module for robust cross-level feature decoding; it employs continuous pixel-unshuffle down-sampling, hierarchical adaptive top-layer enhancement, and multi-level pixel-shuffle feature reconstruction, improving feature representation and accuracy. Finally, through a two-stage combined supervision mechanism, TSNUNet segments salient objects that are both complete and accurate. Experiments on 7 SOD and 4 cross-domain datasets show that TSNUNet outperforms state-of-the-art methods with strong generalization capability. On an RTX 4090 GPU, our SwinB-based model attains approximately 96 FPS in ideal forward inference and 61.49 FPS in practical end-to-end testing, demonstrating its real-time capability. Code: https://github.com/LnSCV/TSNUNet.
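For readers unfamiliar with the pixel-shuffle operations the modules above rely on, the snippet below shows how pixel-unshuffle down-sampling and pixel-shuffle reconstruction exchange spatial resolution for channels without discarding pixels; the channel widths and the 1x1 mixing layer are assumptions for the demo, not TSNUNet's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)                 # feature map from some encoder stage

down = F.pixel_unshuffle(x, downscale_factor=2)  # -> (1, 256, 64, 64), lossless rearrangement
mixed = nn.Conv2d(256, 256, kernel_size=1)(down) # cheap cross-channel interaction on the packed map
up = F.pixel_shuffle(mixed, upscale_factor=2)    # -> (1, 64, 128, 128), back to input resolution

assert up.shape == x.shape                       # resolution restored without interpolation
```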
{"title":"TSNUNet: Two-Stage Nested U-Network for salient object detection","authors":"Luna Sun , Zhenxue Chen , Xinming Zhu , Yu Bi , Chengyun Liu , Q.M. Jonathan Wu","doi":"10.1016/j.jvcir.2026.104737","DOIUrl":"10.1016/j.jvcir.2026.104737","url":null,"abstract":"<div><div>Recently, while significant progress has been made in salient object detection, particularly with the advent of transformers, existing models still face challenges regarding the integrity and accuracy of predictions. To address these limitations, we propose the Two-Stage Nested U-Network (TSNUNet), which incorporates three innovative modules. First, the Pixel Shuffle Channel Convert Module (PSCCM) captures the potential cross-channel information distribution in high-level features, aligns multi-level features, and ensures accurate and complete initial predictions. Second, the Two-Stage Strategy, with its hybrid connections and a Two-Stage Fusion Module (TSFM), facilitates interactive learning across stages and directs precise location cues, further boosting prediction accuracy. Third, we design the Nested U/Trans-U Module for robust cross-level feature decoding. The Nested U/Trans-U Module employs continuous pixel unshuffle down-sampling, hierarchical adaptive top-layer enhancement, and multi-level pixel shuffle feature reconstruction, specifically contributing to improved feature representation and accuracy. Finally, through our Two-Stage combined supervision mechanism, TSNUNet is capable of effectively segmenting both complete and accurate salient objects. Experiments on 7 SOD and 4 cross-domain datasets show TSNUNet outperforms state-of-the-art methods with strong generalization capability. On RTX 4090 GPU, our SwinB-based model attains approximately 96 FPS in ideal forward inference and 61.49 FPS in practical end-to-end testing, demonstrating its real-time capability. Code: <span><span>https://github.com/LnSCV/TSNUNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104737"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104729
Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization
Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun
Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for 3D human modeling applications. However, precise recovery of hands is highly challenging due to their small spatial proportion, high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joint and depth cues from local hand regions; it provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner, significantly reducing the computational burden compared with the traditional self-attention mechanism. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms SOTA methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight design make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.
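As a rough illustration of what a lightweight dual-branch pooling attention can look like, the sketch below gates features with a global channel branch and a locally pooled spatial branch; it is an assumption-level rendering of the general idea, not the published ADPA module.

```python
import torch
import torch.nn as nn

class DualBranchPoolingAttention(nn.Module):
    """Cheap attention: a global channel gate plus a locally pooled spatial gate."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # global branch, squeeze-and-excite style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(          # local branch, pooled spatial map
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.channel_gate(x) * self.spatial_gate(x)

# Usage: attn = DualBranchPoolingAttention(128); y = attn(torch.randn(2, 128, 32, 32))
```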
{"title":"Lightweight whole-body mesh recovery with joints and depth aware hand detail optimization","authors":"Zilong Yang, Shujun Zhang, Xiao Wang, Hu Jin, Limin Sun","doi":"10.1016/j.jvcir.2026.104729","DOIUrl":"10.1016/j.jvcir.2026.104729","url":null,"abstract":"<div><div>Expressive whole-body mesh recovery aims to estimate 3D human pose and shape parameters, including the face and hands, from a monocular image. Since hand details play a crucial role in conveying human posture, accurate hand reconstruction is of great importance for applications in 3D human modeling. However, precise recovery of hands is highly challenging due to the relatively small spatial proportion of hands, high flexibility, diverse gestures, and frequent occlusions. In this work, we propose a lightweight whole-body mesh recovery framework that enhances hand detail reconstruction while reducing computational complexity. Specifically, we introduce a Joints and Depth Aware Fusion (JDAF) module that adaptively encodes geometric joints and depth cues from local hand regions. This module provides strong 3D priors and effectively guides the regression of accurate hand parameters. In addition, we propose an Adaptive Dual-branch Pooling Attention (ADPA) module that models global context and local fine-grained interactions in a lightweight manner. Compared with the traditional self-attention mechanism, this module significantly reduces the computational burden. Experiments on the EHF and UBody benchmarks demonstrate that our approach outperforms SOTA methods, reducing body MPVPE by 8.5% and hand PA-MPVPE by 6.2%, while significantly lowering the number of parameters and MACs. More importantly, its efficiency and lightweight make it particularly suitable for real-time visual communication scenarios such as immersive conferencing, sign language translation, and VR/AR interaction.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104729"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-01-27 | DOI: 10.1016/j.jvcir.2026.104733
SFNet: Hierarchical perception and adaptive test-time training for AI-generated military image detection
Minyang Li, Wenpeng Mu, Yifan Yuan, Shengyan Li, Qiang Xu
Existing general-purpose forgery detection techniques fall short in military scenarios because they lack military-specific priors about how real assets are designed, manufactured, and deployed. Authentic military platforms obey strict engineering and design standards, resulting in highly regular structural layouts and characteristic material textures, whereas AI-generated forgeries often exhibit subtle violations of these constraints. To address this critical gap, we introduce SentinelFakeNet (SFNet), a novel framework specifically designed for detecting AI-generated military images. SFNet features the Military Hierarchical Perception (MHP) Module, which extracts military-relevant hierarchical representations via Cross-Level Feature Fusion (CLFF) — a mechanism that intricately combines features from varying depths of the backbone. Furthermore, to ensure robustness and adaptability to diverse generative models, we propose the Military Adaptive Test-Time Training (MATTT) strategy, which incorporates Local Consistency Verification (LCV) and Multi-Scale Signature Analysis (MSSA) as specially designed tasks. To facilitate research in this domain, we also introduce MilForgery, the first large-scale military image forensic dataset comprising 800,000 authentic and synthetically generated military-related images. Extensive experiments demonstrate that our method achieves 95.80% average accuracy, representing state-of-the-art performance. Moreover, it exhibits superior generalization capabilities on public AIGC detection benchmarks, outperforming the leading baselines by +8.47% and +6.49% on GenImage and ForenSynths in average accuracy, respectively. Our code will be available on the author’s homepage.
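The general test-time training pattern that MATTT builds on can be summarized as below: adapt the backbone on a self-supervised auxiliary loss computed from the test image itself, then classify with the adapted features. The consistency loss between two augmented views is a stand-in assumption, not the paper's LCV or MSSA tasks, and the module interfaces are hypothetical.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(backbone, classifier, aux_head, augment, image, steps=1, lr=1e-4):
    """Adapt `backbone` on an auxiliary self-supervised loss, then classify `image`."""
    optimizer = torch.optim.SGD(backbone.parameters(), lr=lr)
    for _ in range(steps):
        v1, v2 = augment(image), augment(image)            # two views of the same test image
        aux_loss = F.mse_loss(aux_head(backbone(v1)), aux_head(backbone(v2)))
        optimizer.zero_grad()
        aux_loss.backward()
        optimizer.step()
    with torch.no_grad():
        return classifier(backbone(image))                 # prediction after adaptation
```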
{"title":"SFNet: Hierarchical perception and adaptive test-time training for AI-generated military image detection","authors":"Minyang Li , Wenpeng Mu , Yifan Yuan , Shengyan Li , Qiang Xu","doi":"10.1016/j.jvcir.2026.104733","DOIUrl":"10.1016/j.jvcir.2026.104733","url":null,"abstract":"<div><div>Existing general-purpose forgery detection techniques fall short in military scenarios because they lack military-specific priors about how real assets are designed, manufactured, and deployed. Authentic military platforms obey strict engineering and design standards, resulting in highly regular structural layouts and characteristic material textures, whereas AI-generated forgeries often exhibit subtle violations of these constraints. To address this critical gap, we introduce SentinelFakeNet (SFNet), a novel framework specifically designed for detecting AI-generated military images. SFNet features the Military Hierarchical Perception (MHP) Module, which extracts military-relevant hierarchical representations via Cross-Level Feature Fusion (CLFF) — a mechanism that intricately combines features from varying depths of the backbone. Furthermore, to ensure robustness and adaptability to diverse generative models, we propose the Military Adaptive Test-Time Training (MATTT) strategy, which incorporates Local Consistency Verification (LCV) and Multi-Scale Signature Analysis (MSSA) as specially designed tasks. To facilitate research in this domain, we also introduce MilForgery, the first large-scale military image forensic dataset comprising 800,000 authentic and synthetically generated military-related images. Extensive experiments demonstrate that our method achieves 95.80% average accuracy, representing state-of-the-art performance. Moreover, it exhibits superior generalization capabilities on public AIGC detection benchmarks, outperforming the leading baselines by +8.47% and +6.49% on GenImage and ForenSynths in average accuracy, respectively. Our code will be available on the author’s homepage.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104733"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.jvcir.2026.104732
Fine-grained aesthetic multi-attribute captioning with aligned vision-language representations
Hongtao Yang, Yehui Liu, Minzheng Jia, Lu Han, Yongqiang Kong, Xin Jin, Ping Shi
Image aesthetic multi-attribute captioning emphasizes fine-grained aesthetic attributes, capturing intricate aesthetic characteristics from diverse perspectives and reflecting a more nuanced understanding of aesthetics across a wide spectrum of aesthetic semantics. Despite its potential, aesthetic multi-attribute captioning remains underexplored. This paper introduces a novel image aesthetic multi-attribute captioning method grounded in vision-language pre-training, which addresses the limited expressiveness of aesthetic information by generating fine-grained, attribute-aware aesthetic descriptions that enrich semantic depth and interpretability. Adopting a “pre-training and fine-tuning” paradigm, the proposed framework leverages CLIP and GPT-2 architectures, aligning CLIP-derived visual features with the GPT-2 embedding space via a cross-modal mapping network. The incorporation of aesthetic attribute control flags enables precise regulation of the generated aesthetic multi-attribute captions. Experimental results demonstrate that our method surpasses mainstream approaches across several metrics, including BLEU, METEOR, and SPICE, on the DPC-MAC and PCCD datasets. Furthermore, ablation studies on multi-stage aesthetic pre-training substantiate the effectiveness of the proposed strategy. The model consistently produces aesthetically coherent and attribute-aligned captions, underscoring its potential for advanced aesthetic analysis.
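The cross-modal mapping described above follows the familiar prefix-style pattern of projecting a CLIP image embedding into a sequence of GPT-2 prefix embeddings. The sketch below illustrates that pattern only; the prefix length, hidden sizes, and the way the attribute control flag is injected are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ClipToGPT2Prefix(nn.Module):
    """Map a CLIP image embedding (plus an attribute flag) to GPT-2 prefix embeddings."""
    def __init__(self, clip_dim=512, gpt2_dim=768, prefix_len=10, num_attributes=7):
        super().__init__()
        self.prefix_len, self.gpt2_dim = prefix_len, gpt2_dim
        self.attr_embed = nn.Embedding(num_attributes, clip_dim)   # aesthetic attribute control flag
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (gpt2_dim * prefix_len) // 2),
            nn.Tanh(),
            nn.Linear((gpt2_dim * prefix_len) // 2, gpt2_dim * prefix_len),
        )

    def forward(self, clip_feat: torch.Tensor, attr_id: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, clip_dim), attr_id: (B,) -> prefix embeddings: (B, prefix_len, gpt2_dim)
        conditioned = clip_feat + self.attr_embed(attr_id)
        return self.mlp(conditioned).view(-1, self.prefix_len, self.gpt2_dim)

# The resulting prefix would be concatenated with caption token embeddings and fed to GPT-2.
```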
{"title":"Fine-grained aesthetic multi-attribute captioning with aligned vision-language representations","authors":"Hongtao Yang , Yehui Liu , Minzheng Jia , Lu Han , Yongqiang Kong , Xin Jin , Ping Shi","doi":"10.1016/j.jvcir.2026.104732","DOIUrl":"10.1016/j.jvcir.2026.104732","url":null,"abstract":"<div><div>Image aesthetic multi-attribute captioning emphasizes fine-grained aesthetic attributes, capturing intricate aesthetic characteristics from diverse perspectives and reflecting a more nuanced and profound understanding of aesthetics, encompassing a wide spectrum of aesthetic semantics. Despite its potential, current approaches to aesthetic multi-attribute captioning remain underexplored. This paper introduces a novel image aesthetic multi-attribute captioning method grounded in vision-language pre-training, aimed at addressing the inadequacy in aesthetic information expression by generating fine-grained attribute-aware aesthetic descriptions to enrich semantic depth and interpretability. Adopting a “pre-training and fine-tuning” paradigm, the proposed framework leverages CLIP and GPT-2 architectures, aligning CLIP-derived visual features with the GPT-2 embedding space via a cross-modal mapping network. The incorporation of aesthetic attribute control flags enable precise regulation of the generated aesthetic multi-attribute captions. Experimental results demonstrate that our method surpasses mainstream approaches across several metrics on DPC-MAC and PCCD dataset, including BLEU, METEOR, SPICE, etc. Furthermore, ablation studies on multi-stage aesthetic pre-training substantiate the effectiveness of the proposed strategy. The model consistently produces aesthetically coherent and attribute-aligned captions, underscoring its potential for advanced aesthetic analysis.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104732"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tracking a moving target with unmanned aerial vehicles (UAVs) poses significant challenges due to the substantial distance between the camera and target and the high relative motion. Trackers must efficiently process both appearance and motion information while adhering to the constraints of UAVs’ limited onboard computing power and real-time operational demands. Although current state-of-the-art (SOTA) UAV trackers rely on compact network structures, optimizing performance without increasing complexity remains a daunting challenge. This paper introduces a data-centric approach to enhance tracking performance in UAV environments. We first critique the limitations of existing datasets and propose a novel data mining strategy that leads to the development of the UAVSOT dataset. This dataset provides a more detailed representation for single-object tracking in UAV scenarios, effectively addressing the shortcomings of current datasets. Our experiments show that methods trained on UAVSOT significantly enhance tracking accuracy without additional computational overhead. Additionally, we compare model-centric and data-centric approaches to underscore the efficacy of our data-driven strategy in optimizing UAV trackers. The code and raw results can be found at https://github.com/caixiongyou/UAV-DC-Track.
{"title":"Data-centric is a novel perspective for UAV-based tracking: A new benchmark via efficient data utilization strategy","authors":"Xiongyou Cai, Shuguang Wu, Shiwen Li, Hongru Zhang","doi":"10.1016/j.jvcir.2026.104743","DOIUrl":"10.1016/j.jvcir.2026.104743","url":null,"abstract":"<div><div>Tracking a moving target with unmanned aerial vehicles (UAVs) poses significant challenges due to the substantial distance between the camera and target and the high relative motion. Trackers must efficiently process both appearance and motion information while adhering to the constraints of UAVs’ limited onboard computing power and real-time operational demands. Although current state-of-the-art (SOTA) UAV trackers rely on compact network structures, optimizing performance without increasing complexity remains a daunting challenge. This paper introduces a data-centric approach to enhance tracking performance in UAV environments. We first critique the limitations of existing datasets and propose a novel data mining strategy that leads to the development of the UAVSOT dataset. This dataset provides a more detailed representation for single-object tracking in UAV scenarios, effectively addressing the shortcomings of current datasets. Our experiments show that methods trained on UAVSOT significantly enhance tracking accuracy without additional computational overhead. Additionally, we compare model-centric and data-centric approaches to underscore the efficacy of our data-driven strategy in optimizing UAV trackers. The code and raw results can be found at <span><span>https://github.com/caixiongyou/UAV-DC-Track</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104743"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147398303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocusing and motion. Since a blurry image could originate from many different sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advances in deep learning have shifted the focus towards directly predicting sharp images with techniques such as convolutional neural networks (CNNs), generative adversarial networks (GANs), and recurrent neural networks (RNNs). Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.
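For reference, the classical formulation the survey starts from can be written as the convolutional degradation model and its MAP estimate; the squared data term in the last line assumes Gaussian noise, and phi, psi stand for generic priors on the sharp image and the kernel.

```latex
% y: observed blurry image, x: latent sharp image, k: blur kernel, n: noise
\begin{align}
  y &= k \ast x + n \\
  (\hat{x}, \hat{k}) &= \arg\max_{x,\,k}\; p(x, k \mid y)
                      = \arg\max_{x,\,k}\; p(y \mid x, k)\, p(x)\, p(k) \\
                     &= \arg\min_{x,\,k}\; \|y - k \ast x\|_2^2
                      + \lambda_x\, \phi(x) + \lambda_k\, \psi(k)
\end{align}
```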
{"title":"Exploring the transformer-based and diffusion-based models for single image deblurring","authors":"Seunghwan Park , Chaehun Shin , Jaihyun Lew , Sungroh Yoon","doi":"10.1016/j.jvcir.2026.104735","DOIUrl":"10.1016/j.jvcir.2026.104735","url":null,"abstract":"<div><div>Image deblurring is a fundamental task in image restoration (IR) aimed at removing blurring artifacts caused by factors such as defocusing, motions, and others. Since a blurry image could be originated from various sharp images, deblurring is regarded as an ill-posed problem with multiple valid solutions. The evolution of deblurring techniques spans from rule-based algorithms to deep learning-based models. Early research focused on estimating blur kernels using maximum a posteriori (MAP) estimation, but advancements in deep learning have shifted the focus towards directly predicting sharp images by leveraging deep learning techniques such as convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs), and others. Building on these foundations, recent studies have advanced along two directions: transformer-based architectural innovations and diffusion-based algorithmic advances. This survey provides an in-depth investigation of recent deblurring models and traditional approaches. Furthermore, we conduct a fair re-evaluation under a unified evaluation protocol.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104735"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146079537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.
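The core of the attention-augmented convolution idea referenced above (in the sense of Bello et al.) is to run a standard convolution and a self-attention branch in parallel and concatenate their outputs along channels; the sketch below shows that pattern with illustrative head counts and channel splits, not ATR-Net's exact settings.

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv(nn.Module):
    """Concatenate a convolutional branch with a global self-attention branch."""
    def __init__(self, in_ch: int, out_ch: int, attn_ch: int = 16, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - attn_ch, kernel_size=3, padding=1)
        self.to_attn = nn.Conv2d(in_ch, attn_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(embed_dim=attn_ch, num_heads=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        conv_out = self.conv(x)                                   # (B, out_ch - attn_ch, H, W)
        tokens = self.to_attn(x).flatten(2).transpose(1, 2)       # (B, H*W, attn_ch)
        attn_out, _ = self.attn(tokens, tokens, tokens)           # global self-attention
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)  # back to a feature map
        return torch.cat([conv_out, attn_out], dim=1)             # (B, out_ch, H, W)

# Usage: aac = AttentionAugmentedConv(32, 64); y = aac(torch.randn(2, 32, 28, 28))
```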
{"title":"ATR-Net: Attention-based temporal-refinement network for efficient facial emotion recognition in human–robot interaction","authors":"Sougatamoy Biswas , Harshavardhan Reddy Gajarla , Anup Nandy , Asim Kumar Naskar","doi":"10.1016/j.jvcir.2026.104720","DOIUrl":"10.1016/j.jvcir.2026.104720","url":null,"abstract":"<div><div>Facial Emotion Recognition (FER) enables human–robot interaction by allowing robots to interpret human emotions effectively. Traditional FER models achieve high accuracy but are often computationally intensive, limiting real-time application on resource-constrained devices. These models also face challenges in capturing subtle emotional expressions and addressing variations in facial poses. This study proposes a lightweight FER model based on EfficientNet-B0, balancing accuracy and efficiency for real-time deployment on embedded robotic systems. The proposed architecture integrates an Attention Augmented Convolution (AAC) layer with EfficientNet-B0 to enhance the model’s focus on subtle emotional cues, enabling robust performance in complex environments. Additionally, a Pyramid Channel-Gated Attention with a Temporal Refinement Block is introduced to capture spatial and channel dependencies, ensuring adaptability and efficiency on resource-limited devices. The model achieves accuracies of 74.22% on FER-2013, 99.14% on CK+, and 67.36% on AffectNet-7. These results demonstrate its efficiency and robustness for facial emotion recognition in human–robot interaction.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104720"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104726
Human-in-the-loop dual-branch architecture for image super-resolution
Suraj Neelakantan, Martin Längkvist, Amy Loutfi
Single-image super-resolution aims to recover high-frequency detail from a single low-resolution image, but practical applications often require balancing distortion against perceptual quality. Existing methods typically produce a single fixed reconstruction and offer limited test-time control over this trade-off. This paper presents DR-SCAN, a dual-branch deep residual network for single-image super-resolution in which weights can be assigned to either branch at inference time to dynamically steer their respective contributions to the reconstructed output. An interactive interface enables users to re-weight the shallow and deep branches at inference or run a one-click LPIPS search to navigate the distortion–perception trade-off without retraining the model. Ablation experiments confirm that both the second branch and the channel–spatial attention used within the residual blocks are essential for high-quality reconstruction, while the interactive tuning routine demonstrates the practical value of post-hoc branch fusion.
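A minimal sketch of the post-hoc branch fusion and a one-click-style LPIPS search, assuming a model that exposes its shallow- and deep-branch reconstructions; the blend rule and search grid are illustrative, not DR-SCAN's actual interface.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # expects 4D tensors scaled to [-1, 1]

def fuse(shallow_out: torch.Tensor, deep_out: torch.Tensor, w: float) -> torch.Tensor:
    """Blend the two branch reconstructions with a scalar weight w in [0, 1]."""
    return w * deep_out + (1.0 - w) * shallow_out

def lpips_search(shallow_out, deep_out, reference, grid: int = 11):
    """Pick the blend weight that minimises LPIPS against a reference image."""
    best_w, best_score = 0.0, float("inf")
    with torch.no_grad():
        for w in torch.linspace(0, 1, grid):
            score = loss_fn(fuse(shallow_out, deep_out, w.item()), reference).item()
            if score < best_score:
                best_w, best_score = w.item(), score
    return best_w, best_score
```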
{"title":"Human-in-the-loop dual-branch architecture for image super-resolution","authors":"Suraj Neelakantan, Martin Längkvist, Amy Loutfi","doi":"10.1016/j.jvcir.2026.104726","DOIUrl":"10.1016/j.jvcir.2026.104726","url":null,"abstract":"<div><div>Single-image super-resolution aims to recover high-frequency detail from a single low-resolution image, but practical applications often require balancing distortion against perceptual quality. Existing methods typically produce a single fixed reconstruction and offer limited test-time control over this trade-off. This paper presents DR-SCAN, a dual-branch deep residual network for single-image super-resolution in which, during test-time inference, weights can be assigned to either of the branches to dynamically steer their respective contributions to the reconstructed output. An interactive interface enables users to re-weight the shallow and deep branches at inference or run a one-click LPIPS search, to navigate the distortion–perception trade-off without retraining the model. Ablation experiments confirm that both the second branch and the channel–spatial attention that is used within the residual blocks are essential for the network for better reconstruction, while the interactive tuning routine demonstrates the practical value of post-hoc branch fusion.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104726"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-03-01 | Epub Date: 2026-01-15 | DOI: 10.1016/j.jvcir.2026.104724
Towards fast and effective low-light image enhancement via adaptive Gamma correction and detail refinement
Shaoping Xu, Qiyu Chen, Liang Peng, Hanyang Hu, Wuyong Tao
Over the past decade, deep neural networks have significantly advanced low-light image enhancement (LLIE), achieving marked improvements in perceptual quality and robustness. However, these gains are increasingly accompanied by architectural complexity and computational inefficiency, widening the gap between enhancement performance and real-time applicability. This trade-off poses a critical challenge for time-sensitive scenarios requiring both high visual quality and efficient execution. To resolve the efficiency–quality trade-off in LLIE, we propose an ultra-lightweight framework comprising two computationally efficient modules: the adaptive Gamma correction module (AGCM) and the nonlinear refinement module (NRM). Specifically, the AGCM employs lightweight convolutions to generate spatially adaptive, pixel-wise Gamma maps that simultaneously mitigate global underexposure and suppress highlight overexposure, thereby preserving scene-specific luminance characteristics and ensuring visually natural global enhancement. Subsequently, the NRM employs two nonlinear transformation layers that logarithmically compress highlights and adaptively stretch shadows, effectively restoring local details without semantic distortion. Moreover, the first nonlinear transformation layer within the NRM incorporates residual connections to facilitate the capture and exploitation of subtle image features. Finally, the AGCM and NRM modules are jointly optimized using a hybrid loss function combining a reference-based fidelity term and no-reference perceptual metrics (i.e., local contrast, colorfulness, and exposure balance). Extensive experiments demonstrate that the proposed LLIE framework delivers performance comparable to state-of-the-art (SOTA) algorithms, while requiring only 8K parameters, achieving an optimal trade-off between enhancement quality and computational efficiency. This performance stems from our two-stage ultra-lightweight design: global illumination correction via pixel-adaptive Gamma adjustment, followed by detail-aware nonlinear refinement, all realized within a minimally parameterized architecture. As a result, the framework is uniquely suited for real-time deployment in resource-constrained environments.
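Pixel-wise adaptive Gamma correction of the kind the AGCM performs can be sketched as a tiny convolutional head that predicts a per-pixel Gamma map applied as out = in ** gamma; the layer widths and Gamma range below are assumptions for illustration, not the published module.

```python
import torch
import torch.nn as nn

class PixelwiseGamma(nn.Module):
    """Predict a per-pixel Gamma map and apply it to a low-light image in [0, 1]."""
    def __init__(self, gamma_min: float = 0.4, gamma_max: float = 2.5):
        super().__init__()
        self.gamma_min, self.gamma_max = gamma_min, gamma_max
        self.head = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid(),     # map to (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # gamma < 1 brightens shadows, gamma > 1 tames highlights, chosen per pixel
        gamma = self.gamma_min + (self.gamma_max - self.gamma_min) * self.head(x)
        return torch.clamp(x, min=1e-6) ** gamma

# Usage: out = PixelwiseGamma()(torch.rand(1, 3, 256, 256))
```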
{"title":"Towards fast and effective low-light image enhancement via adaptive Gamma correction and detail refinement","authors":"Shaoping Xu, Qiyu Chen, Liang Peng, Hanyang Hu, Wuyong Tao","doi":"10.1016/j.jvcir.2026.104724","DOIUrl":"10.1016/j.jvcir.2026.104724","url":null,"abstract":"<div><div>Over the past decade, deep neural networks have significantly advanced low-light image enhancement (LLIE), achieving marked improvements in perceptual quality and robustness. However, these gains are increasingly accompanied by architectural complexity and computational inefficiency, widening the gap between enhancement performance and real-time applicability. This trade-off poses a critical challenge for time-sensitive scenarios requiring both high visual quality and efficient execution. To resolve the efficiency–quality trade-off in LLIE, we propose an ultra-lightweight framework comprising two computationally efficient modules: the adaptive Gamma correction module (AGCM) and the nonlinear refinement module (NRM). Specifically, the AGCM employs lightweight convolutions to generate spatially adaptive, pixel-wise Gamma maps that simultaneously mitigate global underexposure and suppress highlight overexposure, thereby preserving scene-specific luminance characteristics and ensuring visually natural global enhancement. Subsequently, the NRM employs two nonlinear transformation layers that logarithmically compress highlights and adaptively stretch shadows, effectively restoring local details without semantic distortion. Moreover, the first nonlinear transformation layer within the NRM incorporates residual connections to facilitate the capture and exploitation of subtle image features. Finally, the AGCM and NRM modules are jointly optimized using a hybrid loss function combining a reference-based fidelity term and no-reference perceptual metrics (i.e., local contrast, colorfulness, and exposure balance). Extensive experiments demonstrate that the proposed LLIE framework delivers performance comparable to state-of-the-art (SOTA) algorithms, while requiring only 8K parameters, achieving an optimal trade-off between enhancement quality and computational efficiency. This performance stems from our two-stage ultra-lightweight design: global illumination correction via pixel-adaptive Gamma adjustment, followed by detail-aware nonlinear refinement, all realized within a minimally parameterized architecture. As a result, the framework is uniquely suited for real-time deployment in resource-constrained environments.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104724"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}