Pub Date: 2026-01-01 DOI: 10.1016/j.jvcir.2026.104708
Zheng Zhao, Yufan Feng, Shangxin Li, Qian Mao, Tingwei Chen, Qi Zhao, Xiaoya Fan
Underwater wireless communication is critical for ocean exploration, but traditional technologies often suffer from high cost, high power consumption, and bulky equipment. This paper presents CQR-UC, a short-range underwater wireless communication method based on color QR (CQR) codes. To improve the recognition of CQR codes in underwater environments, a CQR-GAN model is proposed to enhance CQR code images. In addition, a dedicated underwater communication protocol, CUP, is designed to support continuous and reliable bidirectional data transmission. Experimental results demonstrate the efficacy and performance of CQR-UC in various underwater environments. On resource-constrained devices, our CQR-UC system can achieve a cost below $40, power consumption under 15 W, and excellent portability. The code is publicly available at https://github.com/XploreAI-Lab/CQR-UC.
Title: CQR-UC: A color QR code-based underwater wireless communication method with GAN-based image enhancement. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104708.
Ensuring clear underwater visibility is crucial for disciplines such as marine robotics, oceanography and marine biology, which require fast, high-quality image processing to support real-time analysis. This paper introduces the Underwater Scene Restoration Network (USRNet), whose efficient and lightweight architecture overcomes critical limitations of prior work. Unlike current approaches that separately estimate complex underwater optical parameters, the proposed USRNet employs an end-to-end approach to jointly estimate the Transmission Map (TM) and Background Light (BL) within the network using a reformulated version of the Atmospheric Scattering Model (ASM). USRNet consists of two distinct modules. The first is a dedicated Color Cast Removal (CCR) module, which neutralizes color casts by learning the inherent color shifts in underwater scenes. The second is the Scene Radiance Estimation (SRE) module, which focuses on reconstructing a high-quality approximation of the final restored image. Comprehensive evaluations across multiple datasets validate our approach in both quantitative and qualitative metrics.
Title: USRNet: A simple yet effective Underwater Scene Restoration Network. Authors: Shabnam Thakur, Jhilik Bhattacharya, Shailendra Tiwari. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104710. Pub Date: 2026-01-01 DOI: 10.1016/j.jvcir.2026.104710
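For reference, the Atmospheric Scattering Model named in the USRNet abstract is commonly written as below. This is the standard textbook form, not the reformulated version used by USRNet, which the abstract does not spell out. The symbols are the usual conventions and are assumptions here: I is the observed image, J the scene radiance, t the transmission map, and B the background light.

```latex
I(x) = J(x)\,t(x) + B\,\bigl(1 - t(x)\bigr)
\quad\Longrightarrow\quad
J(x) = \frac{I(x) - B}{t(x)} + B
```

The second form shows why the transmission map and the background light must both be estimated before the scene radiance can be recovered.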
Pub Date: 2026-01-01 DOI: 10.1016/j.jvcir.2025.104696
Xiaojian Pan, Guang Li, Ningfei Zhang, Jianjun Li
Recently, self-supervised pretraining paradigms have been extensively investigated in the domain of skeleton-based 3D human pose estimation. In particular, methods based on masked prediction have elevated the performance of pretraining to new heights. The proposed two-stage model aims to capture richer and more significant information. Specifically, the pretraining module is designed to extract enhanced representations, and the Hybrid Dual-Stream Spatio-Temporal Network (HDSTN) processes these representations to recover detailed 3D pose information. In the pretraining phase, an improved teacher model uses the original input data to generate prediction targets for the student model. The HDSTN integrates Transformer-GCNFormer (TGFormer) blocks, which employ two parallel processing streams. The Transformer stream captures long-range dependencies, while the GCNFormer stream focuses on learning local spatial–temporal relationships between joints. By combining the strengths of both approaches, TGFormer reduces dimensionality efficiently and provides a more comprehensive representation of the 3D human pose structure. The GCNFormer module leverages the local relationships between adjacent joints to generate a new representation that complements the Transformer’s output. By adaptively fusing these two representations, TGFormer demonstrates enhanced capability in learning the underlying 3D structure. This manuscript extends our earlier conference paper [Zhang et al., 2024 (AIHCIR)], which introduced a two-stage transformer-based pipeline for 3D pose lifting.
Title: 3D human pose estimation based on a Hybrid approach of Transformer and GCN-Former. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104696.
Pub Date: 2026-01-01 DOI: 10.1016/j.jvcir.2025.104680
Yang Zhao, Yongsheng Dong
CLIPstyler is a typical text-guided style transfer method. However, it often causes content distortion and stylization inconsistency in the generated images. To alleviate these issues, we propose a Dual U-Net (DU-Net) architecture for semantic text-guided style transfer. Specifically, we first construct a Channel–Spatial Fusion Attention (CSFA) module and add it after each upsampling step in our DU-Net. It simultaneously considers channel and spatial information to enhance the DU-Net’s ability to interpret and utilize input features. In addition, we design a novel loss function, Context-Aware Intersection over Union (CAIoU), which combines context aggregation with traditional IoU to optimize style transfer by balancing stylization and content preservation. Extensive qualitative and quantitative experiments on various images and texts show that the stylized images generated by our DU-Net outperform those of several representative methods in terms of style fidelity and content completeness. Code can be found at https://github.com/ZhaoMyang/DU-Net.
Title: DU-Net: A Dual U-Net for semantic text-guided style transfer. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104680.
Pub Date: 2026-01-01 DOI: 10.1016/j.jvcir.2025.104705
Chenpeng Lu, Huanqiang Zeng, Chao Jiao, Jing Chen, Qi Lin, Huijie Zheng
The λ-domain rate control in Versatile Video Coding (VVC) achieves remarkable performance, enabling higher visual quality under stringent bit constraints. However, for high-resolution video, a significant portion of Coding Tree Units (CTUs) are skipped blocks, and using the updated parameters from skip CTUs may lead to unreasonable bit allocation. This paper presents a novel rate control method for VVC that separately allocates bits to skip and non-skip CTUs. The bits for skip CTUs can be calculated at once based on the frame’s target bits per pixel (bpp) without separate calculations. For non-skip CTUs, we first formulate the bit allocation problem as a Nash equilibrium problem inspired by game theory. Subsequently, we design a utility function based on Gradient Magnitude Similarity Deviation (GMSD) to quantify the degradation in gradient information caused by encoding. The λ parameter for bit allocation of non-skip CTUs is calculated accordingly. The proposed method is implemented in VTM 13.0, and experimental results confirm its effectiveness in enhancing visual quality and achieving significant reductions in bit rates.
Title: Gradient degradation-aware rate control for VVC using Nash equilibrium. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104705.
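The abstract above builds its utility function on Gradient Magnitude Similarity Deviation (GMSD) but does not give the function itself. As a rough illustration of the underlying metric only, the following is a minimal pure-Python sketch of plain GMSD as it is commonly defined: Prewitt gradient magnitudes, a pixel-wise similarity map, and the standard deviation of that map. The function names are hypothetical, the constant `c=170.0` assumes 8-bit pixel values, and the 2x average-pool downsampling used in the reference GMSD implementation is omitted for brevity.

```python
import math

# 3x3 Prewitt kernels scaled by 1/3, as in the usual GMSD definition.
PREWITT_X = [[1 / 3, 0.0, -1 / 3]] * 3
PREWITT_Y = [[1 / 3, 1 / 3, 1 / 3], [0.0, 0.0, 0.0], [-1 / 3, -1 / 3, -1 / 3]]


def gradient_magnitude(img):
    """Gradient magnitude over the interior pixels of a 2D list of numbers."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += PREWITT_X[dy][dx] * p
                    gy += PREWITT_Y[dy][dx] * p
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def gmsd(ref, dist, c=170.0):
    """GMSD between two equally sized grayscale images with values in [0, 255].

    0.0 means the gradient structure is identical; larger values mean more
    uneven gradient degradation across the image.
    """
    m_r = gradient_magnitude(ref)
    m_d = gradient_magnitude(dist)
    # Pixel-wise gradient magnitude similarity map, flattened.
    gms = [
        (2 * a * b + c) / (a * a + b * b + c)
        for row_r, row_d in zip(m_r, m_d)
        for a, b in zip(row_r, row_d)
    ]
    mean = sum(gms) / len(gms)
    return math.sqrt(sum((g - mean) ** 2 for g in gms) / len(gms))


if __name__ == "__main__":
    reference = [[0, 0, 128, 255, 255] for _ in range(5)]
    flattened = [[100] * 5 for _ in range(5)]  # all edges destroyed
    print(gmsd(reference, reference))  # identical images
    print(gmsd(reference, flattened))  # degraded gradients
```

In a rate control setting, a score like this would be computed per CTU between the original and reconstructed blocks. How the paper turns it into a utility for the Nash equilibrium formulation is not stated in the abstract.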
Pub Date: 2025-12-24 DOI: 10.1016/j.jvcir.2025.104687
Krishna Kumar Singh, K. Hima Bindu
Deep neural networks often excel at specific tasks, but face significant challenges with catastrophic forgetting when learning new classes incrementally. Recent state-of-the-art approaches in Few-Shot Class-Incremental Learning (FSCIL) predominantly utilize a decoupled classifier functioning as a prototype-only head after pre-training. The prototypes for new classes, derived from limited instances, tend to exhibit higher bias, resulting in poor performance on incremental classes. Recent methods have approached this issue through various prototype calibrations or classifier fine-tuning. However, they yield limited improvements during incremental sessions.
This paper presents a novel approach to mitigate bias through Feature Augmentation and an entropy-weighted Logits Mix-up (FALM) method. The embedding derived from the final layer of the feature extractor is task-specific and may overlook certain features of unseen classes. This work incorporates a missing-pass filter applied to the features of an intermediate layer to augment the embedding of the final layer. Additionally, a logits mix-up is employed to reduce bias further by applying an entropy-weighted combination of logits resulting from three separate heads. Experiments on the miniImageNet, CIFAR100, and CUB200 datasets show that FALM achieves better performance compared to state-of-the-art (SOTA) models.
Title: Mitigating bias in Few Shot Class Incremental Learning with Feature Augmentation and Logits Mix-up. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104687.
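The entropy-weighted logits mix-up described in the FALM abstract can be illustrated with a small sketch. The abstract does not give the weighting formula, so the choice below (weight proportional to one minus the normalized prediction entropy of each head) is an assumption, and all function names are hypothetical. The idea shown is only that heads with more confident, lower-entropy predictions contribute more to the combined logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_mixup(head_logits):
    """Combine per-head logits; lower-entropy heads receive larger weights.

    head_logits: list of logit vectors, one per head, all of the same
    length (at least two classes). Returns (mixed_logits, weights).
    """
    n_classes = len(head_logits[0])
    max_h = math.log(n_classes)  # entropy of the uniform distribution
    # Assumed confidence score: 1 - normalized entropy, clamped at zero.
    conf = [max(0.0, 1.0 - entropy(softmax(l)) / max_h) for l in head_logits]
    total = sum(conf)
    if total <= 0.0:
        # Every head is maximally uncertain: fall back to equal weights.
        weights = [1.0 / len(head_logits)] * len(head_logits)
    else:
        weights = [c / total for c in conf]
    mixed = [
        sum(w * l[k] for w, l in zip(weights, head_logits))
        for k in range(n_classes)
    ]
    return mixed, weights


if __name__ == "__main__":
    heads = [[5.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
    mixed, weights = entropy_weighted_mixup(heads)
    print(weights)
    print(mixed)
```

Three heads are used in the example because the abstract mentions three separate heads. Which heads those are and how their entropies are actually computed are details only the full paper can confirm.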
Pub Date: 2025-12-22 DOI: 10.1016/j.jvcir.2025.104681
Shilin Li, Hao Wang, Anna Zhu
Image texture transfer is pivotal in computer vision, holding extensive application potential. Existing methods typically transfer color alongside texture, lacking inherent color preservation and thus requiring a cumbersome two-stage process: color alignment followed by style transfer. The recent emergence of diffusion models has significantly advanced this field; however, current diffusion-based approaches usually necessitate additional training. To address this, we propose FFTDiff, a novel texture transfer model leveraging pre-trained diffusion models and the Fast Fourier Transform (FFT), eliminating extra training requirements. FFTDiff disentangles texture from content and color within the frequency domain, independently extracting texture from reference images while preserving original colors and semantics. This extracted texture is then seamlessly integrated into the content image within the diffusion model’s latent space during denoising. Comprehensive experimental results demonstrate FFTDiff’s effectiveness, highlighting its capability for realistic, aesthetically pleasing texture transfer without compromising the original semantic content or color integrity.
Title: FFTDiff: Tuning-free image texture transfer based on diffusion model. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104681.
Pub Date: 2025-12-19 DOI: 10.1016/j.jvcir.2025.104683
Fenglin Man, Chaofeng Li, Tuxin Guan
This paper introduces an infrared remote sensing small ship target detection network that integrates spatial-semantic enhancement with a feature reconstruction neck. The spatial-semantic enhancement module is designed to selectively amplify diverse information within the backbone, effectively addressing the challenge of false alarms. It employs a parameter-free attention mechanism to improve the positional information of shallow feature maps and utilizes dynamic dilated convolution to capture contextual information from deep feature maps. Additionally, the feature reconstruction neck consolidates multi-scale feature maps across the spatial and channel dimensions separately, incorporating attention adjacent-layer concatenation to minimize missed detections. To further enhance detection capabilities, we have developed an infrared ship detection head that leverages the advantages of the DINO decoder while accounting for the size characteristics of infrared small targets. Experimental results on the public infrared ship detection dataset (ISDD) indicate that our approach surpasses several state-of-the-art methods, demonstrating superior detection performance both qualitatively and quantitatively.
Title: Infrared remote sensing small ship target detection method based on spatial-semantic enhancement and feature reconstruction neck. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104683.
Pub Date: 2025-12-19 DOI: 10.1016/j.jvcir.2025.104692
Daojie Zhou, Yihan Li, Yi Li
This paper proposes a multi-view fusion-based recognition algorithm to address the imbalanced recognition accuracy of existing 3D model multi-view recognition methods. This imbalance arises from their failure to account for inter-view feature consistency and their use of inefficient fusion strategies. The proposed algorithm employs a dual-branch feature extraction network and a multi-label loss function to enforce the learning of consistent features across different views of the same model. Concurrently, a multi-level Gated Recurrent Unit (GRU) fusion network is constructed to efficiently integrate high-dimensional features from various levels and temporal information across multiple views. Simulation results demonstrate that the proposed algorithm achieves highly competitive recognition accuracy on mainstream benchmark datasets. Furthermore, it exhibits more balanced performance across different categories, thereby showing enhanced stability and robustness.
Title: Multi-view 3D model recognition via multi-label and multi-level fusion with bidirectional GRU. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104692.
The spread of deepfakes poses significant security concerns, demanding reliable detection methods. However, diverse generation techniques and class imbalance in datasets create challenges. We propose CAE-Net, a Convolution- and Attention-based weighted Ensemble network combining spatial and frequency-domain features for effective deepfake detection. The architecture integrates EfficientNet, Data-Efficient Image Transformer (DeiT), and ConvNeXt with wavelet features to learn complementary representations. We evaluated CAE-Net on the diverse IEEE Signal Processing Cup 2025 (DF-Wild Cup) dataset, which has a 5:1 fake-to-real class imbalance. To address this, we introduce a multistage disjoint-subset training strategy, sequentially training the model on non-overlapping subsets of the fake class while retaining knowledge across stages. Our approach achieved 94.46% accuracy and a 97.60% AUC, outperforming conventional class-balancing methods. Visualizations confirm the network focuses on meaningful facial regions, and our ensemble design demonstrates robustness against adversarial attacks, positioning CAE-Net as a dependable and generalized deepfake detection framework.
Title: CAE-Net: Generalized deepfake image detection using convolution and attention mechanisms with spatial and frequency domain features. Authors: Anindya Bhattacharjee, Kaidul Islam, Kafi Anan, Ashir Intesher, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz. Published in Journal of Visual Communication and Image Representation, vol. 115, Article 104679. Pub Date: 2025-12-19 DOI: 10.1016/j.jvcir.2025.104679
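The multistage disjoint-subset training strategy in the CAE-Net abstract can be pictured with a short data-splitting sketch. The abstract states only that the fake class is divided into non-overlapping subsets used sequentially, so the details below are assumptions: each stage pairs one fake subset with the full real class, and the number of stages is derived from the class ratio. The function name `disjoint_stages` is hypothetical.

```python
import random


def disjoint_stages(fake_items, real_items, seed=0):
    """Split the majority (fake) class into non-overlapping stage subsets.

    Each stage receives one fake subset of roughly the same size as the
    real class, plus all real items, so every stage is approximately
    balanced. Assumed scheme, not the paper's confirmed procedure.
    """
    if not real_items:
        raise ValueError("real_items must not be empty")
    rng = random.Random(seed)
    shuffled = list(fake_items)
    rng.shuffle(shuffled)
    # A 5:1 imbalance yields 5 stages.
    n_stages = max(1, round(len(shuffled) / len(real_items)))
    stages = []
    for i in range(n_stages):
        # Strided slices are pairwise disjoint and together cover all items.
        chunk = shuffled[i::n_stages]
        stages.append({"fake": chunk, "real": list(real_items)})
    return stages


if __name__ == "__main__":
    fake = [f"fake_{i}" for i in range(50)]
    real = [f"real_{i}" for i in range(10)]
    for idx, stage in enumerate(disjoint_stages(fake, real)):
        print(idx, len(stage["fake"]), len(stage["real"]))
```

In a training loop, the same model would be carried from one stage to the next rather than re-initialized, which is how knowledge is retained across stages in the sense the abstract describes.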