Pub Date : 2025-09-25 DOI: 10.1016/j.jvcir.2025.104590
Jielin Jiang, Quan Zhang, Yan Cui, Shun Wei, Yingnan Zhao
Recent MLP-based models have employed axial projections to orthogonally decompose the entire space into horizontal and vertical directions, effectively balancing long-range dependencies and computational costs. However, such methods operate independently along the two axes, hindering their ability to capture the image's global spatial structure. In this paper, we propose a novel MLP architecture called Cross-Axis gated MLP (CAgMLP), which consists of two main modules: a Cross-Axis Gated Token-Mixing MLP (CGTM) and a Convolutional Gated Channel-Mixing MLP (CGCM). CGTM addresses the loss of information from single-dimensional interactions by leveraging a multiplicative gating mechanism that cross-fuses the features captured along the two spatial axes, enhancing feature selection and information flow. CGCM refines the dual-branch structure of the multiplicative gating units by projecting the fused low-dimensional input into two high-dimensional feature spaces and introducing non-linearity through element-wise multiplication, further improving the model's expressive ability. Finally, both modules incorporate local token aggregation to compensate for the lack of local inductive bias in traditional MLP models. Experiments conducted on several datasets demonstrate that CAgMLP achieves superior classification performance compared to other state-of-the-art methods, while exhibiting fewer parameters and lower computational complexity.
{"title":"CAgMLP: An MLP-like architecture with a Cross-Axis gated token mixer for image classification","authors":"Jielin Jiang , Quan Zhang , Yan Cui , Shun Wei , Yingnan Zhao","doi":"10.1016/j.jvcir.2025.104590","DOIUrl":"10.1016/j.jvcir.2025.104590","url":null,"abstract":"<div><div>Recent MLP-based models have employed axial projections to orthogonally decompose the entire space into horizontal and vertical directions, effectively balancing long-range dependencies and computational costs. However, such methods operate independently along the two axes, hindering their ability to capture the image’s global spatial structure. In this paper, we propose a novel MLP architecture called Cross-Axis gated MLP (CAgMLP), which consists of two main modules, Cross-Axis Gated Token-Mixing MLP (CGTM) and Convolutional Gated Channel-Mixing MLP (CGCM). CGTM addresses the loss of information from single-dimensional interactions by leveraging a multiplicative gating mechanism that facilitates the cross-fusion of features captured along the two spatial axes, enhancing feature selection and information flow. CGCM improves the dual-branch structure of the multiplicative gating units by projecting the fused low-dimensional input into two high-dimensional feature spaces and introducing non-linear features through element-wise multiplication, further improving the model’s expressive ability. Finally, both modules incorporate local token aggregation to compensate for the lack of local inductive bias in traditional MLP models. Experiments conducted on several datasets demonstrate that CAgMLP achieves superior classification performance compared to other state-of-the-art methods, while exhibiting fewer parameters and lower computational complexity.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104590"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-25 DOI: 10.1016/j.jvcir.2025.104587
Yan Cheng, Xiong Li, Xin Zhang, Chaohong Yang
Advanced editing tools and deepfakes make image tampering harder to detect, threatening image security, credibility, and personal privacy. To address this challenging issue, we propose a novel end-to-end image forgery localization method based on curiosity-driven deep reinforcement learning with intrinsic rewards. The proposed method provides reliable localization of forged regions across a variety of forgery types. We design a new Focal-based reward function suited to scenarios with highly imbalanced numbers of forged and real pixels. Furthermore, to counter the sparse rewards caused by sparse forgery regions in real-world scenarios, we introduce a surprise-based intrinsic reward generation module that guides the agent to explore and learn the optimal strategy. Extensive experiments on multiple benchmark datasets show that the proposed method outperforms other methods in pixel-level forgery localization. Additionally, the proposed method remains robust to image degradation caused by different post-processing attacks.
{"title":"Image forgery localization with sparse reward compensation using curiosity-driven deep reinforcement learning","authors":"Yan Cheng , Xiong Li , Xin Zhang , Chaohong Yang","doi":"10.1016/j.jvcir.2025.104587","DOIUrl":"10.1016/j.jvcir.2025.104587","url":null,"abstract":"<div><div>Advanced editing and deepfakes make image tampering harder to detect, threatening image security, credibility, and personal privacy. To address this challenging issue, we propose a novel end-to-end image forgery localization method, based on the curiosity-driven deep reinforcement learning method with intrinsic reward. The proposed method provides reliable localization results for forged regions in images of various types of forgery. This study designs a new Focal-based reward function that is suitable for scenarios with highly imbalanced numbers of forged and real pixels. Furthermore, considering the issue of sparse rewards caused by sparse forgery regions in real-world forgery scenarios, we introduce a surprise-based intrinsic reward generation module, which guides the agent to explore and learn the optimal strategy. Extensive experiments conducted on multiple benchmark datasets show that the proposed method outperforms other methods in pixel-level forgery localization. Additionally, the proposed method demonstrates stable robustness to image degradation caused by different post-processing attacks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104587"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-25 DOI: 10.1016/j.jvcir.2025.104591
Seema Kumari, Srimanta Mandal, Shanmuganathan Raman
Point clouds are the predominant data structure for representing 3D shapes. However, captured point clouds are often partial due to practical constraints, necessitating point cloud completion. In this paper, we propose a novel deep network architecture that preserves the structure of available points while incorporating coarse-to-fine information to generate dense and consistent point clouds. Our network comprises three sub-networks: Coarse-to-Fine, Structure, and Tail. The Coarse-to-Fine sub-net extracts multi-scale features, while the Structure sub-net utilizes a stacked auto-encoder with weighted skip connections to preserve structural information. The fused features are then processed by the Tail sub-net to produce a dense point cloud. Additionally, we demonstrate the effectiveness of our structure-preserving approach in point cloud classification by proposing a classification architecture based on the Structure sub-net. Experimental results show that our method outperforms existing approaches in both tasks, highlighting the importance of preserving structural information and incorporating coarse-to-fine details.
{"title":"Structure preserving point cloud completion and classification with coarse-to-fine information","authors":"Seema Kumari , Srimanta Mandal , Shanmuganathan Raman","doi":"10.1016/j.jvcir.2025.104591","DOIUrl":"10.1016/j.jvcir.2025.104591","url":null,"abstract":"<div><div>Point clouds are the predominant data structure for representing 3D shapes. However, captured point clouds are often partial due to practical constraints, necessitating point cloud completion. In this paper, we propose a novel deep network architecture that preserves the structure of available points while incorporating coarse-to-fine information to generate dense and consistent point clouds. Our network comprises three sub-networks: Coarse-to-Fine, Structure, and Tail. The Coarse-to-Fine sub-net extracts multi-scale features, while the Structure sub-net utilizes a stacked auto-encoder with weighted skip connections to preserve structural information. The fused features are then processed by the Tail sub-net to produce a dense point cloud. Additionally, we demonstrate the effectiveness of our structure-preserving approach in point cloud classification by proposing a classification architecture based on the Structure sub-net. Experimental results show that our method outperforms existing approaches in both tasks, highlighting the importance of preserving structural information and incorporating coarse-to-fine details.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104591"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-25 DOI: 10.1016/j.jvcir.2025.104593
Bin Wang, Jiajia Hu, Fengyuan Zuo, Junfei Shi, Haiyan Jin
In image-denoising tasks, diffusion models have shown great potential. A diffusion model is usually trained on a dataset of clean, noise-free images of real scenes as the starting point of diffusion. When the denoising network trained on such a dataset is applied to images from other scenes, its generalization drops because the scene priors change. To improve generalization, we therefore look for a clean image dataset that retains rich scene priors while remaining largely scene-independent. VGG-16 is trained on a large number of images; when real-scene images are passed through its early convolutional layers, the resulting shallow feature maps keep the scene priors but shed the scene dependency caused by fine details. This paper uses the shallow feature maps of VGG-16 as the clean image dataset for the diffusion model, and the denoising results are remarkable. Furthermore, since image noise mainly consists of Gaussian and Poisson components while the classical diffusion model diffuses with Gaussian noise only, we introduce a novel Poisson–Gaussian noise mixture for the diffusion process, along with its theoretical derivation, which also improves the interpretability of the model. Finally, we propose a Poisson–Gaussian Denoising Mixture Diffusion Model based on Feature maps (F-MDM). Experiments demonstrate that our method exhibits excellent generalization ability compared with other advanced algorithms.
{"title":"F-MDM: Rethinking image denoising with a feature map-based Poisson–Gaussian Mixture Diffusion Model","authors":"Bin Wang, Jiajia Hu, Fengyuan Zuo, Junfei Shi, Haiyan Jin","doi":"10.1016/j.jvcir.2025.104593","DOIUrl":"10.1016/j.jvcir.2025.104593","url":null,"abstract":"<div><div>In image-denoising tasks, the diffusion model has shown great potential. Usually, the diffusion model uses a real scene’s noise-free and clean image dataset as the starting point for diffusion. When the denoising network trained on this dataset is applied to image denoising in other scenes, the generalization of the denoising network will decrease due to changes in scene priors. In order to improve generalization, we hope to find a clean image dataset that not only has rich scene priors but also has a certain scene independence. The VGG-16 network is a network trained from a large number of images. After the real scene images are processed through the VGG-16 convolution layer, the shallow feature maps obtained have scene priors and break free from the scene dependency caused by minor details. This paper uses the shallow feature maps of VGG-16 as a clean image dataset for the diffusion model, and the results of denoising experiments are surprising. Furthermore, considering that the noise of the image mainly includes Gaussian noise and Poisson noise, the classical diffusion model uses Gaussian noise for diffusion to improve the interpretability of the model. We introduce a novel Poisson–Gaussian noise mixture for the diffusion process, and the theoretical derivation is given. Finally, we propose a Poisson–Gaussian Denoising <strong>M</strong>ixture <strong>D</strong>iffusion <strong>M</strong>odel based on <strong>F</strong>eature maps (<strong>F-MDM</strong>). Experiments demonstrate that our method exhibits excellent generalization ability compared to some other advanced algorithms.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104593"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-22 DOI: 10.1016/j.jvcir.2025.104585
Jia Huang, Wei Quan, Xiwen Li
Visual anomaly detection includes image anomaly detection and video anomaly detection, focusing on identifying and locating anomalous patterns or events in images or videos. This technology finds widespread applications across multiple domains, including industrial surface defect inspection, medical image lesion analysis, and security surveillance systems. By identifying patterns that do not conform to normal conditions, it helps to detect anomalies in a timely manner and reduce risks and losses. This paper provides a comprehensive review of existing visual anomaly detection algorithms. It introduces a taxonomy of algorithms from a new perspective: statistical-based algorithms, measurement-based algorithms, generative-based algorithms, and representation-based algorithms. Furthermore, this paper systematically introduces datasets for visual anomaly detection and compares the performance of various algorithms on different datasets under typical evaluation metrics. By analyzing existing algorithms, we identify current challenges and suggest promising future research directions.
{"title":"Visual anomaly detection algorithms: Development and Frontier review","authors":"Jia Huang, Wei Quan, Xiwen Li","doi":"10.1016/j.jvcir.2025.104585","DOIUrl":"10.1016/j.jvcir.2025.104585","url":null,"abstract":"<div><div>Visual anomaly detection includes image anomaly detection and video anomaly detection, focusing on identifying and locating anomalous patterns or events in images or videos. This technology finds widespread applications across multiple domains, including industrial surface defect inspection, medical image lesion analysis, and security surveillance systems. By identifying patterns that do not conform to normal conditions, it helps to detect anomalies in a timely manner and reduce risks and losses. This paper provides a comprehensive review of existing visual anomaly detection algorithms. It introduces a taxonomy of algorithms from a new perspective: statistical-based algorithms, measurement-based algorithms, generative-based algorithms, and representation-based algorithms. Furthermore, this paper systematically introduces datasets for visual anomaly detection and compares the performance of various algorithms on different datasets under typical evaluation metrics. By analyzing existing algorithms, we identify current challenges and suggest promising future research directions.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104585"},"PeriodicalIF":3.1,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-22 DOI: 10.1016/j.jvcir.2025.104576
Jingfeng Tang, Keyang Cheng, Liutao Wei, Yongzhao Zhan
In recent years, Vision Transformer-based methods have emerged as promising approaches for localizing semantic objects in weakly supervised semantic segmentation tasks. However, existing methods primarily rely on the attention mechanism to establish relations between classes and image patches, often neglecting the intrinsic interrelations among tokens across a dataset. To address this gap, we propose the Inter-image Token Relation Learning (ITRL) framework, which advances weakly supervised semantic segmentation through inter-image consistency. Specifically, an Inter-image Class Token Contrast method generates comprehensive class representations by contrasting class tokens against a memory bank. Additionally, an Inter-image Patch Token Align approach increases the normalized mutual information among patch tokens, thereby strengthening their interdependencies. Extensive experiments validate the proposed framework, showing competitive mean Intersection over Union scores on the PASCAL VOC 2012 and MS COCO 2014 datasets.
{"title":"Inter-image Token Relation Learning for weakly supervised semantic segmentation","authors":"Jingfeng Tang, Keyang Cheng, Liutao Wei, Yongzhao Zhan","doi":"10.1016/j.jvcir.2025.104576","DOIUrl":"10.1016/j.jvcir.2025.104576","url":null,"abstract":"<div><div>In recent years, Vision Transformer-based methods have emerged as promising approaches for localizing semantic objects in weakly supervised semantic segmentation tasks. However, existing methods primarily rely on the attention mechanism to establish relations between classes and image patches, often neglecting the intrinsic interrelations among tokens within datasets. To address this gap, we propose the Inter-image Token Relation Learning (ITRL) framework, which advances weakly supervised semantic segmentation by inter-image consistency. Specifically, the Inter-image Class Token Contrast method is introduced to generate comprehensive class representations by contrasting class tokens in a memory bank manner. Additionally, the Inter-image Patch Token Align approach is presented, which enhances the normalized mutual information among patch tokens, thereby strengthening their interdependencies. Extensive experiments validated the proposed framework, showcasing competitive mean Intersection over Union scores on the PASCAL VOC 2012 and MS COCO 2014 datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104576"},"PeriodicalIF":3.1,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-22 DOI: 10.1016/j.jvcir.2025.104586
Wenxiao Cai, Xinyue Lei, Xinyu He, Junming Leo Chen, Yuzhi Hao, Yangang Wang
We introduce Knowledge NeRF, a few-shot framework for novel-view synthesis of dynamic articulated objects. Conventional dynamic-NeRF methods learn a deformation field from long monocular videos, yet they degrade sharply when only sparse observations are available. Our key idea is to reuse a high-quality, pose-specific NeRF as a knowledge base and to learn a lightweight projection module for each new pose that maps 3D points in the current state to their canonical counterparts. By freezing the pretrained radiance field and training only this module with five input images, Knowledge NeRF renders novel views whose fidelity matches a NeRF trained with one hundred images. Experimental results demonstrate the effectiveness of our method in reconstructing dynamic 3D scenes from five input images of a single state. Knowledge NeRF is a new pipeline and a promising solution for novel view synthesis of dynamic articulated objects. The data and implementation will be publicly available at: https://github.com/RussRobin/Knowledge_NeRF.
{"title":"Knowledge NeRF: Few-shot novel view synthesis for dynamic articulated objects","authors":"Wenxiao Cai , Xinyue Lei , Xinyu He , Junming Leo Chen , Yuzhi Hao , Yangang Wang","doi":"10.1016/j.jvcir.2025.104586","DOIUrl":"10.1016/j.jvcir.2025.104586","url":null,"abstract":"<div><div>We introduce Knowledge NeRF, a few-shot framework for novel-view synthesis of dynamic articulated objects. Conventional dynamic-NeRF methods learn a deformation field from long monocular videos, yet they degrade sharply when only sparse observations are available. Our key idea is to reuse a high-quality, pose-specific NeRF as a knowledge base and learn a lightweight projection module for each new pose that maps 3-D points in the current state to their canonical counterparts. By freezing the pretrained radiance field and training only this module with five input images, Knowledge NeRF renders novel views whose fidelity matches a NeRF trained with one hundred images. Experimental results demonstrate the effectiveness of our method in reconstructing dynamic 3D scenes with 5 input images in one state. Knowledge NeRF is a new pipeline and a promising solution for novel view synthesis in dynamic articulated objects. The data and implementation will be publicly available at: <span><span>https://github.com/RussRobin/Knowledge_NeRF</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104586"},"PeriodicalIF":3.1,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-19 DOI: 10.1016/j.jvcir.2025.104589
Lichun Yang, Jianghao Wu, Hongguang Li, Chunlei Liu, Shize Wei
This paper presents a novel multi-task learning framework for joint airport runway segmentation and line detection, addressing two key challenges in aircraft visual navigation: (1) edge detection for sub-5%-pixel targets and (2) computational inefficiencies in existing methods. Our contributions include: (i) ENecNet, a lightweight yet powerful encoder that boosts small-target detection IoU by 15.5% through optimized channel expansion and architectural refinement; (ii) a dual-decoder design with task-specific branches for area segmentation and edge line detection; and (iii) a dynamically weighted multi-task loss function to ensure balanced training. Extensive evaluations on the RDD5000 dataset show state-of-the-art performance with 0.9709 segmentation IoU and 0.6256 line detection IoU at 38.4 FPS. The framework also demonstrates robust performance (0.9513–0.9664 IoU) across different airports and challenging conditions such as nighttime, smog, and mountainous terrain, proving its suitability for real-time onboard navigation systems.
{"title":"Joint airport runway segmentation and line detection via multi-task learning for intelligent visual navigation","authors":"Lichun Yang , Jianghao Wu , Hongguang Li , Chunlei Liu , Shize Wei","doi":"10.1016/j.jvcir.2025.104589","DOIUrl":"10.1016/j.jvcir.2025.104589","url":null,"abstract":"<div><div>This paper presents a novel multi-task learning framework for joint airport runway segmentation and line detection, addressing two key challenges in aircraft visual navigation: (1) edge detection for sub-5 %-pixel targets and (2) computational inefficiencies in existing methods. Our contributions include: (i) ENecNet, a lightweight yet powerful encoder that boosts small-target detection IoU by 15.5 % through optimized channel expansion and architectural refinement; (ii) a dual-decoder design with task-specific branches for area segmentation and edge line detection; and (iii) a dynamically weighted multi-task loss function to ensure balanced training. Extensive evaluations on the RDD5000 dataset show state-of-the-art performance with 0.9709 segmentation IoU and 0.6256 line detection IoU at 38.4 FPS. The framework also demonstrates robust performance (0.9513–0.9664 IoU) across different airports and challenging conditions such as nighttime, smog, and mountainous terrain, proving its suitability for real-time onboard navigation systems.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104589"},"PeriodicalIF":3.1,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-18 DOI: 10.1016/j.jvcir.2025.104580
Hao Kong, Zi-Ming Wu, Bin Yan, Jeng-Shyang Pan, Hong-Mei Yang
Existing meaningful secret sharing schemes for 3D models face the issue of model extension. To address this problem, we propose a non-extended secret 3D mesh sharing scheme. Considering the large amount of data that must be shared in a 3D model, we design a circuit structure to accelerate the computation during sharing. In the sharing stage, vertex data is encoded, converting floating-point values into integers, which is better suited to computation on an FPGA. By adjusting the encoding length, multiple secrets can be embedded in the vertex encoding stage, which solves the extension problem of the scheme. Experiments were conducted on a set of 3D meshes to compare the differences between the cover models and the shares; the results show that the shares maintain high fidelity to the cover meshes. Furthermore, the FPGA implementation achieves a throughput of 675 Mbit/s. Simulation results show that the parallel circuit structure is 30 times faster than the serial structure. In terms of resource consumption, the circuit structure designed in this scheme occupies less than 5% of the on-chip resources.
{"title":"A non-extended 3D mesh secret sharing scheme adapted for FPGA processing","authors":"Hao Kong , Zi-Ming Wu , Bin Yan , Jeng-Shyang Pan , Hong-Mei Yang","doi":"10.1016/j.jvcir.2025.104580","DOIUrl":"10.1016/j.jvcir.2025.104580","url":null,"abstract":"<div><div>The existing meaningful secret sharing schemes for 3D model face the issue of model extension. To address this problem, we propose a non-extended secret 3D mesh sharing scheme. Considering the large amount of data that needs to be shared in a 3D model, we designed a circuit structure to accelerate the computation during sharing. In the sharing stage, vertex data is encoded and converted to integer data from floating-point data. This is more conducive to handling the computation in FPGA. By adjusting the length of the encoding, multiple secrets can be embedded in the vertex encoding stage. This solves the extension problem of the scheme. Experiments were conducted on a set of 3D meshes to compare the differences between the cover models and the shares. This experimental result shows that the shares maintain high fidelity with the cover meshes. Furthermore, the FPGA implementation achieves a throughput of 675Mbit/s. Simulation results show that the parallel circuit structure is 30 times faster than the serial structure. In terms of resource consumption, the circuit structure designed in this scheme occupies less than 5% of the on-chip resources.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104580"},"PeriodicalIF":3.1,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145121127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-17 DOI: 10.1016/j.jvcir.2025.104584
Yanlei Wei, Yongping Wang, Xiaolin Zhang, Jingyu Wang, Lixin Liu
The emergence of a large number of adversarial samples has exposed the vulnerabilities of Deep Neural Networks (DNNs). With the rise of diffusion models, their powerful denoising capabilities have made them a popular strategy for adversarial defense. Diffusion-based defenses are effective against simple adversarial attacks; however, their effectiveness diminishes when facing more sophisticated and complex attacks. To address this issue, this paper proposes a method called Adaptive Guided Denoising Diffusion (AGDD), which can effectively defend against adversarial attacks. Specifically, we first apply a small noise perturbation to the given adversarial samples, performing the forward diffusion process. Then, in the reverse denoising phase, the diffusion model is guided by the adaptive guided formula g_AG to perform denoising. At the same time, g_AG is adjusted according to the adaptive matrix G_t and the residual r_t. Additionally, we introduce a momentum factor m to further optimize the denoising process, reduce the oscillations caused by gradient variations, and enhance the stability and convergence of the optimization. Through AGDD, the denoised images accurately reconstruct the characteristics of the original observations (i.e., the unperturbed images) and exhibit strong robustness and adaptability across diverse noise conditions. Extensive experiments on the ImageNet dataset using Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures demonstrate that the proposed method exhibits superior robustness against adversarial attacks, with classification accuracy reaching 87.4% for CNN and 85.9% for ViT, surpassing other state-of-the-art defense techniques.
{"title":"Defending against adversarial attacks via an Adaptive Guided Denoising Diffusion model","authors":"Yanlei Wei , Yongping Wang , Xiaolin Zhang , Jingyu Wang , Lixin Liu","doi":"10.1016/j.jvcir.2025.104584","DOIUrl":"10.1016/j.jvcir.2025.104584","url":null,"abstract":"<div><div>The emergence of a large number of adversarial samples has exposed the vulnerabilities of Deep Neural Networks (DNNs). With the rise of diffusion models, their powerful denoising capabilities have made them a popular strategy for adversarial defense. The defense capability of diffusion models is effective against simple adversarial attacks; however, their effectiveness diminishes when facing more sophisticated and complex attacks. To address this issue, this paper proposes a method called Adaptive Guided Denoising Diffusion (AGDD), which can effectively defend against adversarial attacks. Specifically, we first apply a small noise perturbation to the given adversarial samples, performing the forward diffusion process. Then, in the reverse denoising phase, the diffusion model is guided by the adaptive guided formula <span><math><msub><mrow><mi>g</mi></mrow><mrow><mi>A</mi><mi>G</mi></mrow></msub></math></span> to perform denoising. At the same time, the adaptive guided formula <span><math><msub><mrow><mi>g</mi></mrow><mrow><mi>A</mi><mi>G</mi></mrow></msub></math></span> is adjusted according to the adaptive matrix <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>t</mi></mrow></msub></math></span> and the residual <span><math><msub><mrow><mi>r</mi></mrow><mrow><mi>t</mi></mrow></msub></math></span>. Additionally, we introduced a momentum factor <span><math><mi>m</mi></math></span> to further optimize the denoising process, reduce the oscillations caused by gradient variations, and enhance the stability and convergence of the optimization process. Through AGDD, the denoised images accurately reconstruct the characteristics of the original observations (i.e., the unperturbed images) and exhibit strong robustness and adaptability across diverse noise conditions. Extensive experiments on the ImageNet dataset using Convolutional Neural Networks (CNN) and Vision Transformer (ViT) architectures demonstrate that the proposed method exhibits superior robustness against adversarial attacks, with classification accuracy reaching 87.4% for CNN and 85.9% for ViT, surpassing other state-of-the-art defense techniques.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104584"},"PeriodicalIF":3.1,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145121128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}