Pub Date : 2025-12-11 DOI: 10.1016/j.imavis.2025.105871
Jia Yuan, Jun Chen, Chongshou Li, Pedro Alonso, Xinke Li, Tianrui Li
Adversarial attacks on 3D point clouds are increasingly critical for safety-sensitive domains like autonomous driving. Most existing methods ignore local geometric structure, yielding perturbations that harm imperceptibility and geometric consistency. We introduce LoGA-Attack, a local geometry-aware adversarial attack that exploits topological and geometric cues to craft refined perturbations. A Neighborhood Centrality (NC) score partitions points into contour and flat point sets. Contour points receive gradient-based iterative updates to maximize attack strength, while flat points use an Optimal Neighborhood-based Attack (ONA) that projects gradients onto the most consistent local geometric direction. Experiments on ModelNet40 and ScanObjectNN show higher attack success with lower perceptual distortion, demonstrating superior performance and strong transferability. Our code is available at: https://github.com/yuanjiachn/LoGA-Attack.
{"title":"LoGA-Attack: Local geometry-aware adversarial attack on 3D point clouds","authors":"Jia Yuan , Jun Chen , Chongshou Li , Pedro Alonso , Xinke Li , Tianrui Li","doi":"10.1016/j.imavis.2025.105871","DOIUrl":"10.1016/j.imavis.2025.105871","url":null,"abstract":"<div><div>Adversarial attacks on 3D point clouds are increasingly critical for safety-sensitive domains like autonomous driving. Most existing methods ignore local geometric structure, yielding perturbations that harm imperceptibility and geometric consistency. We introduce local geometry-aware adversarial attack (LoGA-Attack), a local geometry-aware approach that exploits topological and geometric cues to craft refined perturbations. A Neighborhood Centrality (NC) score partitions points into contour and flat points sets. Contour points receive gradient-based iterative updates to maximize attack strength, while flat points use an Optimal Neighborhood-based Attack (ONA) that projects gradients onto the most consistent local geometric direction. Experiments on ModelNet40 and ScanObjectNN show higher attack success with lower perceptual distortion, demonstrating superior performance and strong transferability. Our code is available at: <span><span>https://github.com/yuanjiachn/LoGA-Attack</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105871"},"PeriodicalIF":4.2,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-08 DOI: 10.1016/j.imavis.2025.105869
Donglin Zhu, Zhongli Wang, Xiaoyang Fan, Miao Chen, Jiuyu Chen
Real-time and high-quality scene reconstruction remains a critical challenge for robotics applications. 3D Gaussian Splatting (3DGS) demonstrates remarkable capabilities in scene rendering. However, its integration with SLAM systems confronts two critical limitations: (1) slow pose tracking caused by repeated full-frame rendering, and (2) susceptibility to environmental variations such as illumination changes and motion blur. To alleviate these issues, this paper proposes a Gaussian landmark-based real-time reconstruction framework, GLT-SLAM, which consists of a ray-casting-driven tracking module, a multi-modal keyframe selector, and an incremental geometric–photometric mapping module. To avoid redundant rendering computations, the tracking module achieves efficient 3D-2D correspondence by encoding Gaussian landmark-emitted rays and fusing attention scores. Furthermore, to enhance the framework's robustness against complex environmental conditions, the keyframe selector balances multiple influencing factors, including image quality, tracking uncertainty, information entropy, and feature overlap ratios. Finally, to achieve a compact map representation, the mapping module adds only Gaussian primitives of points, lines, and planes, and performs global map optimization through joint photometric–geometric constraints. Experimental results on the Replica, TUM RGB-D, and BJTU datasets demonstrate that the proposed method achieves a real-time processing rate of over 30 Hz on a platform with an NVIDIA RTX 3090, achieving 19% higher efficiency than the fastest Photo-SLAM method while significantly outperforming other baseline methods in both localization and mapping accuracy. The source code will be available on GitHub.
{"title":"Gaussian landmarks tracking-based real-time splatting reconstruction model","authors":"Donglin Zhu, Zhongli Wang, Xiaoyang Fan, Miao Chen, Jiuyu Chen","doi":"10.1016/j.imavis.2025.105869","DOIUrl":"10.1016/j.imavis.2025.105869","url":null,"abstract":"<div><div>Real-time and high-quality scene reconstruction remains a critical challenge for robotics applications. 3D Gaussian Splatting (3DGS) demonstrates remarkable capabilities in scene rendering. However, its integration with SLAM systems confronts two critical limitations: (1) slow pose tracking caused by full-frame rendering multiple times, and (2) susceptibility to environmental variations such as illumination variations and motion blur. To alleviate these issues, this paper proposes gaussian landmarks-based real-time reconstruction framework — GLT-SLAM, which composes of a ray casting-driven tracking module, a multi-modal keyframe selector, and an incremental geometric–photometric mapping module. To avoid redundant rendering computations, the tracking module achieves efficient 3D-2D correspondence by encoding Gaussian landmark-emitted rays and fusing attention scores. Furthermore, to enhance the framework’s robustness against complex environmental conditions, the keyframe selector balances multiple influencing factors including image quality, tracking uncertainty, information entropy, and feature overlap ratios. Finally, to achieve a compact map representation, the mapping module adds only Gaussian primitives of points, lines, and planes, and performs global map optimization through joint photometric–geometric constraints. Experimental results on the Replica, TUM RGB-D, and BJTU datasets demonstrate that the proposed method achieves a real-time processing rate of over 30 Hz on a platform with an NVIDIA RTX 3090, demonstrating a 19% higher efficiency than the fastest Photo-SLAM method while significantly outperforming other baseline methods in both localization and mapping accuracy. The source code will be available on GitHub.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105869"},"PeriodicalIF":4.2,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-07 DOI: 10.1016/j.imavis.2025.105868
Liwei Deng, Songyu Chen, Xin Yang, Sijuan Huang, Jing Wang
Medical image registration has important applications in medical image analysis. Although deep learning-based registration methods are widely recognized, there is still room for performance improvement in existing algorithms owing to the complex physiological structure of brain images. In this paper, we aim to propose a deformable medical image registration method that is highly accurate and capable of handling complex physiological structures. To this end, we propose DFMNet, a dual-channel fusion method based on GMamba, to achieve accurate brain MRI image registration. Compared with state-of-the-art networks such as TransMorph, DFMNet has a dual-channel network structure with different fusion strategies. We propose the GMamba block to efficiently capture the remote dependencies in moving and fixed image features. Meanwhile, we propose a context extraction channel to enhance the texture structure of the image content. In addition, we design a weighted fusion block so that the features of the two channels can be fused efficiently. Extensive experiments on three public brain datasets demonstrate the effectiveness of DFMNet. The experimental results demonstrate that DFMNet outperforms multiple current state-of-the-art deformable registration methods in structural registration of brain images.
{"title":"A deformable registration framework for brain MR images based on a dual-channel fusion strategy using GMamba","authors":"Liwei Deng , Songyu Chen , Xin Yang , Sijuan Huang , Jing Wang","doi":"10.1016/j.imavis.2025.105868","DOIUrl":"10.1016/j.imavis.2025.105868","url":null,"abstract":"<div><div>Medical image registration has important applications in medical image analysis. Although deep learning-based registration methods are widely recognized, there is still performance improvement space for existing algorithms due to the complex physiological structure of brain images. In this paper, we aim to propose a deformable medical image registration method that is highly accurate and capable of handling complex physiological structures. Therefore, we propose DFMNet, a dual-channel fusion method based on GMamba, to achieve accurate brain MRI image registration. Compared with state-of-the-art networks like TransMorph, DFMNet has a dual-channel network structure with different fusion strategies. We propose the GMamba block to efficiently capture the remote dependencies in moving and fixed image features. Meanwhile, we propose a context extraction channel to enhance the texture structure of the image content. In addition, we designed a weighted fusion block to help the features of the two channels can be fused efficiently. Extensive experiments on three public brain datasets demonstrate the effectiveness of DFMNet. The experimental results demonstrate that DFMNet outperforms multiple current state-of-the-art deformable registration methods in structural registration of brain images.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105868"},"PeriodicalIF":4.2,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 DOI: 10.1016/j.imavis.2025.105856
Yitong Yang, Lei Zhu, Xinyang Yao, Hua Wang, Yang Pan, Bo Zhang
Infrared and visible light image fusion aims to generate integrated representations that synergistically preserve salient thermal targets in the infrared modality and high-resolution textural details in the visible light modality. However, existing methods face two core challenges: First, high-frequency noise in visible images, such as sensor noise and nonuniform illumination artifacts, is often highly coupled with effective textures. Traditional fusion paradigms readily amplify noise interference while enhancing details, leading to structural distortion and visual graininess in fusion results. Second, mainstream approaches predominantly rely on simple aggregation operations like feature stitching or linear weighting, lacking deep modeling of cross-modal semantic correlations. This prevents adaptive interaction and collaborative enhancement of complementary information between modalities, creating a significant trade-off between target saliency and detail preservation. To address these challenges, we propose a dual-branch fusion network based on local contour enhancement. Specifically, it distinguishes and enhances meaningful contour details in a learnable manner while suppressing meaningless noise, thereby purifying the detail information used for fusion at its source. Cross-attention weights are computed based on feature representations extracted from different modal branches, enabling a feature selection mechanism that facilitates dynamic cross-modal interaction between infrared and visible light information. We evaluate our method against 11 state-of-the-art deep learning-based fusion approaches across four benchmark datasets using both subjective assessments and objective metrics. The experimental results demonstrate superior performance on public datasets. Furthermore, YOLOv12-based detection tests reveal that our method achieves higher confidence scores and better overall detection performance compared to other fusion techniques.
{"title":"LCFusion: Infrared and visible image fusion network based on local contour enhancement","authors":"Yitong Yang , Lei Zhu , Xinyang Yao , Hua Wang , Yang Pan , Bo Zhang","doi":"10.1016/j.imavis.2025.105856","DOIUrl":"10.1016/j.imavis.2025.105856","url":null,"abstract":"<div><div>Infrared and visible light image fusion aims to generate integrated representations that synergistically preserve salient thermal targets in the infrared modality and high-resolution textural details in the visible light modality. However, existing methods face two core challenges: First, high-frequency noise in visible images, such as sensor noise and nonuniform illumination artifacts, is often highly coupled with effective textures. Traditional fusion paradigms readily amplify noise interference while enhancing details, leading to structural distortion and visual graininess in fusion results. Second, mainstream approaches predominantly rely on simple aggregation operations like feature stitching or linear weighting, lacking deep modeling of cross-modal semantic correlations. This prevents adaptive interaction and collaborative enhancement of complementary information between modalities, creating a significant trade-off between target saliency and detail preservation. To address these challenges, we propose a dual-branch fusion network based on local contour enhancement. Specifically, it distinguishes and enhances meaningful contour details in a learnable manner while suppressing meaningless noise, thereby purifying the detail information used for fusion at its source. Cross-attention weights are computed based on feature representations extracted from different modal branches, enabling a feature selection mechanism that facilitates dynamic cross-modal interaction between infrared and visible light information. We evaluate our method against 11 state-of-the-art deep learning-based fusion approaches across four benchmark datasets using both subjective assessments and objective metrics. The experimental results demonstrate superior performance on public datasets. Furthermore, YOLOv12-based detection tests reveal that our method achieves higher confidence scores and better overall detection performance compared to other fusion techniques.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105856"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 DOI: 10.1016/j.imavis.2025.105855
Jirui Di, Zhengping Hu, Hehao Zhang, Qiming Zhang, Zhe Sun
Fine-grained actions often lack scene prior information, making strong temporal modeling particularly important. Since these actions primarily rely on subtle and localized motion differences, single-scale features are often insufficient to capture their complexity. In contrast, multi-scale features not only capture fine-grained patterns but also contain rich rhythmic information, which is crucial for modeling temporal dependencies. However, existing methods for processing multi-scale features suffer from two major limitations: they often rely on naive downsampling operations for scale alignment, causing significant structural information loss, and they treat features from different layers equally, without fully exploiting the complementary strengths across hierarchical levels. To address these issues, we propose a novel Wavelet-Based Adaptive Multi-scale Fusion Network (WAM-Net), which consists of three key components: (1) a Wavelet-based Fusion Module (WFM) that achieves feature alignment through wavelet reconstruction, avoiding the structural degradation typically introduced by direct downsampling, (2) an Adaptive Feature Selection Module (AFSM) that dynamically selects and fuses two levels of features based on global information, enabling the network to leverage their complementary advantages, and (3) a Duration Context Encoder (DCE) that extracts temporal duration representations from the overall video length to guide global dependency modeling. Extensive experiments on Diving48, FineGym, and Kinetics-400 demonstrate that our approach consistently outperforms existing state-of-the-art methods.
{"title":"WAM-Net: Wavelet-Based Adaptive Multi-scale Fusion Network for fine-grained action recognition","authors":"Jirui Di, Zhengping Hu, Hehao Zhang, Qiming Zhang, Zhe Sun","doi":"10.1016/j.imavis.2025.105855","DOIUrl":"10.1016/j.imavis.2025.105855","url":null,"abstract":"<div><div>Fine-grained actions often lack scene prior information, making strong temporal modeling particularly important. Since these actions primarily rely on subtle and localized motion differences, single-scale features are often insufficient to capture their complexity. In contrast, multi-scale features not only capture fine-grained patterns but also contain rich rhythmic information, which is crucial for modeling temporal dependencies. However, existing methods for processing multi-scale features suffer from two major limitations: they often rely on naive downsampling operations for scale alignment, causing significant structural information loss, and they treat features from different layers equally, without fully exploiting the complementary strengths across hierarchical levels. To address these issues, we propose a novel Wavelet-Based Adaptive Multi-scale Fusion Network (WAM-Net), which consists of three key components: (1) a Wavelet-based Fusion Module (WFM) that achieves feature alignment through wavelet reconstruction, avoiding the structural degradation typically introduced by direct downsampling, (2) an Adaptive Feature Selection Module (AFSM) that dynamically selects and fuses two levels of features based on global information, enabling the network to leverage their complementary advantages, and (3) a Duration Context Encoder (DCE) that extracts temporal duration representations from the overall video length to guide global dependency modeling. Extensive experiments on Diving48, FineGym, and Kinetics-400 demonstrate that our approach consistently outperforms existing state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105855"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 DOI: 10.1016/j.imavis.2025.105848
Vibhav Ranjan, Kuldeep Chaurasia, Jagendra Singh
Skin cancer detection is a critical task in dermatology, where early diagnosis can significantly improve patient outcomes. In this work, we propose a novel approach for skin cancer classification that combines three deep learning models—InceptionResNetV2 with Soft Attention (SA), ResNet50V2 with SA, and DenseNet201—optimized using a Genetic Algorithm (GA) to find the best ensemble weights. The approach integrates several key innovations: Sigmoid Focal Cross-entropy Loss to address class imbalance, Mish activation for improved gradient flow, and Cosine Annealing learning rate scheduling for enhanced convergence. The GA-based optimization fine-tunes the ensemble weights to maximize classification performance, especially for challenging skin cancer types like melanoma. Experimental results on the HAM10000 dataset demonstrate the effectiveness of the proposed ensemble model, achieving superior accuracy and precision compared to individual models. This work offers a robust framework for skin cancer detection, combining state-of-the-art deep learning techniques with an optimization strategy.
{"title":"Enhancing skin cancer classification with Soft Attention and genetic algorithm-optimized ensemble learning","authors":"Vibhav Ranjan, Kuldeep Chaurasia, Jagendra Singh","doi":"10.1016/j.imavis.2025.105848","DOIUrl":"10.1016/j.imavis.2025.105848","url":null,"abstract":"<div><div>Skin cancer detection is a critical task in dermatology, where early diagnosis can significantly improve patient outcomes. In this work, we propose a novel approach for skin cancer classification that combines three deep learning models—InceptionResNetV2 with Soft Attention (SA), ResNet50V2 with SA, and DenseNet201—optimized using a Genetic Algorithm (GA) to find the best ensemble weights. The approach integrates several key innovations: Sigmoid Focal Cross-entropy Loss to address class imbalance, Mish activation for improved gradient flow, and Cosine Annealing learning rate scheduling for enhanced convergence. The GA-based optimization fine-tunes the ensemble weights to maximize classification performance, especially for challenging skin cancer types like melanoma. Experimental results on the HAM10000 dataset demonstrate the effectiveness of the proposed ensemble model, achieving superior accuracy and precision compared to individual models. This work offers a robust framework for skin cancer detection, combining state-of-the-art deep learning techniques with an optimization strategy.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105848"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 DOI: 10.1016/j.imavis.2025.105857
Pasquale Coscia, Antonio Fusillo, Angelo Genovese, Vincenzo Piuri, Fabio Scotti
Scene geometry estimation from images plays a key role in robotics, augmented reality, and autonomous systems. In particular, Monocular Depth Estimation (MDE) focuses on predicting depth from a single RGB image, avoiding the need for expensive sensors. State-of-the-art approaches use deep learning models for MDE while processing images as a whole, sub-optimally exploiting their spatial information. A recent research direction focuses on smaller image patches, as depth information varies across different regions of an image. This approach reduces model complexity and improves performance by capturing finer spatial details. From this perspective, we propose a novel warp patch-based extraction method that corrects perspective camera distortions, and employ it in tailored training and inference pipelines. Our experimental results show that our patch-based approach outperforms its full-image-trained counterpart and the classical crop patch-based extraction. With our technique, we obtain general performance enhancements over recent state-of-the-art models. Code is available at https://github.com/AntonioFusillo/PatchMDE.
{"title":"On the relevance of patch-based extraction methods for monocular depth estimation","authors":"Pasquale Coscia, Antonio Fusillo, Angelo Genovese, Vincenzo Piuri, Fabio Scotti","doi":"10.1016/j.imavis.2025.105857","DOIUrl":"10.1016/j.imavis.2025.105857","url":null,"abstract":"<div><div>Scene geometry estimation from images plays a key role in robotics, augmented reality, and autonomous systems. In particular, Monocular Depth Estimation (MDE) focuses on predicting depth using a single RGB image, avoiding the need for expensive sensors. State-of-the-art approaches use deep learning models for MDE while processing images as a whole, sub-optimally exploiting their spatial information. A recent research direction focuses on smaller image patches, as depth information varies across different regions of an image. This approach reduces model complexity and improves performance by capturing finer spatial details. From this perspective, we propose a novel warp patch-based extraction method which corrects perspective camera distortions, and employ it in tailored training and inference pipelines. Our experimental results show that our patch-based approach outperforms its full-image-trained counterpart and the classical crop patch-based extraction. With our technique, we obtain a general performance enhancements over recent state-of-the-art models. Code is available at <span><span>https://github.com/AntonioFusillo/PatchMDE</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105857"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-01 DOI: 10.1016/j.imavis.2025.105853
Jagannath Sethi, Jaydeb Bhaumik, Ananda S. Chowdhury
In this paper, we propose a secure, high-quality video recovery scheme that can be useful for diverse applications such as telemedicine and cloud-based surveillance. Our solution consists of deep learning-based video Compressive Sensing (CS) followed by a strategy for encrypting the compressed video. We split a video into a number of Groups Of Pictures (GOPs), where each GOP consists of both keyframes and non-keyframes. The proposed video CS method uses a convolutional neural network (CNN) with a Structural Similarity Index Measure (SSIM)-based loss function. Our recovery process has two stages. In the initial recovery stage, a CNN is employed to make efficient use of spatial redundancy. In the deep recovery stage, non-keyframes are compensated by utilizing both keyframes and neighboring non-keyframes. Keyframes use multilevel feature compensation, and neighboring non-keyframes use single-level feature compensation. Additionally, we propose an unpredictable and complex chaotic map with a broader chaotic range, termed the Sine Symbolic Chaotic Map (SSCM). For encrypting compressed features, we suggest a secure encryption scheme consisting of four operations: Forward Diffusion, Substitution, Backward Diffusion, and XORing with an SSCM-based chaotic sequence. Through extensive experimentation, we establish the efficacy of our combined solution over i) several state-of-the-art image and video CS methods, and ii) a number of video encryption techniques.
{"title":"A robust and secure video recovery scheme with deep compressive sensing","authors":"Jagannath Sethi , Jaydeb Bhaumik , Ananda S. Chowdhury","doi":"10.1016/j.imavis.2025.105853","DOIUrl":"10.1016/j.imavis.2025.105853","url":null,"abstract":"<div><div>In this paper, we propose a secure high quality video recovery scheme which can be useful for diverse applications like telemedicine and cloud-based surveillance. Our solution consists of deep learning-based video Compressive Sensing (CS) followed by a strategy for encrypting the compressed video. We split a video into a number of Groups Of Pictures (GOPs), where, each GOP consists of both keyframes and non-keyframes. The proposed video CS method uses a convolutional neural network (CNN) with a Structural Similarity Index Measure (SSIM) based loss function. Our recovery process has two stages. In the initial recovery stage, CNN is employed to make efficient use of spatial redundancy. In the deep recovery stage, non-keyframes are compensated by utilizing both keyframes and neighboring non-keyframes. Keyframes use multilevel feature compensation, and neighboring non-keyframes use single-level feature compensation. Additionally, we propose an unpredictable and complex chaotic map, with a broader chaotic range, termed as Sine Symbolic Chaotic Map (SSCM). For encrypting compressed features, we suggest a secure encryption scheme consisting of four operations: Forward Diffusion, Substitution, Backward Diffusion, and XORing with SSCM based chaotic sequence. Through extensive experimentation, we establish the efficacy of our combined solution over i) several state-of-the-art image and video CS methods, and ii) a number of video encryption techniques.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105853"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-29 DOI: 10.1016/j.imavis.2025.105847
Huan Wan, Jing Ai, Jing Liu, Xin Wei, Jinshan Zeng, Jianyi Wan
Accurate polyp segmentation from colonoscopy images is crucial for diagnosing and treating colorectal diseases. Although many automatic polyp segmentation models have been proposed and achieved good progress, they still suffer from under-segmentation or over-segmentation problems caused by the characteristics of colonoscopy images: blurred boundaries and widely varying polyp sizes. To address these problems, we propose a novel model, the Distinct Polyp Generator Network (DPG-Net), for polyp segmentation. In DPG-Net, a Feature Progressive Enhancement Module (FPEM) and a Dynamical Aggregation Module (DAM) are developed. The proposed FPEM is responsible for enhancing the polyps and polyp boundaries by jointly utilizing boundary information and global prior information. Simultaneously, the DAM is developed to integrate all decoding features based on their own traits and detect polyps of various sizes. Finally, accurate segmentation results are obtained. Extensive experiments on five widely used datasets demonstrate that the proposed DPG-Net model is superior to state-of-the-art models. To evaluate cross-domain generalization ability, we adopt the proposed DPG-Net for the skin lesion segmentation task. Again, experimental results show that our DPG-Net achieves advanced performance in this task, which verifies the strong generalizability of DPG-Net.
{"title":"Distinct Polyp Generator Network for polyp segmentation","authors":"Huan Wan , Jing Ai , Jing Liu , Xin Wei , Jinshan Zeng , Jianyi Wan","doi":"10.1016/j.imavis.2025.105847","DOIUrl":"10.1016/j.imavis.2025.105847","url":null,"abstract":"<div><div>Accurate polyp segmentation from the colonoscopy images is crucial for diagnosing and treating colorectal diseases. Although many automatic polyp segmentation models have been proposed and achieved good progress, they still suffer from under-segmentation or over-segmentation problems caused by the characteristics of colonoscopy images: blurred boundaries and widely varied polyp sizes. To address these problems, we propose a novel model, the Distinct Polyp Generator Network (DPG-Net), for polyp segmentation. In DPG-Net, a Feature Progressive Enhancement Module (FPEM) and a Dynamical Aggregation Module (DAM) are developed. The proposed FPEM is responsible for enhancing the polyps and polyp boundaries by jointly utilizing the boundary information and global prior information. Simultaneously, a DAM is developed to integrate all decoding features based on their own traits and detect polyps with various sizes. Finally, accurate segmentation results are obtained. Extensive experiments on five widely used datasets demonstrate that the proposed DPG-Net model is superior to the state-of-the-art models. To evaluate the cross-domain generalization ability, we adopt the proposed DPG-Net for the skin lesion segmentation task. Again, experimental results show that our DPG-Net achieves advanced performance in this task, which verifies the strong generalizability of DPG-Net.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105847"},"PeriodicalIF":4.2,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-29 DOI: 10.1016/j.imavis.2025.105854
Thurimerla Prasanth, Ram Prasad Padhy, B. Sivaselvan
Autonomous Vehicles (AVs), a vital component of intelligent transportation, depend on sophisticated perception systems to ensure safe and smooth navigation. Perception enables real-time analysis and understanding of the environment for effective decision-making. 3D object detection (3D-OD) is crucial among perception tasks, as it accurately determines the 3D geometry and spatial positioning of surrounding objects. The commonly used modalities for 3D-OD are camera, LiDAR, and sensor fusion. In this work, we propose a LiDAR-based 3D-OD approach using point cloud data. The proposed model achieves superior performance while maintaining computational efficiency. The approach uses pillar-based LiDAR processing and only 2D convolutions, which keeps the model pipeline simple and efficient. We propose a Cascaded Convolutional Backbone (CCB) integrated with 1 × 1 convolutions to improve detection accuracy, and combine the fast pillar-based encoding with this lightweight backbone. The reduced complexity makes the model well-suited for real-time navigation of an AV. We evaluated our model on the official KITTI test server, where it achieves competitive results on the 3D and Bird's Eye View (BEV) detection benchmarks for the car and cyclist classes. The results of our proposed model are featured on the official KITTI leaderboard.
{"title":"PCNet3D++: A pillar-based cascaded 3D object detection model with an enhanced 2D backbone","authors":"Thurimerla Prasanth , Ram Prasad Padhy , B. Sivaselvan","doi":"10.1016/j.imavis.2025.105854","DOIUrl":"10.1016/j.imavis.2025.105854","url":null,"abstract":"<div><div>Autonomous Vehicles (AVs) depend on sophisticated perception systems to serve as the vital component of intelligent transportation to ensure secure and smooth navigation. Perception is an essential component of AVs and enables real-time analysis and understanding of the environment for effective decision-making. 3D object detection (3D-OD) is crucial among perception tasks as it accurately determines the 3D geometry and spatial positioning of surrounding objects. The commonly used modalities for 3D-OD are camera, LiDAR, and sensor fusion. In this work, we propose a LiDAR-based 3D-OD approach using point cloud data. The proposed model achieves superior performance while maintaining computational efficiency. This approach utilizes Pillar-based LiDAR processing and uses only 2D convolutions. The model pipeline becomes simple and more efficient by employing only 2D convolutions. We propose a Cascaded Convolutional Backbone (CCB) integrated with 1 × 1 convolutions to improve detection accuracy. We combined the fast Pillar-based encoding with our lightweight backbone. The proposed model reduces complexity to make it well-suited for real-time navigation of an AV. We evaluated our model on the official KITTI test server. The model results are decent in 3D and Bird’s Eye View (BEV) detection benchmarks for the car and cyclist classes. The results of our proposed model are featured on the official KITTI leaderboard.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105854"},"PeriodicalIF":4.2,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}