Object localization is a fundamental computer-vision task that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them a natural foundation for unsupervised object localization: they capture object-relevant features without the need for costly manual labels. These encoders produce semantically coherent patch embeddings, and grouping these embeddings reveals sets of patches that correspond to objects in an image. Such patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying offline mask clustering or leveraging pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects, transforming initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels) and aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings, whereas previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
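The patch-grouping step described above can be illustrated with a minimal toy sketch. The code below is not any particular method from the literature: it uses synthetic stand-in embeddings (real methods would use patch features from a self-supervised ViT such as DINO) and a plain 2-means grouping, then derives a mask and a bounding box from the resulting patch set.

```python
import numpy as np

# Synthetic stand-in for encoder output: a 4x4 grid of 8-dim patch embeddings.
# Central patches are shifted to mimic a semantically coherent "object" cluster.
rng = np.random.default_rng(0)
H = W = 4
D = 8
emb = rng.normal(0.0, 0.1, size=(H, W, D))
emb[1:3, 1:3] += 1.0                     # hypothetical "object" patches

X = emb.reshape(-1, D)

def two_means(X, iters=20):
    """Plain 2-means: assign each patch to its nearest centroid, then update."""
    # Initialize centroids from the patches with smallest / largest feature sum.
    c = np.stack([X[X.sum(1).argmin()], X[X.sum(1).argmax()]])
    for _ in range(iters):
        d = ((X[:, None, :] - c[None]) ** 2).sum(-1)   # (N, 2) squared distances
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                c[k] = X[labels == k].mean(0)
    return labels

# Group patches; cluster 1 is seeded from the high-activation patch, so it
# plays the role of the discovered object here.
mask = (two_means(X) == 1).reshape(H, W)

# Convert the patch set into a bounding box (x_min, y_min, x_max, y_max).
ys, xs = np.where(mask)
box = (xs.min(), ys.min(), xs.max(), ys.max())
```

On this toy input the mask recovers the shifted central patches and the box tightly encloses them; real pipelines replace the synthetic features with encoder patch embeddings and often use spectral or graph-based grouping instead of k-means.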
