
Image and Vision Computing: Latest Publications

BSA-Dehaze: Multi-Scale Bitemporal Fusion and Size-Aware Decoder for Unsupervised Image Dehazing
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105819
Wujin Li, Qian Xing, Wei He, Longyuan Guo, Jianhui Wu, Minzhi Zhao, Siyuan Chen
Single-image dehazing plays a critical role in various autonomous vision systems. Early methods relied on hand-crafted optimization techniques, whereas recent approaches leverage deep neural networks trained on synthetic data, owing to the scarcity of real-world paired datasets. However, this often results in domain bias when applied to outdoor scenes. In this paper, we present BSA-Dehaze, an unsupervised single-image dehazing framework that integrates a Multi-Scale Bitemporal Fusion Module (MBFM) and a Size-Aware Decoder (SA-Decoder). The method operates without requiring ground-truth images. Our method reformulates dehazing as a haze-to-clear image translation task. BSA-Dehaze incorporates a novel Encoder-SA-Decoder built with ResNet blocks, designed to better preserve image details and edge sharpness. To enhance feature fusion and training efficiency, we introduce the MBFM. A multi-scale discriminator (MSD) is proposed, along with Hinge Loss and Dynamic Block-wise Contrastive Loss, to improve training stability and emphasize challenging samples. Ablation studies verify the contribution of each component. Experimental results on SOTS outdoor, BeDDE, and a real-world dataset demonstrate that our method surpasses existing approaches in both performance and efficiency, despite being trained on significantly less data.
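A minimal sketch of the hinge adversarial loss named above, assuming generic discriminator scores; the tensor names and shapes are illustrative, and the multi-scale discriminator and the Dynamic Block-wise Contrastive Loss are not reproduced here.

```python
# Hedged sketch: standard hinge GAN losses of the kind the abstract mentions.
# d_real / d_fake are illustrative stand-ins for discriminator outputs.
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator hinge loss: raise the discriminator score of generated (dehazed) images."""
    return -d_fake.mean()

# Example usage with random scores standing in for multi-scale discriminator outputs.
d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(d_hinge_loss(d_real, d_fake).item(), g_hinge_loss(d_fake).item())
```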
Citations: 0
A comprehensive survey on magnetic resonance image reconstruction
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105832
Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Chengtao Liu, Weixin Si, Beiji Zou
Magnetic resonance imaging (MRI) reconstruction is a fundamental task aimed at recovering high-quality images from undersampled or low-quality MRI data. This process enhances diagnostic accuracy and optimizes clinical applications. In recent years, deep learning-based MRI reconstruction has made significant progress. Advancements include single-modality feature extraction using different network architectures, the integration of multimodal information, and the adoption of unsupervised or semi-supervised learning strategies. However, despite extensive research, MRI reconstruction remains a challenging problem that has yet to be fully resolved. This survey provides a systematic review of MRI reconstruction methods, covering key aspects such as data acquisition and preprocessing, publicly available datasets, single and multi-modal reconstruction models, training strategies, and evaluation metrics based on image reconstruction and downstream tasks. Additionally, we analyze the major challenges in this field and explore potential future directions.
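To make the reconstruction problem the survey addresses concrete, here is a minimal zero-filled baseline for undersampled Cartesian MRI: mask k-space, inverse-transform, and measure the resulting aliasing error. The sampling rate, mask pattern, and the random stand-in image are assumptions for illustration only.

```python
# Hedged sketch: the basic undersampled-MRI setting that learned reconstruction improves on.
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))          # stand-in for a ground-truth slice
kspace = np.fft.fftshift(np.fft.fft2(image))     # fully sampled k-space

mask = rng.random(128) < 0.3                     # keep ~30% of phase-encode lines
mask[64 - 8:64 + 8] = True                       # always keep the low-frequency center
undersampled = kspace * mask[None, :]

zero_filled = np.fft.ifft2(np.fft.ifftshift(undersampled)).real
print("aliasing error:", np.abs(zero_filled - image).mean())
```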
Citations: 0
HF-D-FINE: High-resolution features enhanced D-FINE for tiny object detection in UAV image
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105834
Qianhua Hu, Liantao Wang
Real-time detection in UAV-captured imagery remains a formidable challenge, primarily owing to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces an aspect-ratio consistency penalty factor and auxiliary boxes that build on the advantages of Shape-IoU and NWD, making it more suitable for tiny object detection tasks. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with a tiny increase in FLOPs. On the VisDrone dataset, AP increases by 3.2% over D-FINE-S and AP50 by 4.3%, confirming the method's efficacy and superiority for tiny object detection in UAV images.
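Since the abstract references NWD without details, the sketch below implements the commonly used Normalized Wasserstein Distance for tiny boxes that Outer-SNWD presumably builds on; the normalizing constant C and the (cx, cy, w, h) box layout are assumptions, not the paper's implementation.

```python
# Hedged sketch: Normalized Wasserstein Distance between two boxes modeled as 2-D Gaussians.
import torch

def nwd(box1: torch.Tensor, box2: torch.Tensor, C: float = 12.8) -> torch.Tensor:
    """Boxes are (..., 4) tensors in (cx, cy, w, h) format."""
    # Each box is modeled as a Gaussian N([cx, cy], diag((w/2)^2, (h/2)^2)).
    g1 = torch.stack([box1[..., 0], box1[..., 1], box1[..., 2] / 2, box1[..., 3] / 2], dim=-1)
    g2 = torch.stack([box2[..., 0], box2[..., 1], box2[..., 2] / 2, box2[..., 3] / 2], dim=-1)
    w2_dist = torch.linalg.norm(g1 - g2, dim=-1)   # 2-Wasserstein distance between the Gaussians
    return torch.exp(-w2_dist / C)                 # maps distance to a (0, 1] similarity

a = torch.tensor([10.0, 10.0, 4.0, 4.0])
b = torch.tensor([11.0, 10.5, 4.0, 5.0])
print(nwd(a, b))   # nearby small boxes give a similarity close to 1
```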
Citations: 0
Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi, Mrim M. Alnfiai, Mona Mohammed Alnahari, Salma Mohsen M. Alnefaie, Faiz Abdullah Alotaibi

Objective

To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.

Methods

The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop the proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects. (A minimal sketch of the recursive Bayesian update appears after the Conclusion below.)

Results

The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.

Conclusion

The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
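The Methods section above relies on recursive Bayesian filtering to fuse sensor readings into an environmental-context estimate. The sketch below is a minimal discrete Bayes update over invented states, likelihoods, and ambient-light readings; none of these values come from the study.

```python
# Hedged sketch: a discrete recursive Bayesian update fusing sensor evidence into a
# context belief. States, likelihood table, and readings are invented for illustration.
import numpy as np

states = ["indoor", "outdoor", "transition"]
belief = np.array([1 / 3, 1 / 3, 1 / 3])              # uniform prior over contexts

# P(observation | state) for a discretized ambient-light reading: low / medium / high.
light_likelihood = np.array([
    [0.6, 0.3, 0.1],   # indoor
    [0.1, 0.3, 0.6],   # outdoor
    [0.3, 0.4, 0.3],   # transition
])

def bayes_update(belief: np.ndarray, likelihood_col: np.ndarray) -> np.ndarray:
    posterior = belief * likelihood_col
    return posterior / posterior.sum()

for reading in [2, 2, 1]:                              # two "high" then one "medium" reading
    belief = bayes_update(belief, light_likelihood[:, reading])
print(dict(zip(states, belief.round(3))))              # belief shifts toward "outdoor"
```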
Citations: 0
A human layout consistency framework for image-based virtual try-on
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang, Zhicheng Wang, Hao Liu, Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, which require no ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
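As a minimal illustration of the human-layout-consistency supervision described above, the sketch below scores a pixel-wise cross-entropy between layout logits parsed from the try-on result (standing in for HLP) and an expected layout map (standing in for HLG); shapes and the class count are assumptions.

```python
# Hedged sketch: layout-consistency supervision reduced to pixel-wise cross-entropy.
import torch
import torch.nn.functional as F

num_classes, h, w = 7, 64, 48                      # e.g. background, skin, hair, garment, ...
parsed_logits = torch.randn(2, num_classes, h, w)  # HLP-style logits on the generated try-on image
expected_layout = torch.randint(0, num_classes, (2, h, w))  # HLG-style expected layout map

layout_consistency_loss = F.cross_entropy(parsed_logits, expected_layout)
print(layout_consistency_loss.item())
```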
Citations: 0
Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu, Fuming Sun, Haojie Li, Mingyu Lu
Most existing RGB-D salient object detection methods rely on convolution operations to build complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important to salient object detection (SOD). Discrepancies between the two modalities, however, seriously hinder such models from achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) to achieve RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to optimize erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, providing new ideas for salient object detection tasks.
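The sketch below shows one plausible channel-attentive RGB-depth fusion step in the spirit of the progressive fusion module described above; the exact module layout is an assumption, not the paper's design.

```python
# Hedged sketch: a simple channel-gated fusion of RGB and depth features.
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        both = torch.cat([rgb_feat, depth_feat], dim=1)
        attn = self.gate(both)                       # per-channel weights from both modalities
        fused = self.merge(both)
        return fused * attn + rgb_feat               # residual keeps the RGB stream dominant

x_rgb, x_d = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(ChannelFusion(64)(x_rgb, x_d).shape)           # torch.Size([1, 64, 32, 32])
```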
Citations: 0
CFE-PVTSeg: Cross-domain frequency-enhanced pyramid vision transformer segmentation network
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
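To illustrate the frequency-domain branch described above, the sketch below extracts the high-frequency sub-bands of a Haar DWT using PyWavelets; the library choice, wavelet, and input are assumptions, since the abstract does not specify the implementation.

```python
# Hedged sketch: one level of a 2-D discrete wavelet transform, keeping the detail bands
# that a frequency branch would reinforce.
import numpy as np
import pywt

image = np.random.rand(128, 128).astype(np.float32)       # stand-in for a grayscale frame
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")                # approximation + H/V/D detail bands

high_freq = np.stack([cH, cV, cD], axis=0)                 # edge-like high-frequency components
print(cA.shape, high_freq.shape)                           # (64, 64) (3, 64, 64)
```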
Citations: 0
Better early detector for high-performance detection transformer
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105829
Bin Hu, Bencheng Liao, Jiyang Qi, Shusheng Yang, Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection - a fundamental task in computer vision and applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the best AP-latency trade-off among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameters on par with the previous state-of-the-art CNN-based GFLV2 framework for the first time.
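The attention-map-to-feature-map auxiliary loss is not specified in the abstract; the sketch below shows one plausible reading, aligning a normalized attention map with the spatial energy of a feature map via a KL term, purely as an illustration.

```python
# Hedged sketch: align a normalized attention map with normalized feature energy so the
# detector "knows where to focus". Illustrative only, not the paper's loss.
import torch
import torch.nn.functional as F

def attn_feat_aux_loss(attn: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, W) attention map; feat: (B, C, H, W) backbone feature map."""
    feat_energy = feat.pow(2).mean(dim=1)                    # (B, H, W) spatial activation
    attn_n = F.softmax(attn.flatten(1), dim=1)
    feat_n = F.softmax(feat_energy.flatten(1), dim=1)
    return F.kl_div(attn_n.log(), feat_n, reduction="batchmean")

attn, feat = torch.rand(2, 20, 20), torch.randn(2, 256, 20, 20)
print(attn_feat_aux_loss(attn, feat).item())
```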
Citations: 0
MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-15 | DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.
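The CBAM component mentioned above follows a standard published formulation, sketched below with channel attention from pooled descriptors followed by spatial attention from channel statistics; the channel count and reduction ratio are illustrative.

```python
# Hedged sketch: standard CBAM-style channel + spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                       # avg-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                        # max-pooled channel descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)         # channel attention
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))            # spatial attention

print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)                # torch.Size([1, 64, 32, 32])
```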
Citations: 0
MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli, Arshad Jamal, Puneet Goyal, Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects from high drone altitudes, sudden movements of the drone’s gimbal and limited appearance diversity of objects. The frequent occlusion in these challenging conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone’s metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves a state-of-the-art performance attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.
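MARe-ID's use of GPS, altitude, and camera orientation suggests gating re-identification by approximate world-space distance. The sketch below projects pixel detections to metric ground offsets under a flat-ground, nadir-camera assumption and gates matches by distance; the field of view, gate threshold, and geometry are illustrative assumptions only.

```python
# Hedged sketch: metadata-based gating for re-identification from drone imagery.
import math

def pixel_to_ground_offset(px, py, img_w, img_h, altitude_m, hfov_deg=80.0):
    """Return the (east, north) offset in meters of a pixel from the image center,
    assuming a straight-down camera over flat ground."""
    ground_width = 2 * altitude_m * math.tan(math.radians(hfov_deg / 2))
    meters_per_px = ground_width / img_w
    return (px - img_w / 2) * meters_per_px, (img_h / 2 - py) * meters_per_px

def same_object(offset_a, offset_b, gate_m=5.0):
    """Gate a re-identification candidate by metric ground distance."""
    return math.dist(offset_a, offset_b) < gate_m

a = pixel_to_ground_offset(640, 360, 1280, 720, altitude_m=50)
b = pixel_to_ground_offset(660, 355, 1280, 720, altitude_m=50)
print(a, b, same_object(a, b))
```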
Citations: 0