
Image and Vision Computing: Latest Publications

PAdapter: Adapter combined with prompt for image and video classification
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105395
Youwei Li, Junyong Ye, Xubin Wen, Guangyi Xu, Jingjing Wang, Xinyuan Liu
In computer vision, parameter-efficient transfer learning has become a widely used technique. The Adapter is one of its most common building blocks, and its simplicity and efficiency have been demonstrated repeatedly. With the network backbone frozen, fine-tuning only additional adapters can often match or even exceed full fine-tuning at a lower computational cost. However, the Adapter's bottleneck structure incurs a non-negligible information loss, which limits its performance. To alleviate this problem, this work proposes a plug-and-play lightweight module called PAdapter, a Prompt-combined Adapter that achieves parameter-efficient transfer learning on image classification and video action recognition tasks. PAdapter builds on the Adapter and introduces a Prompt at the bottleneck to compensate for information that may be lost. Specifically, within the Adapter's bottleneck structure, a learnable Prompt is concatenated with the bottleneck features at dimension D to supplement information and even enhance the visual expressiveness of the bottleneck features. Extensive experiments on image classification and video action recognition show that PAdapter matches or exceeds the accuracy of fully fine-tuned models while updating less than 2% additional parameters. For example, on the SSv2 and HMDB-51 datasets, PAdapter improves accuracy by 5.49% and 16.68%, respectively, over full fine-tuning. In almost all experiments, PAdapter achieves higher accuracy than the Adapter with a similar number of tunable parameters. Code is available at https://github.com/owlholy/PAdapter.
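The abstract's core idea, a learnable prompt concatenated with the adapter's bottleneck features, can be pictured roughly as in the PyTorch-style sketch below. This is not the authors' implementation: the module name, dimensions, the token-axis concatenation, and the residual connection are all assumptions, and the paper's exact interpretation of "concatenation at dimension D" may differ.

```python
import torch
import torch.nn as nn

class PAdapterSketch(nn.Module):
    """Illustrative adapter bottleneck with a learnable prompt concatenated
    to the bottleneck features (names and axes are assumptions)."""
    def __init__(self, dim=768, bottleneck_dim=64, prompt_len=8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)      # down-projection
        self.act = nn.GELU()
        # learnable prompt tokens living in the bottleneck space
        self.prompt = nn.Parameter(torch.randn(prompt_len, bottleneck_dim) * 0.02)
        self.up = nn.Linear(bottleneck_dim, dim)        # up-projection

    def forward(self, x):                               # x: (B, N, dim)
        z = self.act(self.down(x))                      # (B, N, bottleneck_dim)
        prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        z = torch.cat([prompt, z], dim=1)               # concat along the token axis
        out = self.up(z)[:, prompt.size(1):, :]         # drop prompt tokens, keep N
        return x + out                                  # residual adapter output
```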
Citations: 0
Correlation embedding semantic-enhanced hashing for multimedia retrieval
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2025.105421
Yunfei Chen, Yitian Long, Zhan Yang, Jun Long
Due to its advantages in feature extraction and information processing, deep hashing has achieved significant success in multimedia retrieval. Current mainstream unsupervised multimedia hashing methods do not incorporate associative relationship information as part of the original features when generating hash codes, and their similarity measurements ignore the transitivity of similarity. To address these challenges, we propose Correlation Embedding Semantic-Enhanced Hashing (CESEH) for multimedia retrieval, which consists primarily of a semantic-enhanced similarity construction module and a correlation embedding hashing module. First, the semantic-enhanced similarity construction module generates a semantically enriched similarity matrix by thoroughly exploring similarity adjacency relationships and deep semantic associations within the original data. Next, the correlation embedding hashing module integrates the semantic-enhanced similarity information with intra-modal semantic information, achieving precise correlation embedding while preserving semantic information integrity. Extensive experiments on three widely used datasets demonstrate that CESEH outperforms state-of-the-art unsupervised hashing methods in both retrieval accuracy and robustness. The code is available at https://github.com/YunfeiChenMY/CESEH.
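As a rough illustration of how a similarity matrix could be enriched to respect transitivity, the sketch below combines first-order cosine similarity with a second-order (neighbor-of-neighbor) term. The weighting and normalization are assumptions for illustration only, not the CESEH construction.

```python
import torch
import torch.nn.functional as F

def semantic_enhanced_similarity(features, alpha=0.6):
    """Illustrative similarity construction: cosine similarity plus a
    second-order (transitive) term. alpha and the rescaling are assumptions."""
    f = F.normalize(features, dim=1)          # (N, d) unit-norm features
    s = f @ f.t()                             # first-order cosine similarity
    s2 = (s @ s) / s.size(0)                  # neighbor-of-neighbor similarity
    s2 = s2 / (s2.abs().max() + 1e-8)         # rescale to a comparable range
    return alpha * s + (1 - alpha) * s2       # semantically enriched matrix
```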
Citations: 0
A novel framework for diverse video generation from a single video using frame-conditioned denoising diffusion probabilistic model and ConvNeXt-V2
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2025.105422
Ayushi Verma, Tapas Badal, Abhay Bansal
The Denoising Diffusion Probabilistic Model (DDPM) has significantly advanced video generation and synthesis, but it relies on extensive datasets for training. This study presents a novel method for generating videos from a single video through a frame-conditioned DDPM. In addition, incorporating the ConvNeXt-V2 model significantly strengthens the framework's feature extraction and improves video generation performance. Addressing the data scarcity challenge in video generation, the proposed framework exploits a single video's intrinsic complexities and temporal dynamics to generate diverse and realistic sequences. The model's ability to generalize motion is demonstrated through thorough quantitative assessments, in which it is trained on segments of the original video and evaluated on previously unseen frames. Integrating Global Response Normalization and the Sigmoid-Weighted Linear Unit (SiLU) activation function within the DDPM framework enhances the quality of the generated videos. The proposed model markedly outperforms the SinFusion model across key image-quality metrics, achieving a lower Fréchet Video Distance (FVD) of 106.52, a lower Learned Perceptual Image Patch Similarity (LPIPS) of 0.085, a higher Structural Similarity Index Measure (SSIM) of 0.089, and a higher Nearest-Neighbor-Field based diversity (NNFDIV) score of 0.44. Furthermore, the model achieves a Peak Signal-to-Noise Ratio (PSNR) of 23.95, demonstrating its strength in preserving image integrity despite noise. The integration of Global Response Normalization and SiLU significantly enhances content synthesis, while ConvNeXt-V2 boosts feature extraction, amplifying the model's efficacy.
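Of the components named above, Global Response Normalization (GRN) is the most self-contained; a channels-last sketch following the ConvNeXt-V2 formulation is shown below. How the authors integrate it into the frame-conditioned DDPM is not shown here, and the shape convention is an assumption.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (ConvNeXt-V2 style), channels-last.
    Shown only to illustrate the component named in the abstract."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                   # x: (B, H, W, C)
        gx = x.pow(2).sum(dim=(1, 2), keepdim=True).sqrt()  # global L2 norm per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)  # divisive normalization
        return self.gamma * (x * nx) + self.beta + x        # calibrated residual output
```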
Citations: 0
A temporally-aware noise-informed invertible network for progressive video denoising
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105369
Yan Huang, Huixin Luo, Yong Xu, Xian-Bing Meng
Video denoising is a critical task in computer vision that aims to enhance video quality by removing noise from consecutive frames. Despite significant progress, existing video denoising methods still struggle to maintain temporal consistency and to adapt to different noise levels. To address these issues, a temporally-aware, noise-informed invertible network is proposed, following a divide-and-conquer principle for progressive video denoising. Specifically, a recurrent attention-based reversible network is designed to extract temporal information from consecutive frames, tackling the problem of learning temporal consistency. Simultaneously, a noise-informed two-way dense block is developed that uses estimated noise as conditional guidance to adapt to different noise levels; this guidance then steers the learning of the dense block for efficient video denoising. Within the invertible-network framework, the two designed parts are integrated to enable invertible learning for progressive video denoising. Experiments and comparative studies demonstrate that our method achieves good denoising accuracy and fast inference in both synthetic scenes and real-world applications.
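The noise-informed guidance can be pictured as conditioning a block on an estimated noise map; the sketch below uses a generic FiLM-style scale-and-shift modulation as a stand-in. The block layout, channel counts, and modulation form are assumptions, not the paper's two-way dense block.

```python
import torch
import torch.nn as nn

class NoiseConditionedBlock(nn.Module):
    """Illustrative noise-informed guidance: an estimated noise map is mapped to
    per-channel scale/shift that modulates the block's features (FiLM-style)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.to_scale_shift = nn.Conv2d(1, 2 * channels, 1)  # from a 1-channel noise map

    def forward(self, feat, noise_map):     # feat: (B, C, H, W), noise_map: (B, 1, H, W)
        scale, shift = self.to_scale_shift(noise_map).chunk(2, dim=1)
        return feat + self.body(feat) * (1 + scale) + shift  # noise-guided residual
```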
Citations: 0
Enhancing 3D object detection in autonomous vehicles based on synthetic virtual environment analysis
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105385
Vladislav Li, Ilias Siniosoglou, Thomai Karamitsou, Anastasios Lytos, Ioannis D. Moscholios, Sotirios K. Goudos, Jyoti S. Banerjee, Panagiotis Sarigiannidis, Vasileios Argyriou
Autonomous Vehicles (AVs) rely on real-time processing of natural images and videos for scene understanding and safety assurance through proactive object detection. Traditional methods have primarily focused on 2D object detection, limiting their spatial understanding. This study introduces a novel approach by leveraging 3D object detection in conjunction with augmented reality (AR) ecosystems for enhanced real-time scene analysis. Our approach pioneers the integration of a synthetic dataset, designed to simulate various environmental, lighting, and spatiotemporal conditions, to train and evaluate an AI model capable of deducing 3D bounding boxes. This dataset, with its diverse weather conditions and varying camera settings, allows us to explore detection performance in highly challenging scenarios. The proposed method also significantly improves processing times while maintaining accuracy, offering competitive results in conditions previously considered difficult for object recognition. The combination of 3D detection within the AR framework and the use of synthetic data to tackle environmental complexity marks a notable contribution to the field of AV scene analysis.
Citations: 0
Spatial–temporal sequential network for anomaly detection based on long short-term magnitude representation
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105388
Zhongyue Wang, Ying Chen
Notable advancements have been made in video anomaly detection in recent years. Most existing methods treat the task as a weakly-supervised classification problem based on multi-instance learning. However, the identification of key clips is imprecise because the spatial and temporal information in the video clips is not effectively connected. To address this issue, we propose the Spatial-Temporal Sequential Network (STSN), which employs a Long Short-Term Magnitude Representation (LST-MR). Spatial and temporal information is processed sequentially within a spatial-temporal sequential structure, with the goal of improving temporal localization through the use of spatial information. Furthermore, the long short-term magnitude representation is employed in spatial and temporal graphs to improve the identification of key clips from both global and local perspectives. Classification loss and distance loss are combined with magnitude guidance to reduce missed anomalous behaviors. Results on three widely used datasets (UCF-Crime, ShanghaiTech, and XD-Violence) demonstrate that the proposed method performs favorably against existing methods.
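A common way to realize a classification-plus-distance objective in weakly-supervised multi-instance learning is sketched below: a top-k MIL classification loss on clip scores plus a margin loss that separates feature magnitudes of anomalous and normal videos. The values of k, the margin, and the exact loss forms are assumptions rather than the STSN formulation.

```python
import torch
import torch.nn.functional as F

def mil_losses(scores, magnitudes, labels, k=3, margin=1.0):
    """Illustrative bag-level losses for weakly-supervised anomaly detection.
    scores, magnitudes: (B, T) per-clip values; labels: (B,) with 1 = anomalous video."""
    topk_scores = scores.topk(k, dim=1).values.mean(dim=1)      # bag-level anomaly score
    cls_loss = F.binary_cross_entropy_with_logits(topk_scores, labels.float())

    mag = magnitudes.topk(k, dim=1).values.mean(dim=1)          # bag-level magnitude
    abn, nor = mag[labels == 1], mag[labels == 0]
    if abn.numel() and nor.numel():
        dist_loss = F.relu(margin - (abn.mean() - nor.mean()))  # push anomalous above normal
    else:
        dist_loss = scores.new_zeros(())
    return cls_loss + dist_loss
```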
Citations: 0
Self-ensembling for 3D point cloud domain adaptation
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105409
Qing Li, Xiaojiang Peng, Chuan Yan, Pan Gao, Qi Hao
3D point cloud learning has recently become a hot topic in computer vision and autonomous driving. Because manually annotating a high-quality, large-scale 3D point cloud dataset is difficult, unsupervised domain adaptation (UDA), which aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain, is popular in 3D point cloud learning. Existing methods mainly resort to deformation reconstruction in the target domain, leveraging the deformation-invariance process for generalization and domain adaptation. In this paper, we propose a conceptually new yet simple method, termed the self-ensembling network (SEN), for domain generalization and adaptation. In SEN, we propose a soft classification loss on the source domain and a consistency loss on the target domain to stabilize feature representations and capture better invariance in the UDA task. In addition, we extend the pointmixup module to the target domain to increase the diversity of point clouds, which further boosts cross-domain generalization. Extensive experiments on several 3D point cloud UDA benchmarks show that SEN outperforms state-of-the-art methods on both classification and segmentation tasks.
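Self-ensembling typically pairs a student network with an exponential-moving-average teacher and enforces prediction consistency on unlabeled data; the sketch below shows that pattern together with a soft source-domain classification loss. Loss forms, the momentum value, and function names are assumptions, not the exact SEN objectives.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Self-ensembling teacher: exponential moving average of student weights."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def sen_style_losses(student_logits_src, soft_labels_src,
                     student_logits_tgt, teacher_logits_tgt):
    """Illustrative objectives: soft classification on labeled source data and
    student-teacher consistency on unlabeled target data (forms are assumptions)."""
    # soft_labels_src: (B, C) probability targets, e.g. smoothed one-hot labels
    soft_cls = F.kl_div(F.log_softmax(student_logits_src, dim=1),
                        soft_labels_src, reduction="batchmean")
    consistency = F.mse_loss(F.softmax(student_logits_tgt, dim=1),
                             F.softmax(teacher_logits_tgt, dim=1).detach())
    return soft_cls + consistency
```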
Citations: 0
Spatial–temporal-channel collaborative feature learning with transformers for infrared small target detection
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2025.105435
Sicheng Zhu, Luping Ji, Shengjia Chen, Weiwei Duan
Infrared small target detection is of significant importance for real-world applications, particularly military ones. However, it faces several notable challenges, such as limited target information. Owing to the localized nature of Convolutional Neural Networks (CNNs), most CNN-based methods are inefficient at extracting and preserving global information, potentially leading to the loss of detailed information. In this work, we propose a transformer-based method named the Spatial-Temporal-Channel collaborative feature learning network (STC). Recognizing the difficulty of detecting small targets from spatial information alone, we incorporate temporal and channel information into our approach. Unlike the Vision Transformer used in other vision tasks, STC comprises three distinct transformer encoders that extract spatial, temporal, and channel information, respectively, to obtain more accurate representations. A transformer decoder then fuses the three attention features in a way akin to the human visual system. Additionally, we propose a new Semantic-Aware positional encoding method for video clips that incorporates temporal information into the positional encoding and is scale-invariant. Through extensive experiments and comparisons with current methods, we demonstrate the effectiveness of STC in addressing the challenges of infrared small target detection. Our source code is available at https://github.com/UESTC-nnLab/STC.
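The three-encoder-plus-decoder fusion described above can be sketched with standard transformer modules as below. Token layouts, depths, dimensions, and the learnable query are assumptions; the sketch only illustrates the overall spatial/temporal/channel structure, not the authors' STC architecture.

```python
import torch
import torch.nn as nn

class STCSketch(nn.Module):
    """Illustrative spatial/temporal/channel encoders whose outputs are fused by a
    transformer decoder (dimensions and token layouts are assumptions)."""
    def __init__(self, dim=128, heads=4, n_queries=16):
        super().__init__()
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.spatial_enc, self.temporal_enc, self.channel_enc = make_enc(), make_enc(), make_enc()
        self.query = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.fuse = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=2)

    def forward(self, spatial_tokens, temporal_tokens, channel_tokens):
        # each *_tokens: (B, N_*, dim)
        memory = torch.cat([self.spatial_enc(spatial_tokens),
                            self.temporal_enc(temporal_tokens),
                            self.channel_enc(channel_tokens)], dim=1)
        query = self.query.expand(spatial_tokens.size(0), -1, -1)
        return self.fuse(query, memory)          # fused representation (B, n_queries, dim)
```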
Citations: 0
DALSCLIP: Domain aggregation via learning stronger domain-invariant features for CLIP
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2024.105359
Yuewen Zhang, Jiuhang Wang, Hongying Tang, Ronghua Qin
When test data follow a different distribution from the training data, neural networks suffer from domain shift. This issue can be addressed with domain generalization (DG), which aims to develop models that perform well on unknown domains. In this paper, we propose a simple yet effective framework called DALSCLIP to achieve high-performance generalization of CLIP (Contrastive Language-Image Pre-training) in DG. Specifically, we optimize CLIP in two respects: images and prompts. For images, we propose a method that removes domain-specific features from input images and learns better domain-invariant features: we first train a classifier for each domain to learn its domain-specific information, and then learn a mapping that removes this information. For prompts, we design a lightweight optimizer (an attention-based MLP) that automatically optimizes the prompts and incorporates domain-specific information into the input, helping the prompts adapt better to each domain. Meanwhile, we freeze the network parameters during training to retain as much of the pre-trained model's information as possible. We extensively evaluate our model on three public datasets. Qualitative and quantitative experiments demonstrate that our framework significantly outperforms other baselines.
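The prompt side of the framework, an attention-based MLP that refines learnable prompt tokens while the CLIP backbone stays frozen, can be pictured roughly as follows. The token count, feature dimension, and the way image features condition the prompts are assumptions for illustration, not the paper's exact optimizer.

```python
import torch
import torch.nn as nn

class PromptOptimizer(nn.Module):
    """Illustrative attention-based MLP that refines learnable prompt tokens using
    image features from a frozen CLIP image encoder (encoder not shown)."""
    def __init__(self, dim=512, n_ctx=8, heads=8):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # learnable prompt context
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_feats):              # image_feats: (B, N, dim) frozen CLIP features
        ctx = self.ctx.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attended, _ = self.attn(ctx, image_feats, image_feats)    # image-conditioned attention
        return ctx + self.mlp(attended)          # refined prompt tokens (B, n_ctx, dim)
```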
Citations: 0
FSBI: Deepfake detection with frequency enhanced self-blended images
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-02-01 · DOI: 10.1016/j.imavis.2025.105418
Ahmed Abul Hasanaath, Hamzah Luqman, Raed Katib, Saeed Anwar
Advances in deepfake research have led to image manipulations that are nearly undetectable to the human eye and to some deepfake detection tools. Recently, several techniques have been proposed to differentiate deepfakes from real images and videos. This study introduces a frequency-enhanced self-blended images (FSBI) approach for deepfake detection. The approach uses discrete wavelet transforms (DWT) to extract discriminative features from self-blended images (SBIs), which are then used to train a convolutional network architecture. An SBI blends an image with itself after introducing several forgery artifacts into a copy of the image; this prevents the classifier from overfitting to specific artifacts and encourages more generic representations. The blended images are then fed into the frequency feature extractor to detect artifacts that are not easily detected in the time domain. The proposed approach was evaluated on the FF++ and Celeb-DF datasets, and the results outperform state-of-the-art techniques under the cross-dataset evaluation protocol, achieving an AUC of 95.49% on Celeb-DF. It also achieves competitive performance in the within-dataset evaluation setup. These results highlight the robustness and effectiveness of the method in addressing the challenging generalization problem inherent in deepfake detection. The code is available at https://github.com/gufranSabri/FSBI.
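The frequency feature extraction step can be illustrated with a single-level 2D discrete wavelet transform applied per channel, as sketched below using PyWavelets. The wavelet choice, decomposition level, and subband stacking are assumptions; the paper's exact DWT configuration may differ.

```python
import numpy as np
import pywt

def dwt_features(image, wavelet="haar"):
    """Illustrative frequency features: a single-level 2D DWT per colour channel,
    stacking the approximation and detail subbands as extra channels."""
    # image: (H, W, 3) float array, e.g. a self-blended image scaled to [0, 1]
    bands = []
    for c in range(image.shape[-1]):
        cA, (cH, cV, cD) = pywt.dwt2(image[..., c], wavelet)
        bands.extend([cA, cH, cV, cD])            # 4 subbands per channel
    return np.stack(bands, axis=-1)               # roughly (H/2, W/2, 12) features
```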
Citations: 0