
Latest Articles in Pattern Recognition

Learning physical-aware diffusion priors for zero-shot restoration of scattering-affected images
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-22 | DOI: 10.1016/j.patcog.2025.111473
Yuanjian Qiao, Mingwen Shao, Lingzhuang Meng, Wangmeng Zuo
Zero-shot image restoration methods built on pre-trained diffusion models, which tackle image degradation without requiring paired data, have recently achieved remarkable success. However, these methods struggle to handle real-world images with intricate nonlinear scattering degradations due to the lack of physical knowledge. To address this challenge, we propose a novel Physical-aware Diffusion model (PhyDiff) for zero-shot restoration of scattering-affected images, which involves two crucial physical guidance strategies: Transmission-guided Conditional Generation (TCG) and Prior-aware Sampling Regularization (PSR). Specifically, TCG exploits the transmission map, which reflects the degradation density, to dynamically guide the restoration of different corrupted regions during the reverse diffusion process. Simultaneously, PSR leverages the inherent statistical properties of natural images to regularize the sampling output, thereby improving the quality of the recovered image. With these guidance schemes, PhyDiff achieves high-quality restoration of multiple nonlinear degradations in a zero-shot manner. Extensive experiments on real-world degraded images demonstrate that our method outperforms existing methods both quantitatively and qualitatively.
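As a rough illustration of the TCG idea, the sketch below weights a data-consistency correction by the transmission map during one reverse-diffusion step, so that heavily scattered regions (low transmission) are pulled more strongly toward the observation. The update rule, the weighting form, and all function and parameter names are illustrative assumptions, not the authors' released implementation.

```python
import torch

def transmission_guided_estimate(x_t, eps_pred, y_degraded, transmission,
                                 alpha_bar_t, guidance_scale=1.0):
    """Hedged sketch of transmission-weighted guidance for one reverse step.

    x_t          : current noisy sample, shape (B, C, H, W)
    eps_pred     : noise predicted by the pre-trained diffusion model
    y_degraded   : observed scattering-affected image
    transmission : per-pixel transmission map in [0, 1]; small values mark
                   densely scattered regions that need stronger guidance
    alpha_bar_t  : cumulative noise-schedule coefficient at step t (float)
    """
    # Clean-image estimate from the usual DDPM parameterisation.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5

    # Spatially varying guidance: regions with low transmission are pulled
    # more strongly toward consistency with the observed image.
    weight = guidance_scale * (1.0 - transmission)
    return x0_pred + weight * (y_degraded - x0_pred)

# Toy shapes only; a real sampler would call this inside its reverse loop.
x0_hat = transmission_guided_estimate(
    torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64),
    torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64), alpha_bar_t=0.5)
```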
Citations: 0
AAGCN: An adaptive data augmentation for graph contrastive learning
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-22 | DOI: 10.1016/j.patcog.2025.111471
Peng Qin, Yaochun Lu, Weifu Chen, Defang Li, Guocan Feng
Contrastive learning has achieved great success in many applications. A key step in contrastive learning is to find a positive sample and negative samples. Traditional methods find the positive sample by choosing the most similar sample. A more popular approach is data augmentation, where the original data and the augmented data are naturally treated as a positive pair. Augmentation is easy for grid data; for example, an image can be rotated, cropped, or color-jittered to obtain augmented views. However, augmentation is challenging for graph data because of the non-Euclidean nature of graphs. Current graph augmentation methods mainly focus on masking nodes, dropping edges, or extracting subgraphs. Such methods lack flexibility and require intensive manual settings. In this work, we propose a model called Adaptive Augmentation Graph Convolutional Network (AAGCN) for semi-supervised node classification, based on adaptive graph augmentation. Rather than drawing node or edge drops from a fixed probability distribution (for example, a Bernoulli distribution, as in dropout), the proposed model learns the mask matrices for nodes and edges adaptively. Experiments on citation networks such as Cora, CiteSeer and Cora-ML show that AAGCN achieved state-of-the-art performance compared with other popular graph neural networks. The proposed model was also tested on a more challenging, large-scale graph dataset, OGBN-Arxiv, which has 169,343 nodes and 1,166,243 edges, and still achieved competitive prediction results.
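The contrast with fixed-probability dropout can be pictured with a learnable mask layer: instead of sampling drops from a Bernoulli distribution, a trainable logit per edge is squashed to (0, 1) and scales the edge weights. The parameterisation below is an illustrative assumption, not the authors' exact design.

```python
import torch
import torch.nn as nn

class LearnableEdgeMask(nn.Module):
    """Adaptive edge masking: a trainable logit per edge replaces the fixed
    drop probability used in ordinary edge dropout."""

    def __init__(self, num_edges):
        super().__init__()
        self.edge_logits = nn.Parameter(torch.zeros(num_edges))

    def forward(self, edge_weight):
        # Sigmoid keeps the learned mask in (0, 1); multiplying the edge
        # weights softly "drops" edges the model learns to down-weight.
        mask = torch.sigmoid(self.edge_logits)
        return edge_weight * mask

# Usage sketch: scale the weights of a graph with 5 edges before passing
# them to any GCN layer that accepts per-edge weights.
masker = LearnableEdgeMask(num_edges=5)
augmented = masker(torch.ones(5))
```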
Citations: 0
AdaNet: A competitive adaptive convolutional neural network for spectral information identification
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-22 | DOI: 10.1016/j.patcog.2025.111472
Ziyang Li, Yang Yu, Chongbo Yin, Yan Shi
Spectral analysis-based non-destructive testing techniques can monitor food authenticity, quality changes, and traceability. Convolutional neural networks (CNNs) are widely used for spectral information processing and decision-making because they can effectively extract features from spectral data. However, CNNs introduce redundancy in feature extraction, thereby wasting computational resources. This paper proposes a competitive adaptive CNN (AdaNet) to address these challenges. First, adaptive convolution (AdaConv) is used to select spectral features based on channel attention and optimize computational resource allocation. Second, a Gaussian-initialized parameter matrix is applied to rescale spatial relationships and reduce redundancy. Finally, a self-attention mask is employed to mitigate the information loss due to convolution and speed up the convergence of AdaConv. We evaluate AdaNet's performance against other advanced methods. The results show that AdaNet outperforms state-of-the-art techniques, achieving average accuracies of 99.10% and 98.50% on datasets 1 and 2, respectively. This work provides a viable approach for enhancing the engineering application of spectral analysis techniques. Code is available at https://github.com/Ziyang-Li-AILab/AdaNet.
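A channel-attention gate that re-weights spectral feature channels before a 1-D convolution gives a rough picture of what "adaptive convolution guided by channel attention" could look like. The squeeze-and-excitation-style gating, the reduction ratio, and the class name below are assumptions for illustration, not the released AdaConv code.

```python
import torch
import torch.nn as nn

class ChannelGated1dConv(nn.Module):
    """Channel-attention gate over spectral channels followed by a 1-D
    convolution -- a hedged stand-in for an AdaConv-like block."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                      # squeeze along the spectral axis
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel importance in (0, 1)
        )
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                 # x: (batch, channels, bands)
        return self.conv(x * self.gate(x))

x = torch.randn(2, 16, 128)                               # 2 spectra, 16 channels, 128 bands
out = ChannelGated1dConv(16)(x)                            # same shape as the input
```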
Citations: 0
Tensor Transformer for hyperspectral image classification
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-22 | DOI: 10.1016/j.patcog.2025.111470
Wei-Tao Zhang, Yv Bai, Sheng-Di Zheng, Jian Cui, Zhen-zhen Huang
Hyperspectral images (HSIs) are widely used in real-world classification tasks because they contain rich spatial and spectral features spanning hundreds of continuous bands. In recent years, deep learning-based HSI classification methods, such as the convolutional neural network (CNN) and the Transformer, have achieved good performance. Transformer-based networks, owing to their remarkable capacity to extract long-range features, frequently outperform CNN-based networks in HSI classification. However, Transformer-based methods require sequentialization of the raw 3-D HSI data, potentially disrupting the spatial–spectral structure and degrading classification accuracy. In this paper, we propose a Tensor Transformer (TT) framework for HSI classification. The TT model is an end-to-end network that takes the raw HSI tensor directly as input, without requiring sequentialization of the raw data. The core component of the proposed framework is the Tensor Self-Attention Mechanism (TSAM), which enables the network to efficiently extract long-range spatial–spectral structural features without losing the inherent structural relationships within each sample. Extensive experiments on four widely used HSI datasets demonstrate that the proposed TT model achieves superior classification performance in discriminating land features with similar spectra compared to state-of-the-art methods.
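The abstract does not give TSAM's exact formulation. As a rough, assumption-laden illustration of feeding the raw cube directly, the sketch below produces queries, keys, and values with 3-D convolutions over an HSI patch and computes a generic self-attention with a residual connection; the actual TSAM may differ substantially.

```python
import torch
import torch.nn as nn

class TensorSelfAttention(nn.Module):
    """Rough sketch only: Q/K/V come from 3-D convolutions over the raw HSI
    cube, so no hand-crafted sequentialisation of the input is required."""

    def __init__(self, channels=8):
        super().__init__()
        self.q = nn.Conv3d(channels, channels, kernel_size=1)
        self.k = nn.Conv3d(channels, channels, kernel_size=1)
        self.v = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, bands, H, W)
        b, c, d, h, w = x.shape
        q = self.q(x).flatten(2)                # (B, C, bands*H*W)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / (c ** 0.5), dim=-1)
        out = (v @ attn.transpose(1, 2)).view(b, c, d, h, w)
        return x + out                          # residual keeps the cube layout

patch = torch.randn(2, 8, 16, 7, 7)             # a 7x7 spatial patch with 16 bands
out = TensorSelfAttention()(patch)
```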
Citations: 0
DiffFace: Diffusion-based face swapping with facial guidance
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-21 | DOI: 10.1016/j.patcog.2025.111451
Kihong Kim, Yunho Kim, Seokju Cho, Junyoung Seo, Jisu Nam, Kychul Lee, Seungryong Kim, KwangHee Lee
We propose a novel diffusion-based framework for face swapping, called DiffFace. Unlike previous GAN-based models, which inherit the difficulties of GAN training, an ID-conditional DDPM is trained to produce face images with a specified identity. During sampling, off-the-shelf facial expert models are employed to ensure that the model transfers the source identity while maintaining target attributes such as structure and gaze. In addition, target-preserving blending effectively preserves the expression of the target image against noise while reflecting environmental context such as background and lighting. The proposed method enables controlling the trade-off between identity and shape without any re-training. Compared with previous GAN-based methods, DiffFace achieves high fidelity and controllability. Extensive experiments show that DiffFace is comparable or superior to state-of-the-art methods.
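Guidance from an off-the-shelf facial expert typically resembles classifier guidance: the predicted clean face is scored by an identity network and the gradient of that score nudges the sample toward the source identity. The loss, step size, and toy encoder below are assumptions for illustration; DiffFace's actual guidance combines several expert models.

```python
import torch
import torch.nn.functional as F

def identity_guidance(x0_pred, source_embedding, id_encoder, scale=0.5):
    """Hedged sketch: nudge the predicted clean face toward the source
    identity using the gradient of an identity (cosine) loss."""
    x = x0_pred.detach().requires_grad_(True)
    id_loss = 1.0 - F.cosine_similarity(id_encoder(x), source_embedding, dim=-1).mean()
    grad, = torch.autograd.grad(id_loss, x)
    return x0_pred - scale * grad              # step against the identity loss

# Toy stand-in for an off-the-shelf face-recognition expert (e.g., an
# ArcFace-style embedder); real guidance would use a pre-trained model.
id_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
guided = identity_guidance(torch.randn(1, 3, 64, 64), torch.randn(1, 128), id_encoder)
```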
Citations: 0
Multi-definition Deepfake detection via semantics reduction and cross-domain training
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-21 | DOI: 10.1016/j.patcog.2025.111469
Cairong Zhao, Chutian Wang, Zifan Song, Guosheng Hu, Liang Wang, Duoqian Miao
The recent development of Deepfake videos directly threatens our information security and personal privacy. Although many previous works have made considerable progress on Deepfake detection, we empirically find that existing approaches do not perform well on low-definition (LD) and cross-definition (high and low) videos. To address this problem, we follow two motivations: (1) high-level semantics reduction and (2) cross-domain training. For (1), we propose Facial Structure Destruction and an Adversarial Jigsaw Loss to discourage the model from learning high-level semantics and to focus it on low-level discriminative information. For (2), we propose an adversarial domain generalization method and a spatial attention distillation that uses information from HD videos to guide LD videos. We conduct extensive experiments on the public FaceForensics++ and Celeb-DF v2 datasets. The results show the effectiveness of our method, which achieves very competitive performance against state-of-the-art methods. Surprisingly, we also find that our method is effective on the face anti-spoofing (FAS) task, as verified on the OULU-NPU dataset.
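One common way to realise spatial attention distillation is to match channel-pooled attention maps between a teacher fed high-definition frames and a student fed low-definition frames. The pooling, normalisation, and loss below follow the generic attention-transfer recipe and are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_attention(feat):
    """Channel-wise mean of squared activations, normalised per sample --
    a standard attention-map definition used in attention-transfer KD."""
    attn = feat.pow(2).mean(dim=1, keepdim=True)            # (B, 1, H, W)
    return F.normalize(attn.flatten(1), dim=1)

def attention_distill_loss(student_feat, teacher_feat):
    """L2 distance between the student's (LD) and teacher's (HD) maps."""
    if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
        teacher_feat = F.interpolate(teacher_feat, size=student_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
    return F.mse_loss(spatial_attention(student_feat),
                      spatial_attention(teacher_feat.detach()))

loss = attention_distill_loss(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 56, 56))
```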
Citations: 0
Inversed Pyramid Network with Spatial-adapted and Task-oriented Tuning for few-shot learning
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-20 | DOI: 10.1016/j.patcog.2025.111415
Xiaowei Zhao, Duorui Wang, Shihao Bai, Shuo Wang, Yajun Gao, Yu Liang, Yuqing Ma, Xianglong Liu
With the rapid development of artificial intelligence, deep neural networks have achieved strong performance on many tasks. However, traditional deep learning methods require large amounts of training data, which may not be available in certain practical scenarios. In contrast, few-shot learning aims to learn a model that can be readily adapted to new, unseen classes from only one or a few labeled examples. Most existing methods, however, rely on pre-trained feature extractors trained with global features, ignoring the discriminative power of local features, and their weak generalization capability limits performance. To address this problem, following the human coarse-to-fine cognition paradigm, we propose an Inverted Pyramid Network with Spatial-adapted and Task-oriented Tuning (TIPN) for few-shot learning. Specifically, the proposed framework builds local-feature representations for categories that are difficult to distinguish by global features and recognizes objects from both global and local perspectives. Moreover, to ensure the calibration validity of the proposed model at the local stage, we introduce a Spatial-adapted Layer that preserves the discriminative global representation ability of the pre-trained backbone network. Meanwhile, since representations extracted for past categories are not directly applicable to new tasks, we further propose a Task-oriented Tuning strategy that adjusts the parameters of the Batch Normalization layers in the pre-trained feature extractor, explicitly transferring knowledge from base classes to novel classes according to the support samples of each task. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly outperforms many state-of-the-art few-shot learning methods.
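Tuning only the Batch Normalization parameters of a frozen, pre-trained extractor on each task's support set can be sketched as below; the optimiser choice, learning rate, and number of steps are assumptions, and the helper name is hypothetical.

```python
import torch
import torch.nn as nn

def tune_batchnorm_on_support(model, support_x, support_y, steps=20, lr=1e-3):
    """Freeze everything except BatchNorm affine parameters, then take a few
    gradient steps on the support set of the current few-shot task."""
    for p in model.parameters():
        p.requires_grad_(False)
    bn_params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.affine:
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            bn_params += [m.weight, m.bias]

    opt = torch.optim.Adam(bn_params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(support_x), support_y).backward()
        opt.step()

# Toy backbone and a 5-way support set, for shape checking only.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5))
tune_batchnorm_on_support(net, torch.randn(10, 3, 32, 32), torch.randint(0, 5, (10,)))
```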
Citations: 0
Prompt-Ladder: Memory-efficient prompt tuning for vision-language models on edge devices
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-20 | DOI: 10.1016/j.patcog.2025.111460
Siqi Cai, Xuan Liu, Jingling Yuan, Qihua Zhou
Pre-trained vision-language models (VLMs) have become the foundation for diverse intelligent services. Common VLMs have large parameter counts and require heavy memory overhead for pre-training, which poses challenges for adapting them on edge devices. To enable memory-efficient VLMs, previous works mainly focus on prompt engineering, which uses trainable soft prompts instead of manually designed hard prompts. However, even though fewer than 3% of the parameters (the prompts) are updated, these methods still require the back-propagation chain to traverse the pre-trained model and its extensive parameters. Consequently, the intermediate activations and gradients occupy a significant amount of memory, greatly hindering adaptation on resource-constrained edge devices. In view of the above, we propose a memory-efficient prompt-tuning method named Prompt-Ladder. Our main idea is to adopt a lightweight ladder network as an agent that bypasses the VLM during back-propagation when optimizing the parameters of the designed multi-modal prompt module. The ladder network fuses the intermediate outputs of the VLM as a guide and is initialized from a selection of important VLM parameters to maintain model performance. We also share the ladder network's parameters between text and image data to obtain a more semantically aligned representation across modalities for optimizing the prompt module. Experiments across seven datasets demonstrate that Prompt-Ladder can reduce memory usage by at least 27% compared to baselines while maintaining relatively good performance.
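The memory saving comes from never back-propagating through the frozen VLM: intermediate activations are detached and fed to a small trainable side network. The toy backbone, the ladder fusion rule, and the class names below are assumptions that only illustrate the gradient flow, not the Prompt-Ladder architecture itself.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a frozen VLM encoder that exposes intermediate features."""
    def __init__(self, dim=32):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = torch.relu(blk(x))
            feats.append(x.detach())      # detach: no gradient flows into the VLM
        return feats

class Ladder(nn.Module):
    """Lightweight side network trained on the detached features."""
    def __init__(self, dim=32, num_classes=10):
        super().__init__()
        self.rungs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):
        h = torch.zeros_like(feats[0])
        for rung, f in zip(self.rungs, feats):
            h = torch.relu(rung(f + h))   # fuse each backbone feature into the ladder state
        return self.head(h)

backbone, ladder = FrozenBackbone(), Ladder()
logits = ladder(backbone(torch.randn(4, 32)))   # only the ladder holds trainable parameters
```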
Citations: 0
AMLCA: Additive multi-layer convolution-guided cross-attention network for visible and infrared image fusion
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-20 | DOI: 10.1016/j.patcog.2025.111468
Dongliang Wang, Chuang Huang, Hao Pan, Yuan Sun, Jian Dai, Yanan Li, Zhenwen Ren
Multimodal image fusion is widely used in processing multispectral signals, e.g., visible and infrared images; it aims to create an information-rich fused image by combining complementary information from different wavebands. Current fusion methods face significant challenges in extracting complementary information from the sensors while simultaneously preserving local details and global dependencies. To address this challenge, we propose an additive multi-layer convolution-guided cross-attention network (AMLCA) for visible and infrared image fusion, which consists of two sub-modules: an additive cross-attention module (ACAM) and a wavelet convolution-guided transformer module (WCGTM). Specifically, the former enhances feature interaction and captures global holistic information through an additive cross-attention mechanism, while the latter uses wavelet convolution to guide the transformer, enhancing the preservation of details from both sources and improving the extraction of local detail information. Moreover, we propose a multi-layer fusion strategy that leverages hidden complementary features from various layers. AMLCA can therefore effectively extract complementary information from local details and global dependencies, significantly enhancing overall performance. Extensive experiments and ablation analysis on public datasets demonstrate the superiority and effectiveness of AMLCA. The source code is available at https://github.com/Wangdl2000/AMLCA-code.
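Cross-attention in which the visible branch queries the infrared branch (and vice versa) is the standard way to let the two modalities exchange complementary information. The single fusion layer below, the head count, and the simple additive residual combination are assumptions for illustration, not the ACAM design itself.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One visible<->infrared cross-attention exchange with additive fusion."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.vis_to_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, ir_tokens):
        # Each modality attends to the other; the residual add keeps its own content.
        vis_out, _ = self.vis_to_ir(vis_tokens, ir_tokens, ir_tokens)
        ir_out, _ = self.ir_to_vis(ir_tokens, vis_tokens, vis_tokens)
        return (vis_tokens + vis_out) + (ir_tokens + ir_out)   # additive fusion of both streams

fused = CrossAttentionFusion()(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```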
Citations: 0
Multi-Level Knowledge Distillation with Positional Encoding Enhancement
IF 7.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-02-18 | DOI: 10.1016/j.patcog.2025.111458
Lixiang Xu, Zhiwen Wang, Lu Bai, Shengwei Ji, Bing Ai, Xiaofeng Wang, Philip S. Yu
In recent years, Graph Neural Networks (GNNs) have achieved substantial success on graph-related tasks. Knowledge Distillation (KD), a classical technique for model compression and acceleration, has increasingly been adopted in graph learning to transfer the predictive power of trained GNN models to lightweight, easily deployable Multi-Layer Perceptron (MLP) models. However, this approach often neglects node positional features and relies solely on GNN-generated labels to train the MLP on node content features. Moreover, it depends heavily on local information aggregation, making it difficult to capture global graph structure and thereby limiting performance on node classification. To address this issue, we propose Multi-Level Knowledge Distillation with Positional Encoding Enhancement (MLKD-PE). Our method employs a positional encoding technique to generate node positional features, which are combined with node content features to enhance the MLP's ability to perceive node positions. Additionally, we introduce a multi-level KD technique that aligns the student model's final output with the teacher's output and incorporates intermediate-layer outputs from the teacher for more detailed knowledge transfer. Experimental results demonstrate that our method significantly improves classification accuracy across multiple datasets compared to the baseline model, confirming its superiority on node classification tasks.
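A training objective of this kind typically combines the usual soft-label GNN-to-MLP distillation with an extra term that aligns an intermediate student layer to an intermediate teacher layer, while positional encodings are concatenated to the node content features. The loss weights, temperature, and the choice of positional encoding below are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def mlkd_loss(student_logits, student_hidden, teacher_logits, teacher_hidden,
              labels, alpha=0.5, beta=0.1, tau=2.0):
    """Cross-entropy + soft-label KD on the outputs + L2 alignment of one
    intermediate layer (the 'multi-level' part of the distillation)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    hid = F.mse_loss(student_hidden, teacher_hidden.detach())
    return ce + alpha * kd + beta * hid

# Node features are assumed to be [content || positional encoding]:
content, pos_enc = torch.randn(100, 128), torch.randn(100, 16)
mlp_input = torch.cat([content, pos_enc], dim=-1)     # (100, 144) MLP input
```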
Citations: 0