Universal image restoration via task-adaptive diffusion degradation oriented model
Pub Date: 2026-01-29 | DOI: 10.1016/j.patcog.2026.113193
Junxi Wu, Sicheng Pan, Naiqi Li, Bin Chen, Baoyi An, Zhi Wang, Yaowei Wang, Shu-Tao Xia
Image restoration covers many sub-tasks, including image super-resolution, inpainting, deblurring, and compressed sensing. However, existing methods often struggle to balance generality across tasks with specificity to degradation patterns. Multi-task methods rely on generative priors but neglect the diversity of degradation operators, leading to degraded performance and hallucinations, while task-specific methods cannot capture the generality shared across image restoration tasks. In this work, we introduce the Task-Adaptive Diffusion Degradation Oriented Model (DDOM), which bridges this gap by integrating a pre-trained diffusion model as a general generative prior with lightweight Degradation Oriented Adapters (DO-Adapters) that align task-specific knowledge. DO-Adapters extract task-specific priors and refine the diffusion process at each timestep, guiding the diffusion model toward accurate restoration while reducing hallucinations. This design decouples task adaptation from the pre-trained model, enabling plug-and-play deployment across tasks with low computational overhead (0.36% of the pre-trained model). Experimental results demonstrate that DDOM outperforms leading multi-task methods while matching or surpassing task-specific methods in visual quality. Notably, DDOM exhibits strong generalization on out-of-distribution datasets and under extreme degradation, validating its effectiveness in unifying generality across tasks with specificity to degradation patterns.
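For readers who want a concrete picture of per-timestep adapter refinement, the following is a minimal PyTorch sketch of the general idea: a frozen denoiser predicts noise, and a small adapter conditioned on the degraded input nudges the estimated clean signal at every reverse step. All module names, sizes, and the DDPM-style schedule are placeholder assumptions for illustration, not the authors' DDOM implementation.

```python
# Illustrative only: placeholder denoiser/adapter, DDPM-style schedule, truncated demo loop.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class FrozenDenoiser(nn.Module):
    """Stand-in for a pre-trained diffusion model that predicts the added noise."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.float().view(-1, 1) / T], dim=-1))

class DOAdapter(nn.Module):
    """Lightweight task-specific correction conditioned on the degraded observation y."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x0_hat, y, t):
        return self.net(torch.cat([x0_hat, y, t.float().view(-1, 1) / T], dim=-1))

@torch.no_grad()
def restore(denoiser, adapter, y, steps=50):
    """Reverse diffusion where the adapter refines the predicted clean signal at each step."""
    x = torch.randn_like(y)
    for i in reversed(range(steps)):          # truncated schedule, just to keep the demo short
        t = torch.full((y.shape[0],), i, dtype=torch.long)
        a_bar = alpha_bars[i]
        eps = denoiser(x, t)
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        x0_hat = x0_hat + adapter(x0_hat, y, t)                 # per-timestep task refinement
        eps = (x - a_bar.sqrt() * x0_hat) / (1 - a_bar).sqrt()  # noise implied by the refined x0
        mean = (x - betas[i] / (1 - a_bar).sqrt() * eps) / alphas[i].sqrt()
        x = mean + betas[i].sqrt() * torch.randn_like(x) if i > 0 else mean
    return x

print(restore(FrozenDenoiser(), DOAdapter(), torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```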
{"title":"Universal image restoration via task-adaptive diffusion degradation oriented model","authors":"Junxi Wu , Sicheng Pan , Naiqi Li , Bin Chen , Baoyi An , Zhi Wang , Yaowei Wang , Shu-Tao Xia","doi":"10.1016/j.patcog.2026.113193","DOIUrl":"10.1016/j.patcog.2026.113193","url":null,"abstract":"<div><div>Image restoration covers many sub-tasks, including image super-resolution, inpainting, deblurring, compressed sensing, etc. However, existing methods often struggle to balance generality across tasks and specificity to degradation patterns. Multi-task methods relying on generative priors and neglect the diversity of degradation operators, leading to worse performance and hallucinations, while task-specific methods cannot capture the generality of different image restoration tasks. In this work, we introduce the Task-Adaptive Diffusion Degradation Oriented Model (DDOM), which bridges this gap by integrating a pre-trained diffusion model as a general generative prior with lightweight Degradation Oriented Adapters (DO-Adapters) to align task-specific knowledge. DO-Adapters extract task-specific priors and refine the diffusion process at each timestep, guiding the diffusion model toward accurate restoration while reducing hallucinations. This design decouples task adaptation from the pre-trained model, enabling plug-and-play deployment across tasks with low computation (0.36% of the pre-trained models). Experimental results demonstrate DDOM outperforms the superior multi-task methods, while matching or surpassing task-specific methods in visual quality. Notably, DDOM exhibits strong generalization in out-of-distribution datasets and extreme degradation scenarios, validating its effectiveness in unifying generality.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113193"},"PeriodicalIF":7.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MonoTDF: Temporal deep feature learning for generalizable monocular 3D object detection
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113184
Xiu-Zhi Chen, Yi-Kai Chiu, Chih-Sheng Huang, Yen-Lin Chen
Monocular 3D object detection has gained significant attention due to its cost-effectiveness and practicality in real-world applications. However, existing monocular methods often struggle with depth estimation and spatial consistency, limiting their accuracy in complex environments. In this work, we introduce a Temporal Deep Feature Learning framework, which enhances monocular 3D object detection by integrating temporal features across sequential frames. Our approach leverages a novel deep feature auxiliary module based on convolutional recurrent structures, effectively capturing spatiotemporal information to improve depth perception and detection robustness. The proposed module is model-agnostic and can be seamlessly integrated into various existing monocular detection frameworks. Extensive experiments across multiple state-of-the-art monocular 3D object detection models demonstrate consistent performance improvements, particularly in detecting small or partially occluded objects. Our results highlight the effectiveness and generalizability of the proposed approach, making it a promising solution for real-world autonomous perception systems. The source code of this work is at: https://github.com/Shuray36/MonoTDF-Temporal-Deep-Feature-Learning-for-Generalizable-Monocular-3D-Object-Detection.
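A minimal sketch of what a convolutional-recurrent temporal auxiliary module could look like is given below; the module names, channel sizes, and fusion layer are assumptions for illustration rather than the MonoTDF architecture.

```python
# Hypothetical ConvGRU-style temporal feature module (shapes and names are assumptions).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)   # update + reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)        # candidate hidden state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class TemporalFeatureAux(nn.Module):
    """Aggregates per-frame backbone features over time and enhances the current frame's map."""
    def __init__(self, ch=64):
        super().__init__()
        self.cell = ConvGRUCell(ch)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, feats):                  # feats: (B, T, C, H, W), ordered oldest -> newest
        h = torch.zeros_like(feats[:, 0])
        for t in range(feats.shape[1]):
            h = self.cell(feats[:, t], h)
        return self.fuse(torch.cat([feats[:, -1], h], dim=1))  # enrich the current frame

feats = torch.randn(2, 4, 64, 32, 32)          # 4 consecutive frames of backbone features
print(TemporalFeatureAux()(feats).shape)       # torch.Size([2, 64, 32, 32])
```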
{"title":"MonoTDF: Temporal deep feature learning for generalizable monocular 3D object detection","authors":"Xiu-Zhi Chen , Yi-Kai Chiu , Chih-Sheng Huang , Yen-Lin Chen","doi":"10.1016/j.patcog.2026.113184","DOIUrl":"10.1016/j.patcog.2026.113184","url":null,"abstract":"<div><div>Monocular 3D object detection has gained significant attention due to its cost-effectiveness and practicality in real-world applications. However, existing monocular methods often struggle with depth estimation and spatial consistency, limiting their accuracy in complex environments. In this work, we introduce a Temporal Deep Feature Learning framework, which enhances monocular 3D object detection by integrating temporal features across sequential frames. Our approach leverages a novel deep feature auxiliary module based on convolutional recurrent structures, effectively capturing spatiotemporal information to improve depth perception and detection robustness. The proposed module is model-agnostic and can be seamlessly integrated into various existing monocular detection frameworks. Extensive experiments across multiple state-of-the-art monocular 3D object detection models demonstrate consistent performance improvements, particularly in detecting small or partially occluded objects. Our results highlight the effectiveness and generalizability of the proposed approach, making it a promising solution for real-world autonomous perception systems. The source code of this work is at: <span><span>https://github.com/Shuray36/MonoTDF-Temporal-Deep-Feature-Learning-for-Generalizable-Monocular-3D-Object-Detection</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113184"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing sampling performance in XGBoost by ensemble feature engineering
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113169
Lingping Kong, Ponnuthurai Nagaratnam Suganthan, Václav Snášel, Varun Ojha, Jeng-Shyang Pan
Feature engineering is crucial for enhancing model performance, yet effectively combining multiple feature transformations to maximize their benefits remains a key challenge. In this study, we propose an approach that integrates various feature engineering techniques within the boosting steps of the XGBoost algorithm and adapts gradient-based one-sided sampling, forming an enhanced classifier named Feat-XGBoost. Feat-XGBoost aims to improve data representation and separation during model learning by iteratively applying feature transformations. We evaluated this approach on 61 diverse datasets, comparing its performance with 12 baseline classifiers, including standard XGBoost. The results show that Feat-XGBoost achieved improved accuracy on 36 datasets, with notable gains of 0.31 accuracy on the Balloon dataset and 13.5% on the hill-valley dataset. Across the 61 datasets, the method yields an average accuracy increase of 0.9080%, highlighting its effectiveness in enhancing model performance. These findings indicate that integrating multiple feature engineering strategies within the boosting framework can yield significant gains in model accuracy and robustness. We further propose a simple ensemble, the Mix-XGBoost classifier, which selects the final classifier based on validation results from both Feat-XGBoost and the baseline model. The results indicate that Mix-XGBoost enhances performance by leveraging the strengths of both classifiers. The source code will be publicly accessible after acceptance at https://github.com/lingping-fuzzy.
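The validation-based selection behind Mix-XGBoost can be illustrated with a small, self-contained sketch: train a baseline XGBoost model and a variant on engineered features, then keep whichever validates better. The toy feature transformations, synthetic data, and hyperparameters below are assumptions, not the paper's Feat-XGBoost pipeline.

```python
# Illustrative sketch of the validation-based "Mix" selection idea (not the paper's method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def engineer(X):
    """Toy feature engineering: append squared terms and pairwise products of the first 3 features."""
    prods = np.stack([X[:, i] * X[:, j] for i in range(3) for j in range(i + 1, 3)], axis=1)
    return np.hstack([X, X ** 2, prods])

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)

base = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1).fit(X_tr, y_tr)
feat = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1).fit(engineer(X_tr), y_tr)

# Keep whichever variant validates better, then report its test accuracy.
if accuracy_score(y_val, feat.predict(engineer(X_val))) >= accuracy_score(y_val, base.predict(X_val)):
    print("test acc (feature-engineered):", accuracy_score(y_te, feat.predict(engineer(X_te))))
else:
    print("test acc (baseline):", accuracy_score(y_te, base.predict(X_te)))
```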
{"title":"Enhancing sampling performance in XGBoost by ensemble feature engineering","authors":"Lingping Kong , Ponnuthurai Nagaratnam Suganthan , Václav Snášel , Varun Ojha , Jeng-Shyang Pan","doi":"10.1016/j.patcog.2026.113169","DOIUrl":"10.1016/j.patcog.2026.113169","url":null,"abstract":"<div><div>Feature engineering is crucial in enhancing model performance, yet effectively combining multiple feature transformations to maximize their benefits remains a key challenge. In this study, we propose an innovative approach that integrates various feature engineering techniques within the boosting steps of the XGBoost algorithm and adapts the gradient-based one-sided sampling, forming an enhanced classifier named Feat-XGBoost. Feat-XGBoost aims to improve data representation and separation in model learning by iteratively applying feature transformations. We evaluated this approach on 61 diverse datasets, comparing its performance with 12 baseline classifiers, including standard XGBoost. The results show that Feat-XGBoost achieved improved accuracy in 36 datasets, with a notable increase in accuracy of 0.31 in the <em>Balloon</em> dataset and 13.5% on the <em>hill-valley</em> dataset. Across 61 datasets, the method demonstrates an average accuracy increase of 0.9080%, highlighting its effectiveness in enhancing model performance. These findings indicate that integrating multiple feature engineering strategies within the boosting framework can yield significant gains in model accuracy and robustness. We propose a simple ensemble, the Mix-XGBoost classifier, which selects the final classifier based on validation results from both the Feat-XGBoost and the baseline model. The results indicate that Mix-XGBoost enhances performance by leveraging the strengths of both classifiers. The source code will be publicly accessible after acceptance at <span><span>https://github.com/lingping-fuzzy</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113169"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-efficient generalization for zero-shot composed image retrieval
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113187
Zining Chen, Zhicheng Zhao, Fei Su, Shijian Lu
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm and employs a mapping network to map the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede generalization due to the modality discrepancy and the distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework with two novel designs: a Textual Supplement (TS) module and a Semantic Sample Pool (SSP) module. The TS module exploits compositional textual semantics during training, enriching the pseudo-word token with more linguistic semantics and thus effectively mitigating the modality discrepancy. The SSP module exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating overfitting to the redundancy of large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms state-of-the-art (SOTA) methods with much less training data while saving substantial training and inference time in practical use.
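The pseudo-word-token mechanism that DeG builds on can be sketched as follows; the encoders below are stand-in modules and all dimensions, names, and the token-insertion position are assumptions, not the paper's actual VLM-based setup.

```python
# Minimal sketch of the pseudo-word-token idea with placeholder (frozen) encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IMG, D_TOK = 512, 512

class MappingNetwork(nn.Module):
    """Maps a frozen image embedding to a pseudo-word token in the text token-embedding space."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(D_IMG, 1024), nn.GELU(), nn.Linear(1024, D_TOK))

    def forward(self, img_emb):
        return self.mlp(img_emb)

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen text encoder: mean-pools a sequence of token embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_TOK, D_TOK)

    def forward(self, token_embs):               # (B, L, D_TOK)
        return F.normalize(self.proj(token_embs.mean(dim=1)), dim=-1)

mapper, text_enc = MappingNetwork(), ToyTextEncoder()
img_emb = torch.randn(2, D_IMG)                  # frozen image-encoder output for the reference image
caption_tokens = torch.randn(2, 6, D_TOK)        # token embeddings of the modification text
pseudo_token = mapper(img_emb).unsqueeze(1)      # the learned pseudo-word token
composed = text_enc(torch.cat([caption_tokens[:, :3], pseudo_token, caption_tokens[:, 3:]], dim=1))
print(composed.shape)                            # composed query embedding used to rank target images
```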
{"title":"Data-efficient generalization for zero-shot composed image retrieval","authors":"Zining Chen , Zhicheng Zhao , Fei Su , Shijian Lu","doi":"10.1016/j.patcog.2026.113187","DOIUrl":"10.1016/j.patcog.2026.113187","url":null,"abstract":"<div><div>Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization due to modality discrepancy and distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic Sample Pool (SSP) module. The TS module exploits compositional textual semantics during training, enhancing the pseudo-word token with more linguistic semantics and thus mitigating the modality discrepancy effectively. The SSP module exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating the overfitting issue from the redundancy of the large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms the state-of-the-art (SOTA) methods with much less training data, and saves substantial training and inference time for practical usage.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113187"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146081397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequency-aligned supervision for few-shot neural rendering
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113183
Su-Ji Jang, Ue-Hwan Kim
Neural rendering has shown significant potential in generating high-quality 3D scenes from sparse inputs. However, existing methods struggle to simultaneously capture both low-frequency global structures and high-frequency fine details, leading to suboptimal scene representations. To overcome this limitation, we propose a frequency-aligned supervision framework that explicitly separates the learning process into low-frequency and full-spectrum components. By introducing two sub-networks and aligning supervision signals at appropriate layers, our method enhances the formation of global structures while preserving fine details. Specifically, the low-frequency network (LFN) is supervised with low-pass targets (Gaussian-filtered images) to form global structures, while the full-spectrum network (FSN) is supervised with the original images to refine high-frequency details. The proposed approach is broadly applicable to MLP-based NeRF architectures without requiring major architectural modifications. Extensive experiments demonstrate that our method consistently improves PSNR, SSIM, and LPIPS across multiple NeRF variants and datasets, confirming its robustness in sparse input scenarios.
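A toy version of the dual supervision scheme is sketched below, assuming a simple coordinate-MLP renderer in place of a full NeRF: the low-frequency network regresses a Gaussian-blurred copy of the training view while the full-spectrum network regresses the original image.

```python
# Toy sketch of frequency-aligned supervision (blurred target for LFN, original for FSN).
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_blur(img, k=9, sigma=2.0):
    """Depthwise Gaussian blur used to build the low-pass supervision target."""
    ax = torch.arange(k) - k // 2
    g1 = torch.exp(-(ax.float() ** 2) / (2 * sigma ** 2))
    g1 = g1 / g1.sum()
    kern = (g1[:, None] * g1[None, :]).expand(img.shape[1], 1, k, k).contiguous()
    return F.conv2d(img, kern, padding=k // 2, groups=img.shape[1])

def coord_mlp():
    return nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 3))

H = W = 32
target = torch.rand(1, 3, H, W)                       # stand-in for a training view
low_target = gaussian_blur(target)                    # low-pass (Gaussian-filtered) target
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

lfn, fsn = coord_mlp(), coord_mlp()                   # low-frequency and full-spectrum networks
opt = torch.optim.Adam(list(lfn.parameters()) + list(fsn.parameters()), lr=1e-3)
for step in range(200):
    pred_low = lfn(coords).reshape(1, H, W, 3).permute(0, 3, 1, 2)
    pred_full = fsn(coords).reshape(1, H, W, 3).permute(0, 3, 1, 2)
    loss = F.mse_loss(pred_low, low_target) + F.mse_loss(pred_full, target)  # frequency-aligned targets
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```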
{"title":"Frequency-aligned supervision for few-shot neural rendering","authors":"Su-Ji Jang, Ue-Hwan Kim","doi":"10.1016/j.patcog.2026.113183","DOIUrl":"10.1016/j.patcog.2026.113183","url":null,"abstract":"<div><div>Neural rendering has shown significant potential in generating high-quality 3D scenes from sparse inputs. However, existing methods struggle to simultaneously capture both low-frequency global structures and high-frequency fine details, leading to suboptimal scene representations. To overcome this limitation, we propose a frequency-aligned supervision framework that explicitly separates the learning process into low-frequency and full-spectrum components. By introducing two sub-networks and aligning supervision signals at appropriate layers, our method enhances the formation of global structures while preserving fine details. Specifically, the low-frequency network (LFN) is supervised with low-pass targets (Gaussian-filtered images) to form global structures, while the full-spectrum network (FSN) is supervised with the original images to refine high-frequency details. The proposed approach is broadly applicable to MLP-based NeRF architectures without requiring major architectural modifications. Extensive experiments demonstrate that our method consistently improves PSNR, SSIM, and LPIPS across multiple NeRF variants and datasets, confirming its robustness in sparse input scenarios.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113183"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint asymmetric discrete hashing for cross-modal retrieval
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113180
Jiaxing Li, Lin Jiang, Zuopeng Yang, Xiaozhao Fang, Shengli Xie, Yong Xu
Cross-modal hashing is one of the most promising practical approaches to information retrieval for multimedia data. However, several technical hurdles remain, e.g., how to further reduce the heterogeneous semantic gaps between cross-modal data, how to extract cross-modal knowledge by jointly training on data from different modalities, and how to better leverage label information to generate more discriminative hash codes. To overcome these challenges, this paper proposes a joint asymmetric discrete hashing (JADH) method for cross-modal retrieval. By leveraging a kernel mapping operation, JADH extracts non-linear features of cross-modal data to better preserve semantic information when learning the latent common space. Then, a joint asymmetric hash-code learning term is designed to learn hash codes for data from different modalities jointly, so that more cross-modal information is preserved and the heterogeneous semantic gaps are effectively reduced. Finally, a log-likelihood similarity-preserving term is proposed to guide hash-code learning from the similarity matrix, and a classifier learning term further improves the quality of the learned hash codes. In addition, an alternating optimization algorithm is derived to solve the optimization problem in JADH efficiently. Experimental results on four widely used datasets show that JADH outperforms state-of-the-art baseline methods for hashing-based cross-modal retrieval in both accuracy and efficiency.
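As background for the kernelized hashing pipeline, the sketch below shows generic kernel feature mapping, binary code generation by sign thresholding, and Hamming ranking with random projections standing in for the learned hash functions; JADH's asymmetric, discrete, label-guided optimization is not reproduced here, so treat this purely as illustration.

```python
# Generic kernelized hashing and Hamming ranking (illustration only, not JADH's optimization).
import numpy as np

rng = np.random.default_rng(0)

def kernel_features(X, anchors, gamma=0.5):
    """RBF kernel mapping onto a set of anchor points (the non-linear feature step)."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

n_bits, n_anchors = 16, 64
img_feat = rng.normal(size=(500, 128))            # one modality (e.g. image descriptors)
txt_feat = rng.normal(size=(500, 300))            # the other modality (e.g. text descriptors)
anchors_img = img_feat[rng.choice(500, n_anchors, replace=False)]
anchors_txt = txt_feat[rng.choice(500, n_anchors, replace=False)]

# Random projections stand in for the learned modality-specific hash functions.
W_img = rng.normal(size=(n_anchors, n_bits))
W_txt = rng.normal(size=(n_anchors, n_bits))
B_img = np.sign(kernel_features(img_feat, anchors_img) @ W_img)   # +/-1 codes, one row per sample
B_txt = np.sign(kernel_features(txt_feat, anchors_txt) @ W_txt)

query = B_img[:1]                                  # image query -> rank text codes by Hamming distance
hamming = (n_bits - query @ B_txt.T) / 2           # for +/-1 codes: dist = (bits - dot) / 2
print(np.argsort(hamming[0])[:5])                  # indices of the 5 nearest cross-modal items
```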
{"title":"Joint asymmetric discrete hashing for cross-modal retrieval","authors":"Jiaxing Li , Lin Jiang , Zuopeng Yang , Xiaozhao Fang , Shengli Xie , Yong Xu","doi":"10.1016/j.patcog.2026.113180","DOIUrl":"10.1016/j.patcog.2026.113180","url":null,"abstract":"<div><div>Cross-modal hashing is one of the promising practical applications in information retrieval for multimedia data. However, there exist some technical hurdles, e.g., how to further reduce the heterogeneous gaps for cross-modal data semantically, how to extract cross-modal knowledge by jointly training data from different modality and how to better leverage the label information to generate more discriminative hash codes, etc. To overcome the above-mentioned challenges, this paper proposes a joint asymmetric discrete hashing (JADH for short) for cross-modal retrieval. By leveraging kernel mapping operation, JADH extracts the non-linear features of cross-modal data to better preserve the semantic information in the latent common space learning. Then, a joint asymmetric hash codes learning term is customized to learn hash codes for data from different modalities jointly. As such, more cross-modal information can be preserved, which can effectively reduce the heterogeneous semantic gaps. Finally, a log-likelihood similarity preserving term is proposed to boost hash codes learning from the similarity matrix, while a classifier learning term is proposed to further improve the quality of the learned hash codes. In addition, an alternative algorithm is derived to solve the optimization problem in JADH efficiently. Experimental results on four widely used datasets show that, JADH outperforms some state-of-the-art baseline methods in hashing-based cross-modal retrieval, on accuracy and efficiency.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113180"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PUA: Pseudo-features made useful again for robust graph node classification under distribution shift
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113185
Zihao Yin, Zhihai Wang, Haiyang Liu, Chuanlan Li, Muyun Yao, Shijiang Li, Fangjing Li, Jia Ren, Yanchao Yang
In graph learning tasks, distributional shifts between training and test data are widely observed, rendering the conventional assumption of independent and identically distributed (i.i.d.) data invalid. Such shifts pose substantial challenges to the generalization capability of graph neural networks. Existing approaches often focus on addressing a specific type of distributional bias, such as label selection bias or structural bias, yet in real-world scenarios the nature of such biases is typically unobservable in advance. As a result, models tailored to a single bias type lack general applicability and may fail under more complex conditions. Causal feature disentanglement has emerged as a promising strategy to mitigate the influence of spurious correlations by isolating features that are causally relevant to the classification task. However, under severe bias or when causal features are incompletely identified, relying solely on these features may be insufficient to capture all informative signals, thereby limiting the model's performance. To address this challenge, we propose a novel de-biased node classification framework named PUA (Pseudo-Features Made Useful Again), which integrates causal feature disentanglement with adaptive feature fusion. Specifically, PUA employs an attention mechanism to approximate the Markov boundary (MB), thereby disentangling causal and pseudo features at the feature level. It then performs adaptive selection on pseudo features to extract auxiliary information that may assist classification. Finally, causal and pseudo features are fused via a gating mechanism, resulting in robust node representations that are more resilient to various forms of distributional bias. Notably, PUA does not require prior knowledge of the bias type, making it broadly applicable to diverse scenarios. We conduct extensive experiments on six publicly available graph datasets under different types of distributional bias, including label selection bias, structural bias, mixture bias, and low-resource scenarios. The experimental results demonstrate that PUA consistently outperforms existing methods, achieving superior classification accuracy and robustness across all bias conditions.
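One way to picture the attention-based disentanglement and gated fusion is the following stand-alone module; the layer choices and names are hypothetical and only mirror the high-level description above, not PUA's actual architecture.

```python
# Hypothetical causal/pseudo splitting with adaptive selection and gated fusion (names are ours).
import torch
import torch.nn as nn

class CausalPseudoFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.causal_attn = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())    # per-dimension causal mask
        self.pseudo_select = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # adaptive pseudo selection
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())       # fusion gate

    def forward(self, h):                          # h: node representations (N, dim)
        a = self.causal_attn(h)
        h_causal, h_pseudo = a * h, (1 - a) * h    # feature-level disentanglement
        h_pseudo = self.pseudo_select(h_pseudo) * h_pseudo
        g = self.gate(torch.cat([h_causal, h_pseudo], dim=-1))
        return g * h_causal + (1 - g) * h_pseudo   # fused, de-biased representation

h = torch.randn(100, 64)                           # e.g. the output of any GNN encoder
print(CausalPseudoFusion(64)(h).shape)             # torch.Size([100, 64])
```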
{"title":"PUA : Pseudo-features made useful again for robust graph node classification under distribution shift","authors":"Zihao Yin , Zhihai Wang , Haiyang Liu , Chuanlan Li , Muyun Yao , Shijiang Li , Fangjing Li , Jia Ren , Yanchao Yang","doi":"10.1016/j.patcog.2026.113185","DOIUrl":"10.1016/j.patcog.2026.113185","url":null,"abstract":"<div><div>In graph learning tasks, distributional shifts between training and test data are widely observed, rendering the conventional assumption of independent and identically distributed (i.i.d.) data invalid. Such shifts pose substantial challenges to the generalization capability of graph neural networks. Existing approaches often focus on addressing a specific type of distributional bias–such as label selection bias or structural bias–yet in real-world scenarios, the nature of such biases is typically unobservable in advance. As a result, models tailored to a single bias type lack general applicability and may fail under more complex conditions. Causal feature disentanglement has emerged as a promising strategy to mitigate the influence of spurious correlations by isolating features that are causally relevant to the classification task. However, under severe bias or when causal features are incompletely identified, relying solely on these features may be insufficient to capture all informative signals, thereby limiting the model’s performance. To address this challenge, we propose a novel de-biased node classification framework named PUA (<strong>P</strong>seudo-Features Made <strong>U</strong>seful <strong>A</strong>gain), which integrates causal feature disentanglement with adaptive feature fusion. Specifically, PUA employs an attention mechanism to approximate the Markov boundary (MB), thereby disentangling causal and pseudo features at the feature level. It then performs adaptive selection on pseudo features to extract auxiliary information that may assist classification. Finally, causal and pseudo features are fused via a gating mechanism, resulting in robust node representations that are more resilient to various forms of distributional bias. Notably, PUA does not require prior knowledge of the bias type, making it broadly applicable to diverse scenarios. We conduct extensive experiments on six publicly available graph datasets under different types of distributional bias, including label selection bias, structural bias, mixture bias, and low-resource scenarios<span><span><sup>1</sup></span></span>. The experimental results demonstrate that PUA consistently outperforms existing methods, achieving superior classification accuracy and robustness across all bias conditions.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113185"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal-guided strength differential independence sample weighting for out-of-distribution generalization
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113179
Haoran Yu, Weifeng Liu, Yingjie Wang, Baodi Liu, Dapeng Tao, Honglong Chen
Most machine learning methods are vulnerable in the real open world because of unknown distribution shifts between the training and testing distributions. Out-of-Distribution (OOD) generalization aims to make stable predictions under such unknown shifts by exploring invariant patterns. One representative line of work is independence sample weighting, which learns a set of sample weights that remove dependencies between features, thereby eliminating spurious correlations and encouraging the model to capture the true relationship between features and labels for stable prediction. However, existing independence sample weighting methods indiscriminately eliminate correlations among all features, which discards critical information and harms model performance. To address this problem, we propose a causal-guided independence sample weighting (CIW) algorithm. CIW first evaluates the causal effect of features on labels by constructing a cross-domain-invariant directed acyclic graph (DAG). It then generates a strength-guiding mask based on these causal effects to differentially eliminate correlations between features, avoiding the redundant elimination of correlations among causal features. We perform extensive experiments in different experimental settings, and the results demonstrate the effectiveness and superiority of our method.
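The core mechanic, learning sample weights that decorrelate only the feature pairs flagged by a causal-strength mask, can be sketched in a few lines of PyTorch; the synthetic data and the hand-set binary mask below are assumptions standing in for CIW's DAG-based causal analysis.

```python
# Mask-guided independence sample weighting: shrink weighted covariance only for flagged pairs.
import torch

torch.manual_seed(0)
X = torch.randn(500, 8)
X[:, 1] = 0.8 * X[:, 0] + 0.2 * torch.randn(500)      # inject a spurious correlation

mask = torch.zeros(8, 8)
mask[0, 1] = mask[1, 0] = 1.0                          # pretend the causal analysis flags only this pair

logits = torch.zeros(500, requires_grad=True)          # sample weights parameterized via softmax
opt = torch.optim.Adam([logits], lr=0.05)
for step in range(300):
    w = torch.softmax(logits, dim=0)
    Xc = X - (w[:, None] * X).sum(0)                   # weighted centering
    cov = (w[:, None] * Xc).T @ Xc                     # weighted covariance matrix
    loss = ((mask * cov) ** 2).sum()                   # decorrelate only the masked pairs
    opt.zero_grad(); loss.backward(); opt.step()

w = torch.softmax(logits, dim=0).detach()
Xc = X - (w[:, None] * X).sum(0)
print("weighted cov(x0, x1):", float(((w[:, None] * Xc).T @ Xc)[0, 1]))   # driven toward zero
```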
{"title":"Causal-guided strength differential independence sample weighting for out-of-distribution generalization","authors":"Haoran Yu , Weifeng Liu , Yingjie Wang , Baodi Liu , Dapeng Tao , Honglong Chen","doi":"10.1016/j.patcog.2026.113179","DOIUrl":"10.1016/j.patcog.2026.113179","url":null,"abstract":"<div><div>Most machine learning methods often perform vulnerably in the real open world due to the unknown distribution shifts between training and testing distribution. Out-of-Distribution (OOD) generalization aims to make stable predictions under unknown distribution shifts by exploring invariant patterns to address this problem. One of the representative methods is independence sample weighting learning. It eliminates spurious correlations to make the model explore the true relationship between features and labels for stable prediction by learning a set of sample weights to eliminate dependencies between features. However, existing independence sample weighting methods roughly eliminate the correlation between all features, resulting in the loss of critical information and affecting the model’s performance. To address this problem, we propose a causal-guided independence sample weighting (CIW) algorithm. CIW first evaluates the causal effect of features on labels by constructing a cross domain-invariant directed acyclic graph (DAG). Subsequently, it generates a strength guiding mask based on the causal effect to differentially eliminate the correlation between different features avoiding redundant elimination of correlations between causal features. We perform extensive experiments in different experimental settings and experimental results demonstrate the effectiveness and superiority of our method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113179"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised multimodal emotion-unified representation learning with dual-level language-driven cross-modal emotion alignment
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113160
Shaoze Feng, Qiyin Zhou, Yuanyuan Liu, Ke Wang, Kejun Liu, Chang Tang
Unsupervised Multimodal Emotion Recognition (UMER) aims to infer affective states by integrating unannotated multimodal data, such as text, speech, and images. A key challenge is the substantial semantic gaps between modalities, spanning both global cross-modal emotion cues and local fine-grained emotion changes within each modality. Without annotation, existing methods struggle to effectively align and fuse cross-modal emotional semantics, resulting in suboptimal UMER performance. To address this challenge, we propose DLCEA, a Dual-level Language-Driven Cross-Modal Emotion Alignment framework for robust unsupervised multimodal emotion representation learning. DLCEA leverages intrinsic emotion semantics in text to guide cross-modal alignment and introduces a dual-level semantic alignment scheme: Text-guided Cross-modal Global Emotion Alignment (TGEA) and Text-guided Cross-modal Local Emotion Alignment (TLEA). Specifically, the TGEA module treats text as an alignment anchor and applies text-guided contrastive learning to align the global emotional features of audio and visual modalities with those of the text, achieving global emotion-level consistency across all three modalities. In parallel, TLEA incorporates an emotion-aware text masking strategy and text-guided audio/video reconstruction, enabling the model to capture subtle emotional cues and reinforce local-level cross-modal consistency, thereby further addressing fine-grained emotional alignment. By jointly modeling global and local emotional alignment, DLCEA learns unified and robust multimodal emotion representations in a fully unsupervised manner. Extensive experiments on multimodal datasets such as MAFW, MOSEI, and IEMOCAP demonstrate that DLCEA outperforms existing methods by a significant margin, achieving state-of-the-art performance. These results confirm the critical role of language-driven cross-modal emotional alignment in UMER. Code is available on https://github.com/Tank9971/DLCEA.
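The text-anchored global alignment (TGEA) can be pictured with a standard symmetric InfoNCE objective, as sketched below; the encoders are replaced by random feature tensors and the loss form is a common contrastive choice rather than the paper's exact formulation.

```python
# Text-anchored contrastive alignment of audio and visual global features (a common InfoNCE form).
import torch
import torch.nn.functional as F

def info_nce(anchor, other, temperature=0.07):
    a = F.normalize(anchor, dim=-1)
    b = F.normalize(other, dim=-1)
    logits = a @ b.T / temperature                     # (B, B) similarity matrix
    labels = torch.arange(a.shape[0])                  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

B, D = 16, 256
text_emb = torch.randn(B, D, requires_grad=True)       # global emotion features from a text encoder
audio_emb = torch.randn(B, D, requires_grad=True)
video_emb = torch.randn(B, D, requires_grad=True)

# Pull each modality's global feature toward its paired text anchor.
loss = info_nce(text_emb, audio_emb) + info_nce(text_emb, video_emb)
loss.backward()
print(float(loss))
```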
{"title":"Unsupervised multimodal emotion-unified representation learning with dual-level language-driven cross-modal emotion alignment","authors":"Shaoze Feng , Qiyin Zhou , Yuanyuan Liu , Ke Wang , Kejun Liu , Chang Tang","doi":"10.1016/j.patcog.2026.113160","DOIUrl":"10.1016/j.patcog.2026.113160","url":null,"abstract":"<div><div>Unsupervised Multimodal Emotion Recognition (UMER) aims to infer affective states by integrating unannotated multimodal data, such as text, speech, and images. A key challenge is the substantial semantic gaps between modalities, spanning both global cross-modal emotion cues and local fine-grained emotion changes within each modality. Without annotation, existing methods struggle to effectively align and fuse cross-modal emotional semantics, resulting in suboptimal UMER performance. To address this challenge, we propose <strong>DLCEA</strong>, a <em>Dual-level Language-Driven Cross-Modal Emotion Alignment</em> framework for robust unsupervised multimodal emotion representation learning. DLCEA leverages intrinsic emotion semantics in text to guide cross-modal alignment and introduces a dual-level semantic alignment scheme: <strong>Text-guided Cross-modal Global Emotion Alignment (TGEA)</strong> and <strong>Text-guided Cross-modal Local Emotion Alignment (TLEA)</strong>. Specifically, the TGEA module treats text as an alignment anchor and applies text-guided contrastive learning to align the global emotional features of audio and visual modalities with those of the text, achieving global emotion-level consistency across all three modalities. In parallel, TLEA incorporates an emotion-aware text masking strategy and text-guided audio/video reconstruction, enabling the model to capture subtle emotional cues and reinforce local-level cross-modal consistency, thereby further addressing fine-grained emotional alignment. By jointly modeling global and local emotional alignment, DLCEA learns unified and robust multimodal emotion representations in a fully unsupervised manner. Extensive experiments on multimodal datasets such as MAFW, MOSEI, and IEMOCAP demonstrate that DLCEA outperforms existing methods by a significant margin, achieving state-of-the-art performance. These results confirm the critical role of language-driven cross-modal emotional alignment in UMER. Code is available on <span><span>https://github.com/Tank9971/DLCEA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113160"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kernel entropy graph isomorphism network for graph classification
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113182
Lixiang Xu, Wei Ge, Feiping Nie, Enhong Chen, Bin Luo
Graph neural networks (GNNs) have been successfully applied to many graph classification tasks. However, most GNNs are built on the message-passing neural network (MPNN) framework, which makes it difficult to exploit the structural information of a graph from multiple perspectives. To address this limitation, we incorporate structural information into the graph embedding representation in two ways. On the one hand, the subgraph information in a node's neighborhood is injected into the GNN's message-passing process through graph entropy. On the other hand, we encode path information in the graph with the help of an improved shortest-path kernel. These two sources of structural information are then fused through an attention mechanism, which enriches the structural expressiveness of the graph neural network. Finally, the model is evaluated on seven publicly available graph classification datasets. Extensive experiments show that, compared with existing graph representation models, our model obtains better graph representations and achieves more competitive performance.
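A toy illustration of fusing two structural views of a graph with attention is given below; the degree-entropy statistic and the raw shortest-path-length histogram are crude stand-ins for the paper's graph-entropy message passing and improved shortest-path kernel.

```python
# Two structural views of a graph fused by a small attention layer (toy stand-ins, not the paper's model).
import networkx as nx
import numpy as np
import torch
import torch.nn as nn

def structural_views(G, max_len=10):
    degs = np.array([d for _, d in G.degree()], dtype=float)
    p = np.bincount(degs.astype(int), minlength=1).astype(float)
    p = p[p > 0] / p.sum()
    entropy = float(-(p * np.log(p)).sum())                       # degree-distribution entropy
    hist = np.zeros(max_len)
    for _, lengths in nx.all_pairs_shortest_path_length(G):       # shortest-path-length histogram
        for l in lengths.values():
            if 0 < l < max_len:
                hist[l] += 1
    return (torch.tensor([entropy], dtype=torch.float),
            torch.tensor(hist / max(hist.sum(), 1), dtype=torch.float))

class AttentionFusion(nn.Module):
    def __init__(self, d1, d2, dim=16):
        super().__init__()
        self.p1, self.p2 = nn.Linear(d1, dim), nn.Linear(d2, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, v1, v2):
        z = torch.stack([self.p1(v1), self.p2(v2)])               # (2, dim): one row per structural view
        alpha = torch.softmax(self.score(torch.tanh(z)), dim=0)   # attention weights over the two views
        return (alpha * z).sum(0)                                 # fused graph-level embedding

G = nx.erdos_renyi_graph(30, 0.15, seed=0)
v_ent, v_sp = structural_views(G)
print(AttentionFusion(1, 10)(v_ent, v_sp).shape)                  # torch.Size([16])
```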
{"title":"Kernel entropy graph isomorphism network for graph classification","authors":"Lixiang Xu , Wei Ge , Feiping Nie , Enhong Chen , Bin Luo","doi":"10.1016/j.patcog.2026.113182","DOIUrl":"10.1016/j.patcog.2026.113182","url":null,"abstract":"<div><div>Graph neural networks (GNNs) have been successfully applied to many graph classification tasks. However, most GNNs are based on message-passing neural network (MPNN) frameworks, making it difficult to utilize the structural information of the graph from multiple perspectives. To address the limitations of existing GNN methods, we incorporate structural information into graph embedding representation in two ways. On the one hand, the subgraph information in the neighborhood of a node is incorporated into the message passing process of GNN through graph entropy. On the other hand, we encode the path information in the graph with the help of an improved shortest path kernel. Then, these two parts of structural information are fused through the attention mechanism, which can capture the structural information of the graph and thus enrich the structural expression of graph neural network. Finally, the model is experimentally evaluated on seven publicly available graph classification datasets. Compared with the existing graph representation models, extensive experiments show that our model can better obtain graph representation and achieves more competitive performance.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113182"},"PeriodicalIF":7.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}