Tab2Visual: Deep learning for limited tabular data via visual representations and augmentation
Ahmed Mamdouh, Moumen El-Melegy, Samia Ali, Ron Kikinis
Pub Date: 2026-08-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.patcog.2026.113173 | Pattern Recognition, vol. 176, Article 113173
This research addresses the challenge of limited data in tabular data classification, a problem particularly prevalent in data-constrained domains such as healthcare. We propose Tab2Visual, a novel approach that transforms heterogeneous tabular data into visual representations, enabling the application of powerful deep learning models. Tab2Visual effectively addresses data scarcity by incorporating novel image augmentation techniques and facilitating transfer learning. We extensively evaluate the proposed approach on diverse tabular datasets, comparing its performance against a wide range of machine learning algorithms, including classical methods, tree-based ensembles, and state-of-the-art deep learning models specifically designed for tabular data. We also perform an in-depth analysis of factors influencing Tab2Visual’s performance. Our experimental results demonstrate that Tab2Visual outperforms other methods in classification problems with limited tabular data.
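The abstract does not specify the rendering scheme, so the following is a minimal sketch of the general tabular-to-image idea only: each row becomes a bar-style grayscale image a CNN can consume, with a toy noise augmentation. The bar rendering and the augmentation are illustrative assumptions, not the authors' design.

```python
import numpy as np

def row_to_image(row, img_size=64):
    """Render a 1-D feature vector as vertical bars in a grayscale image."""
    row = np.asarray(row, dtype=np.float32)
    # Min-max normalize features into [0, 1] so bar heights are comparable.
    lo, hi = row.min(), row.max()
    norm = (row - lo) / (hi - lo + 1e-8)
    img = np.zeros((img_size, img_size), dtype=np.float32)
    bar_w = img_size // len(row)
    for i, v in enumerate(norm):
        h = int(v * img_size)
        img[img_size - h:, i * bar_w:(i + 1) * bar_w] = 1.0
    return img

def augment(img, rng):
    """Toy image augmentation: additive Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)

rng = np.random.default_rng(0)
x = rng.random(8)                       # one synthetic tabular row, 8 features
img = augment(row_to_image(x), rng)
print(img.shape)                        # (64, 64) image ready for a pretrained CNN
```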
{"title":"Tab2Visual: Deep learning for limited tabular data via visual representations and augmentation","authors":"Ahmed Mamdouh , Moumen El-Melegy , Samia Ali , Ron Kikinis","doi":"10.1016/j.patcog.2026.113173","DOIUrl":"10.1016/j.patcog.2026.113173","url":null,"abstract":"<div><div>This research addresses the challenge of limited data in tabular data classification, particularly prevalent in domains with constraints like healthcare. We propose Tab2Visual, a novel approach that transforms heterogeneous tabular data into visual representations, enabling the application of powerful deep learning models. Tab2Visual effectively addresses data scarcity by incorporating novel image augmentation techniques and facilitating transfer learning. We extensively evaluate the proposed approach on diverse tabular datasets, comparing its performance against a wide range of machine learning algorithms, including classical methods, tree-based ensembles, and state-of-the-art deep learning models specifically designed for tabular data. We also perform an in-depth analysis of factors influencing Tab2Visual’s performance. Our experimental results demonstrate that Tab2Visual outperforms other methods in classification problems with limited tabular data.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113173"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amplitude-guided deep reinforcement learning for semi-supervised layer segmentation
Enting Gao, Zian Zha, Yonggang Li, Junhui Zhu, Yong Wang, Xinjian Chen, Naihui Zhou, Dehui Xiang
Pub Date: 2026-08-01 | DOI: 10.1016/j.patcog.2026.113204 | Pattern Recognition, vol. 176, Article 113204
Accurate segmentation of scalp tissue layers is essential for mechanistic studies and staging of androgenetic alopecia (AGA), a common form of hair loss that impacts quality of life and mental health. High-resolution magnetic resonance imaging (HR-MR) offers a promising assessment tool. However, accurate segmentation remains challenging due to the lack of large-scale annotated datasets, structural deformation, and low image quality. To address these issues, an Amplitude-guided Deep Reinforcement Learning (ADRL) framework is designed to decouple the data distributions of labeled images and adaptively fuse them into the distribution of unlabeled images. This enables effective feature learning of lamellar and asymmetrically thickened structures from both labeled and unlabeled data. Then, phase component alignment (PHA) is imposed to mitigate the adverse impacts of noise or artifacts. To further enhance the discriminative capability of the network, a Cross-Power Spectrum Correlation (CPSC) module is proposed to mitigate inaccurate segmentation of layer structures. Comprehensive experiments on a scalp HR-MR image dataset and a publicly available retinal OCT image dataset demonstrate that our method significantly outperforms state-of-the-art methods in semi-supervised layer segmentation.
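The amplitude guidance suggests Fourier-domain decoupling of amplitude (appearance) from phase (structure). Below is a minimal sketch of that general idea, assuming FDA-style amplitude mixing; the function name, the mixing weight lam, and the linear fusion rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def amplitude_fuse(labeled, unlabeled, lam=0.5):
    """Mix the FFT amplitude of a labeled image toward an unlabeled one,
    keeping the labeled image's phase (layer structure) intact."""
    fl, fu = np.fft.fft2(labeled), np.fft.fft2(unlabeled)
    amp = (1 - lam) * np.abs(fl) + lam * np.abs(fu)    # fused amplitude
    phase = np.angle(fl)                               # preserved phase
    fused = np.fft.ifft2(amp * np.exp(1j * phase))
    return np.real(fused)

rng = np.random.default_rng(0)
a, b = rng.random((64, 64)), rng.random((64, 64))      # stand-in image pair
print(amplitude_fuse(a, b).shape)                      # (64, 64)
```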
{"title":"Amplitude-guided deep reinforcement learning for semi-supervised layer segmentation","authors":"Enting Gao , Zian Zha , Yonggang Li , Junhui Zhu , Yong Wang , Xinjian Chen , Naihui Zhou , Dehui Xiang","doi":"10.1016/j.patcog.2026.113204","DOIUrl":"10.1016/j.patcog.2026.113204","url":null,"abstract":"<div><div>Accurate segmentation of scalp tissue layers is essential for mechanistic studies and staging of androgenetic alopecia (AGA), a common form of hair loss that impacts quality of life and mental health. High-resolution magnetic resonance imaging (HR-MR) offers a promising assessment tool. However, accurate segmentation remains challenging due to the lack of large-scale annotated datasets, structural deformation, and low image quality. To address these issues, an Amplitude-guided Deep Reinforcement Learning (ADRL) framework is designed to decouple the data distribution of images and adaptively fuse into the distribution of unlabeled images. This enables effective feature learning of lamellar and asymmetrically thickened structures from both labeled and unlabeled data. Then, phase component alignment (PHA) is imposed to mitigate the adverse impacts of noise or artifacts. To further enhance the discriminative capability of this network, a Cross-Power Spectrum Correlation (CPSC) module is proposed to mitigate inaccurate segmentation of layer structures. Comprehensive experiments on a scalp HR-MR image dataset and a publicly available retinal OCT image dataset demonstrate that our method significantly outperforms state-of-the-art methods in semi-supervised layer segmentation.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113204"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot incremental food recognition via cross-domain guided pseudo-targets
Minkang Chai, Lu Wei, Zheng Qian, Ran Zhang, Ye Zhu
Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.patcog.2026.113280 | Pattern Recognition, vol. 176, Article 113280
The explosive growth of global food culture has expanded the application scope of visual recognition; however, it has also introduced complex challenges arising from high intra-class variability and inter-class similarity. Existing systems struggle to address fine-grained confusion and the trade-off between retaining old knowledge and adapting to new information. Traditional methods are constrained by a heavy reliance on large-scale datasets, whereas emerging zero-shot techniques are prone to semantic hallucination when encountering unseen dishes, posing a severe challenge to precise recognition. To address these challenges, we propose the Cross-domain Guided Food Pseudo-Target Estimation (CFPE) framework, establishing a novel paradigm that is vision-led and semantically enhanced. First, to tackle the scarcity of incremental data, we utilize cross-domain adversarial training and an adaptive mask generator to synthesize high-quality pseudo-targets, thus establishing stable geometric anchors within the feature space. Second, by integrating Bessel Estimation Loss of Hypersphere (BELH) and Perturbation Margin Enhanced Prototype Regularization (PMEPR), we geometrically reconstruct the hyperspherical manifold distribution of features, effectively correcting estimation biases induced by few-shot samples. Crucially, we introduce a Food Factor-based Visual Semantic Consistency (FVSC) constraint, which explicitly decouples fine-grained visual confusion by injecting structured semantics. This is complemented by a depth-aware feature decoupling strategy to dynamically balance the plasticity and stability of the model. Experimental results demonstrate that CFPE achieves state-of-the-art performance across multiple benchmark datasets. It not only significantly improves incremental learning accuracy but also exhibits exceptional robustness in recognizing high-entropy food images.
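PMEPR's exact formulation is not given in the abstract; the sketch below shows a perturbation-enhanced margin loss over hyperspherical prototypes in the same general spirit. The function name, noise scale sigma, margin, and temperature are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def margin_prototype_loss(feats, prototypes, labels, margin=0.2, sigma=0.05):
    """Cosine prototype loss with a perturbation-enhanced margin: features are
    pulled toward their class prototype on the unit hypersphere, and Gaussian
    perturbation of the prototypes enforces a safety margin."""
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(prototypes + sigma * torch.randn_like(prototypes), dim=1)
    logits = feats @ protos.t()                       # cosine similarities
    # Subtract the margin from the true-class logit before cross-entropy.
    logits = logits - margin * F.one_hot(labels, protos.size(0)).float()
    return F.cross_entropy(logits / 0.1, labels)      # 0.1 = temperature

feats = torch.randn(16, 128)
protos = torch.randn(10, 128)
labels = torch.randint(0, 10, (16,))
print(margin_prototype_loss(feats, protos, labels).item())
```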
{"title":"Few-shot incremental food recognition via cross-domain guided pseudo-targets","authors":"Minkang Chai , Lu Wei , Zheng Qian , Ran Zhang , Ye Zhu","doi":"10.1016/j.patcog.2026.113280","DOIUrl":"10.1016/j.patcog.2026.113280","url":null,"abstract":"<div><div>The explosive growth of global food culture has expanded the application scope of visual recognition; however, it has introduced complex challenges arising from high intra-class variability and inter-class similarity. However, existing systems struggle to address fine-grained confusion and the trade-off between retaining old knowledge and adapting to new information. Traditional methods are constrained by a heavy reliance on large-scale datasets, whereas emerging zero-shot techniques are prone to semantic hallucination when encountering unseen dishes, thereby posing a severe challenge to precise recognition. To address these challenges, we propose the Cross-domain Guided Food Pseudo-Target Estimation (CFPE) framework, establishing a novel paradigm that is vision-led and semantically enhanced. First, to tackle the scarcity of incremental data, we utilize cross-domain adversarial training and an adaptive mask generator to synthesize high-quality pseudo-targets, thus establishing stable geometric anchors within the feature space. Second, by integrating Bessel Estimation Loss of Hypersphere (BELH) and Perturbation Margin Enhanced Prototype Regularization (PMEPR), we geometrically reconstruct the hyperspherical manifold distribution of features, effectively correcting estimation biases induced by few-shot samples. Crucially, we introduce a Food Factor-based Visual Semantic Consistency (FVSC) constraint, which explicitly decouples fine-grained visual confusion by injecting structured semantics. This is complemented by a depth-aware feature decoupling strategy to dynamically balance the plasticity and stability of the model. Experimental results demonstrate that CFPE achieves state-of-the-art performance across multiple benchmark datasets. It not only significantly improves incremental learning accuracy but also exhibits exceptional robustness in recognizing high-entropy food images.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113280"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial supervised contrastive feature learning for cross-modal retrieval
Xin Shu, Yikang Guo, Shou Gang Ren
Pub Date: 2026-08-01 | Epub Date: 2026-02-10 | DOI: 10.1016/j.patcog.2026.113256 | Pattern Recognition, vol. 176, Article 113256
Cross-modal hashing methods have attracted substantial interest in information retrieval because of their efficiency and low memory costs. Recent advancements in contrastive learning have greatly improved the retrieval performance of these hashing techniques. However, these approaches still encounter two significant drawbacks: (1) most current methods transform multimodal data into a common Hamming space to reduce the semantic gap, which may fail to capture the strong feature correlations across modalities; and (2) semantic similarity is represented as a binary value, neglecting the semantic relationships among multiple labels. To address these issues, we propose a novel adversarial supervised contrastive feature learning approach for cross-modal hashing. Specifically, we utilize a pre-trained CLIP model to extract multimodal features and apply contrastive learning to integrate these features effectively. Additionally, we introduce an adversarial feature learning mechanism to enhance the correlation between features from different modalities. Furthermore, we employ a graph convolutional network to model label correlations. Experimental results on benchmark datasets demonstrate the effectiveness and efficiency of our proposed method.
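A minimal sketch of the general pipeline the abstract describes: pre-extracted (e.g., CLIP) features, a tanh hash projection, and a supervised contrastive loss aligning image and text codes that share a label. The random tensors stand in for CLIP features; the bit width, temperature, and exact loss form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Project (e.g., CLIP) features to K soft hash bits in (-1, 1)."""
    def __init__(self, dim=512, bits=64):
        super().__init__()
        self.proj = nn.Linear(dim, bits)

    def forward(self, x):
        return torch.tanh(self.proj(x))   # binarize with sign() at retrieval time

def supervised_contrastive(img_codes, txt_codes, labels, tau=0.1):
    """Pull together image/text codes that share a label, push apart the rest."""
    sim = F.normalize(img_codes, dim=1) @ F.normalize(txt_codes, dim=1).t() / tau
    pos = (labels[:, None] == labels[None, :]).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

img_f, txt_f = torch.randn(8, 512), torch.randn(8, 512)  # stand-ins for CLIP features
labels = torch.randint(0, 3, (8,))
head = HashHead()
print(supervised_contrastive(head(img_f), head(txt_f), labels).item())
```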
{"title":"Adversarial supervised contrastive feature learning for cross-modal retrieval","authors":"Xin Shu, Yikang Guo, Shou Gang Ren","doi":"10.1016/j.patcog.2026.113256","DOIUrl":"10.1016/j.patcog.2026.113256","url":null,"abstract":"<div><div>Cross-modal hashing methods have attracted substantial interest in information retrieval because of their efficiency and low memory costs. Recent advancements in contrastive learning have greatly improved the retrieval performance of these hashing techniques. However, these approaches still encounter two significant drawbacks: (1) most current methods transform multimodal data into a common Hamming space to reduce the semantic gap, which may fail to capture the strong feature correlations across modalities; and (2) semantic similarity is represented as a binary value, neglecting the semantic relationships among multiple labels. To address these issues, we propose a novel adversarial supervised contrastive feature learning approach for cross-modal hashing. Specifically, we utilize a pre-trained CLIP model to extract multimodal features and apply contrastive learning to integrate these features effectively. Additionally, we introduce an adversarial feature learning mechanism to enhance the correlation between features from different modalities. Furthermore, we employ a graph convolutional network to model label correlations. Experimental results on benchmark datasets demonstrate the effectiveness and efficiency of our proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113256"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DuoNet: Joint optimization of representation learning and prototype classifier for unbiased scene graph generation
Zhaodi Wang, Biao Leng, Shuo Zhang
Pub Date: 2026-08-01 | Epub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113152 | Pattern Recognition, vol. 176, Article 113152
Unbiased Scene Graph Generation (SGG) aims to parse visual scenes into highly informative graphs under the long-tail challenge. While prototype-based methods have shown promise in unbiased SGG, they highlight the importance of learning discriminative features that are intra-class compact and inter-class separable. In this paper, we revisit prototype-based methods, analyze the critical roles of representation learning and the prototype classifier in driving unbiased SGG, and accordingly propose a novel framework, DuoNet. To enhance intra-class compactness, we introduce a Bi-Directional Representation Refinement (BiDR2) module that captures relation-sensitive visual variability and within-relation visual consistency of entities. This module adopts relation-to-entity-to-relation refinement by integrating dual-level relation pattern modeling with a relation-specific entity constraint. Furthermore, a Knowledge-Guided Prototype Learning (KGPL) module is devised to strengthen inter-class separability by constructing an equidistributed prototype classifier with maximum inter-class margins. The equidistributed prototype classifier is frozen during SGG training to mitigate long-tail bias; a knowledge-driven triplet loss is therefore developed to strengthen the learning of BiDR2, enhancing relation-prototype matching. Extensive experiments demonstrate the effectiveness of our method, which sets new state-of-the-art performance on the Visual Genome, GQA, and Open Images datasets.
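An "equidistributed prototype classifier with maximum inter-class margins" can be approximated by spreading fixed prototypes on the unit hypersphere before training and freezing them. A minimal sketch of that idea, assuming a simple max-cosine-similarity objective; the optimizer, step count, and objective are illustrative, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def equidistributed_prototypes(num_classes=10, dim=64, steps=500, lr=0.1):
    """Spread class prototypes on the unit hypersphere by minimizing the
    largest pairwise cosine similarity, then freeze them as the classifier."""
    protos = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([protos], lr=lr)
    for _ in range(steps):
        p = F.normalize(protos, dim=1)
        sim = p @ p.t() - 2 * torch.eye(num_classes)   # mask self-similarity
        loss = sim.max(dim=1).values.mean()            # push nearest pairs apart
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=1)         # frozen during training

protos = equidistributed_prototypes()
print((protos @ protos.t()).fill_diagonal_(0).max())   # max inter-class cosine
```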
{"title":"DuoNet: Joint optimization of representation learning and prototype classifier for unbiased scene graph generation","authors":"Zhaodi Wang , Biao Leng , Shuo Zhang","doi":"10.1016/j.patcog.2026.113152","DOIUrl":"10.1016/j.patcog.2026.113152","url":null,"abstract":"<div><div>Unbiased Scene Graph Generation (SGG) aims to parse visual scenes into highly informative graphs under the long-tail challenge. While prototype-based methods have shown promise in unbiased SGG, they highlight the importance of learning discriminative features that are intra-class compact and inter-class separable. In this paper, we revisit prototype-based methods and analyze critical roles of representation learning and prototype classifier in driving unbiased SGG, and accordingly propose a novel framework DuoNet. To enhance intra-class compactness, we introduce a Bi-Directional Representation Refinement (BiDR<sup>2</sup>) module that captures relation-sensitive visual variability and within-relation visual consistency of entities. This module adopts relation-to-entity-to-relation refinement by integrating dual-level relation pattern modeling with a relation-specific entity constraint. Furthermore, a Knowledge-Guided Prototype Learning (KGPL) module is devised to strengthen inter-class separability by constructing an equidistributed prototypical classifier with maximum inter-class margins. The equidistributed prototype classifier is frozen during SGG training to mitigate long-tail bias, thus a knowledge-driven triplet loss is developed to strengthen the learning of BiDR<sup>2</sup>, enhancing relation-prototype matching. Extensive experiments demonstrate the effectiveness of our method, which sets new state-of-the-art performance on Visual Genome, GQA and Open Images datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113152"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SCALAR: Spatial-concept alignment for robust vision in harsh open world
Xiaoyu Yang, Lijian Xu, Xingyu Zeng, Xiaosong Wang, Hongsheng Li, Shaoting Zhang
Pub Date: 2026-08-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.patcog.2026.113203 | Pattern Recognition, vol. 176, Article 113203
Foundation models have recently transformed visual-linguistic representation learning, yet their robustness under adverse imaging conditions of open worlds remains insufficiently understood. In this work, we introduce SCALAR, a scene-aware framework that endows multi-modal large language models with enhanced capability for robust spatial-concept alignment in degraded visual environments of open worlds. SCALAR proceeds in two complementary stages. The supervised alignment stage reconstructs hierarchical concept chains from visual-linguistic corpora, thereby enabling efficient spatial relationship decoding. The subsequent reinforced fine-tuning stage dispenses with annotations and leverages a consistency-driven reward to facilitate open-world self-evolution, yielding improved adaptability across diverse degraded domains. Crucially, SCALAR jointly optimizes multi-dimensional spatial representations and heterogeneous knowledge structures, thereby fostering resilience and generalization beyond canonical benchmarks. Extensive evaluations across five tasks and eight large-scale datasets demonstrate the efficacy of SCALAR in advancing state-of-the-art performance on visual grounding and complex scene understanding, even under challenging open-world environments with harsh visual conditions. Comprehensive ablation studies further elucidate the contributions of reinforced fine-tuning and multi-task joint optimization. Finally, to encourage future research, we provide a new multi-task visual grounding dataset emphasizing fine-grained scene-object relations under degradation, along with code: https://github.com/AnonymGiant/SCALAR.
{"title":"SCALAR: Spatial-concept alignment for robust vision in harsh open world","authors":"Xiaoyu Yang , Lijian Xu , Xingyu Zeng , Xiaosong Wang , Hongsheng Li , Shaoting Zhang","doi":"10.1016/j.patcog.2026.113203","DOIUrl":"10.1016/j.patcog.2026.113203","url":null,"abstract":"<div><div>Foundation models have recently transformed visual-linguistic representation learning, yet their robustness under adverse imaging conditions of open worlds remains insufficiently understood. In this work, we introduce SCALAR, a scene-aware framework that endows multi-modal large language models with enhanced capability for robust spatial-concept alignment in degraded visual environments of open worlds. SCALAR proceeds in two complementary stages. The supervised alignment stage reconstructs hierarchical concept chains from visual-linguistic corpora, thereby enabling efficient spatial relationship decoding. The subsequent reinforced fine-tuning stage dispenses with annotations and leverages a consistency-driven reward to facilitate open-world self-evolution, yielding improved adaptability across diverse degraded domains. Crucially, SCALAR jointly optimizes multi-dimensional spatial representations and heterogeneous knowledge structures, thereby fostering resilience and generalization beyond canonical benchmarks. Extensive evaluations across five tasks and eight large-scale datasets demonstrate the efficacy of SCALAR in advancing state-of-the-art performance on visual grounding and complex scene understanding, even under challenging open-world environments with harsh visual conditions. Comprehensive ablation studies further elucidate the contributions of reinforced fine-tuning and multi-task joint optimization. Finally, to encourage future research, we provide a new multi-task visual grounding dataset emphasizing fine-grained scene-object relations under degradation, along with code: <span><span>https://github.com/AnonymGiant/SCALAR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113203"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSAFNet: Cross-modal spatial alignment and fusion network for RGB-T crowd counting
Yongjie Zhao, Liuru Pu, Huaibo Song, Bo Jiang
Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.patcog.2026.113250 | Pattern Recognition, vol. 176, Article 113250
Crowd counting is critical for public safety and urban management in smart cities, yet faces challenges in complex scenarios. While RGB-Thermal (RGB-T) fusion helps address information loss in low-light conditions, current methods still suffer from two key limitations. (a) Existing RGB-T crowd counting methods fail to address the spatial misalignment between RGB and thermal features caused by different capturing devices, which diminishes fusion performance and impedes improvements in crowd counting accuracy. (b) Current methods fail to adequately distinguish between specific and common features of RGB and thermal modalities, leading to redundant feature fusion that compromises feature representation and results in suboptimal counting performance. To address the aforementioned challenges, the Cross-modal Spatial Alignment and Fusion Network (CSAFNet) is proposed. CSAFNet integrates three novel modules: the Cross-modal Feature Space Alignment (CFSA), Multiscale Spatial Displacement Compensation (MSDC), and Cross-modal Feature Decoupling Fusion (CFDF) modules. The CFSA module performs precise spatial alignment via feature windows and achieves wide spatial consistency through the MSDC module. The CFDF module employs Kullback-Leibler divergence and Jensen-Shannon divergence to perform decoupled fusion of cross-modal features, preserving modality-specific details, enhancing cross-modal commonalities, reducing redundant features, and strengthening discriminative feature representation. Extensive experiments demonstrate that the proposed CSAFNet achieves competitive performance on the RGBT-CC dataset, reducing GAME(0) to 10.75 and RMSE to 17.91. These results validate the effectiveness and promising potential of CSAFNet for cross-modal crowd counting tasks. Code is released at https://github.com/Zyjer888/CSAFNet.
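The CFDF module's use of KL and JS divergences suggests measuring where the two modalities agree before fusing. A minimal sketch of one such decoupled fusion, assuming channel-softmax distributions and an illustrative residual weight; this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kl_js_decoupled_fusion(rgb, thermal):
    """Treat each modality's channel responses as a distribution: low JS
    divergence marks shared content to fuse; the difference keeps
    modality-specific residuals."""
    p = F.softmax(rgb, dim=1)
    q = F.softmax(thermal, dim=1)
    m = 0.5 * (p + q)
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), per spatial location.
    js = 0.5 * (F.kl_div(m.log(), p, reduction="none")
                + F.kl_div(m.log(), q, reduction="none")).sum(1, keepdim=True)
    common = torch.exp(-js) * 0.5 * (rgb + thermal)    # low JS => shared content
    specific = rgb - thermal                           # modality-specific residual
    return common + 0.1 * specific                     # 0.1: illustrative weight

rgb = torch.randn(2, 16, 8, 8)        # (batch, channels, H, W) feature maps
thermal = torch.randn(2, 16, 8, 8)
print(kl_js_decoupled_fusion(rgb, thermal).shape)
```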
{"title":"CSAFNet: Cross-modal spatial alignment and fusion network for RGB-T crowd counting","authors":"Yongjie Zhao , Liuru Pu , Huaibo Song , Bo Jiang","doi":"10.1016/j.patcog.2026.113250","DOIUrl":"10.1016/j.patcog.2026.113250","url":null,"abstract":"<div><div>Crowd counting is critical for public safety and urban management in smart cities, yet faces challenges in complex scenarios. While RGB-Thermal (RGB-T) fusion helps address information loss in low-light conditions, current methods still suffer from two key limitations. (a) Existing RGB-T crowd counting methods fail to address the spatial misalignment between RGB and thermal features caused by different capturing devices, which diminishes fusion performance and impedes improvements in crowd counting accuracy. (b) Current methods fail to adequately distinguish between specific and common features of RGB and thermal modalities, leading to redundant feature fusion that compromises feature representation and results in suboptimal counting performance. To address the aforementioned challenges, the Cross-modal Spatial Alignment and Fusion Network (CSAFNet) is proposed. CSAFNet integrates three novel modules: the Cross-modal Feature Space Alignment (CFSA), Multiscale Spatia l Displacement Compensation (MSDC) and the Cross-modal Feature Decoupling Fusion (CFDF) modules. The CFSA module performs precise spatial alignment via feature windows and achieves wide spatial consistency through the MSDC module. The CFDF module employs Kullback-Leibler divergence and Jensen-Shannon divergence to perform decoupled fusion of cross-modal features, preserving modality-specific details, enhancing cross-modal commonalities, reducing redundant features, and strengthening discriminative feature representation. Extensive experiments demonstrate that the proposed CSAFNet achieves competitive performance on the RGBT-CC dataset, reducing GAME(0) to 10.75 and RMSE to 17.91. These results validate the effectiveness and promising potential of CSAFNet for cross-modal crowd counting tasks. <em><strong>Code is released at</strong></em> <span><span>https://github.com/Zyjer888/CSAFNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113250"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MG-TVMF: Multi-grained text-video matching and fusing for weakly supervised video anomaly detection
Ping He, Xiaonan Gao, Huibin Li
Pub Date: 2026-08-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.patcog.2026.113201 | Pattern Recognition, vol. 176, Article 113201
Weakly supervised video anomaly detection (WS-VAD) often suffers from false alarms and incomplete localization due to the lack of precise temporal annotations. To address these limitations, we propose a novel method, multi-grained text-video matching and fusing (MG-TVMF), which leverages semantic cues from anomaly category text labels to enhance both the accuracy and completeness of anomaly localization. MG-TVMF integrates two complementary branches: the MG-TVM branch improves localization accuracy through a hierarchical structure comprising a coarse-grained classification module and two fine-grained matching modules, namely a video-text matching (VTM) module for global semantic alignment and a segment-text matching (STM) module for local (segment-level) video-text alignment via an optimal transport algorithm. Meanwhile, the MG-TVF branch enhances localization completeness by prepending a global video-level text prompt to each segment-level caption for multi-grained textual fusion, and by reconstructing the masked anomaly-related caption of the top-scoring segment using video segment features and anomaly scores. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate the effectiveness of the proposed VTM and STM modules as well as the MG-TVF branch, and the proposed MG-TVMF method achieves state-of-the-art performance on the UCF-Crime, XD-Violence, and ShanghaiTech datasets.
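Optimal-transport alignment between segments and text, as in the STM module, is commonly solved with entropic regularization (Sinkhorn iterations). A minimal sketch assuming uniform marginals and a cosine cost; the regularization strength eps and the iteration count are illustrative choices.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, iters=50, eps=0.1):
    """Entropic optimal transport: soft assignment between video segments
    (rows) and text tokens (columns) under uniform marginals."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)   # uniform row marginal
    v = torch.ones(cost.size(1)) / cost.size(1)   # uniform column marginal
    r, c = u.clone(), v.clone()
    for _ in range(iters):
        r = u / (K @ c)
        c = v / (K.t() @ r)
    return r[:, None] * K * c[None, :]            # transport plan

segs = F.normalize(torch.randn(12, 256), dim=1)  # 12 segment features
txt = F.normalize(torch.randn(5, 256), dim=1)    # 5 text embeddings
plan = sinkhorn(1 - segs @ txt.t())              # cost = 1 - cosine similarity
print(plan.sum(), plan.shape)                    # total mass ~ 1, shape (12, 5)
```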
{"title":"MG-TVMF: Multi-grained text-video matching and fusing for weakly supervised video anomaly detection","authors":"Ping He, Xiaonan Gao, Huibin Li","doi":"10.1016/j.patcog.2026.113201","DOIUrl":"10.1016/j.patcog.2026.113201","url":null,"abstract":"<div><div>Weakly supervised video anomaly detection (WS-VAD) often suffers from false alarms and incomplete localization due to the lack of precise temporal annotations. To address these limitations, we propose a novel method, multi-grained text-video matching and fusing (MG-TVMF), which leverages semantic cues from anomaly category text labels to enhance both the accuracy and completeness of anomaly localization. MG-TVMF integrates two complementary branches: the MG-TVM branch improves localization accuracy through a hierarchical structure comprising a coarse-grained classification module and two fine-grained matching modules, including a video-text matching (VTM) module for global semantic alignment and a segment-text matching (STM) module for local video (i.e. segment) text alignment via optimal transport algorithm. Meanwhile, the MG-TVF branch enhances localization completeness by prepending a global video-level text prompt to each segment-level caption for multi-grained textual fusion, and reconstructing the masked anomaly-related caption of the top-scoring segment using video segment features and anomaly scores. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate the effectiveness of the proposed VTM and STM modules as well as the MG-TVF branch, and the proposed MG-TVMF method achieves state-of-the-art performance on UCF-Crime, XD-Violence, and ShanghaiTech datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113201"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain generalization via domain uncertainty shrinkage
Jun-Zheng Chu, Bin Pan, Tian-Yang Shi, Zhen-Wei Shi
Pub Date: 2026-08-01 | Epub Date: 2026-01-23 | DOI: 10.1016/j.patcog.2026.113118 | Pattern Recognition, vol. 176, Article 113118
Ensuring model robustness against distributional shifts still presents a significant challenge in many machine learning applications. To address this issue, a wide range of domain generalization (DG) methods have been developed. However, these approaches mainly focus on invariant representations obtained by leveraging multiple source domain data, ignoring the uncertainty presented by different domains. In this paper, we establish a novel DG framework in the form of evidential deep learning (EDL-DG). To reach the DG objective under finitely many given domains, we propose a new Domain Uncertainty Shrinkage (DUS) regularization scheme on the output Dirichlet distribution parameters, which achieves better generalization across unseen domains without introducing additional structures. Theoretically, we analyze the convergence of EDL-DG and provide a generalization bound in the framework of PAC-Bayesian learning. We show that our proposed method reduces the PAC-Bayesian bound under certain conditions and thus achieves better generalization across unseen domains. In our experiments, we validate the effectiveness of our proposed method on the DomainBed benchmark across multiple real-world datasets.
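In evidential deep learning, logits map to Dirichlet parameters and uncertainty is read off their sum. A minimal sketch of that mapping plus one plausible shrinkage term; the softplus evidence function and the simple mean-uncertainty penalty are illustrative assumptions, and the paper's exact DUS form is not reproduced here.

```python
import torch
import torch.nn.functional as F

def edl_uncertainty(logits, num_classes):
    """Evidential head: non-negative evidence -> Dirichlet parameters alpha;
    predictive uncertainty u = K / sum(alpha)."""
    alpha = F.softplus(logits) + 1.0
    return num_classes / alpha.sum(dim=1)

def dus_regularizer(logits_per_domain, num_classes):
    """Illustrative shrinkage term: penalize the mean Dirichlet uncertainty so
    predictions stay confident across all source domains."""
    us = [edl_uncertainty(l, num_classes).mean() for l in logits_per_domain]
    return torch.stack(us).mean()

domains = [torch.randn(32, 10) for _ in range(3)]  # logits from 3 source domains
print(dus_regularizer(domains, num_classes=10).item())
```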
{"title":"Domain generalization via domain uncertainty shrinkage","authors":"Jun-Zheng Chu , Bin Pan , Tian-Yang Shi , Zhen-Wei Shi","doi":"10.1016/j.patcog.2026.113118","DOIUrl":"10.1016/j.patcog.2026.113118","url":null,"abstract":"<div><div>Ensuring model robustness against distributional shifts still presents a significant challenge in many machine learning applications. To address this issue, a wide range of domain generalization (DG) methods have been developed. However, these approaches mainly focus on invariant representations by leveraging multiple source domain data, which ignore the uncertainty presented from different domains. In this paper, we establish a novel DG framework in form of evidential deep learning (EDL-DG). To reach DG objective under finite given domains, we propose a new <em>Domain Uncertainty Shrinkage</em> (<strong>DUS</strong>) regularization scheme on the output Dirichlet distribution parameters, which achieves better generalization across unseen domains without introducing additional structures. Theoretically, we analyze the convergence of EDL-DG, and provide a generalization bound in the framework of PAC-Bayesian learning. We show that our proposed method reduce the PAC-Bayesian bound under certain conditions, and thus achieve better generalization across unseen domains. In our experiments, we validate the effectiveness our proposed method on DomainBed benchmark in multiple real-world datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113118"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146081489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
YOLO-PICO: Lightweight object recognition in remote sensing images using expansion attention modules
Mohamad Ebrahim Aghili, Hassan Ghassemian, Maryam Imani
Pub Date: 2026-08-01 | Epub Date: 2026-01-17 | DOI: 10.1016/j.patcog.2026.113114 | Pattern Recognition, vol. 176, Article 113114
Recognizing small objects in remote sensing imagery remains a significant challenge. This paper introduces YOLO-PICO, a novel and highly efficient object detector designed for small object recognition. At its core is the Expansion Attention (EA) Module, a new operator for spatial-channel feature fusion that enhances fine-grained details with minimal computational cost. This allows YOLO-PICO to achieve competitive performance with significantly fewer parameters than existing models, as demonstrated by our new parameter efficiency metric, Size-Normalized Average Precision (SNAP). Furthermore, we show that YOLO-PICO's efficiency makes it an ideal foundation for an Ensemble of Specialists (EoS) framework, a decision-level fusion strategy that substantially boosts detection accuracy with a modest increase in inference time. Our results demonstrate that this combination of an efficient core model and an advanced fusion strategy offers a compelling solution for high-performance recognition on resource-constrained platforms. The code will be made available at: https://github.com/MohamadEbrahimAghili/YOLO-PICO.
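The abstract describes the EA module only as a spatial-channel feature-fusion operator, so the sketch below is one plausible reading, not the paper's design: expand channels with a cheap 1x1 convolution, derive joint spatial and channel gates, and modulate the input. Every layer choice here is an assumption.

```python
import torch
import torch.nn as nn

class ExpansionAttention(nn.Module):
    """Illustrative spatial-channel attention: channel expansion followed by
    a spatial gate and a channel gate that jointly modulate the input."""
    def __init__(self, channels, expand=2):
        super().__init__()
        mid = channels * expand
        self.expand = nn.Conv2d(channels, mid, kernel_size=1)
        self.spatial = nn.Conv2d(mid, 1, kernel_size=3, padding=1)
        self.channel = nn.Linear(mid, channels)

    def forward(self, x):
        e = torch.relu(self.expand(x))                       # channel expansion
        s = torch.sigmoid(self.spatial(e))                   # (B,1,H,W) spatial gate
        c = torch.sigmoid(self.channel(e.mean(dim=(2, 3))))  # (B,C) channel gate
        return x * s * c[:, :, None, None]                   # fused modulation

x = torch.randn(1, 32, 40, 40)
print(ExpansionAttention(32)(x).shape)   # torch.Size([1, 32, 40, 40])
```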
{"title":"YOLO-PICO: Lightweight object recognition in remote sensing images using expansion attention modules","authors":"Mohamad Ebrahim Aghili, Hassan Ghassemian, Maryam Imani","doi":"10.1016/j.patcog.2026.113114","DOIUrl":"10.1016/j.patcog.2026.113114","url":null,"abstract":"<div><div>Recognizing small objects in remote sensing imagery remains a significant challenge. This paper introduces YOLO-PICO, a novel and highly efficient object detector designed for small object recognition. At its core is the Expansion Attention (EA) Module, a new operator for spatial-channel feature fusion that enhances fine-grained details with minimal computational cost. This allows YOLO-PICO to achieve competitive performance with significantly fewer parameters than existing models, as demonstrated by our new parameter efficiency metric, Size-Normalized Average Precision (SNAP). Furthermore, we show that YOLO-PICO's efficiency makes it an ideal foundation for an Ensemble of Specialists (EoS) framework, a decision-level fusion strategy that substantially boosts detection accuracy with a modest increase in inference time. Our results demonstrate that this combination of an efficient core model and an advanced fusion strategy offers a compelling solution for high-performance recognition on resource-constrained platforms. The code will be made available at: <span><span>https://github.com/MohamadEbrahimAghili/YOLO-PICO</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"176 ","pages":"Article 113114"},"PeriodicalIF":7.6,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146081495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}