Kernel entropy graph isomorphism network for graph classification
Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113182
Lixiang Xu, Wei Ge, Feiping Nie, Enhong Chen, Bin Luo
Graph neural networks (GNNs) have been successfully applied to many graph classification tasks. However, most GNNs are built on message-passing neural network (MPNN) frameworks, which makes it difficult to exploit the structural information of a graph from multiple perspectives. To address this limitation, we incorporate structural information into the graph embedding in two ways. On the one hand, subgraph information in a node's neighborhood is injected into the GNN's message-passing process through graph entropy. On the other hand, we encode path information in the graph with an improved shortest path kernel. These two kinds of structural information are then fused through an attention mechanism, which enriches the structural expressiveness of the graph neural network. Finally, the model is evaluated on seven publicly available graph classification datasets. Extensive experiments show that, compared with existing graph representation models, our model obtains better graph representations and achieves more competitive performance.
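The abstract describes fusing two structural views (an entropy-guided message-passing embedding and a shortest-path-kernel embedding) via attention. As a rough illustration only, not the authors' implementation, the following minimal PyTorch sketch fuses two precomputed graph-level embeddings with a learned attention weight; the module name AttentionFusion, the embedding dimension, and the scoring network are assumptions.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse two graph-level embeddings with a learned attention weight.

    h_entropy: embedding from the entropy-guided message-passing branch.
    h_kernel:  embedding from the shortest-path-kernel branch.
    Both are assumed to be (batch, dim) tensors; names are illustrative.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h_entropy: torch.Tensor, h_kernel: torch.Tensor) -> torch.Tensor:
        # Stack the two views and compute a softmax attention weight per view.
        views = torch.stack([h_entropy, h_kernel], dim=1)   # (B, 2, D)
        alpha = torch.softmax(self.score(views), dim=1)     # (B, 2, 1)
        return (alpha * views).sum(dim=1)                   # (B, D)

# Usage with random embeddings standing in for the two branches.
fusion = AttentionFusion(dim=64)
g = fusion(torch.randn(8, 64), torch.randn(8, 64))
print(g.shape)  # torch.Size([8, 64])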
A two-stage learning framework with a beam image dataset for automatic laser resonator alignment
Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113145
Shaoxiang Guo, Donald Risbridger, David A. Robb, Xianwen Kong, M. J. Daniel Esser, Michael J. Chantler, Richard M. Carter, Mustafa Suphi Erden
Accurate alignment of a laser resonator is essential for upscaling industrial laser manufacturing and precision processing. However, traditional manual or semi-automatic methods depend heavily on operator expertise and struggle with the interdependence among multiple alignment parameters. To tackle this, we introduce the first real-world image dataset for automatic laser resonator alignment, collected on a laboratory-built resonator setup. It comprises over 6000 beam profiler images annotated with four key alignment parameters (intracavity iris aperture diameter, output coupler pitch and yaw actuator displacements, and axial position of the output coupler), with over 500,000 paired samples for data-driven alignment. Given a pair of beam profiler images exhibiting distinct beam patterns under different configurations, the system predicts the control-parameter changes required to realign the resonator. Leveraging this dataset, we propose a novel two-stage deep learning framework for automatic resonator alignment. In Stage 1, a multi-scale CNN augmented with cross-attention and correlation-difference modules extracts features and outputs an initial coarse prediction of the alignment parameters. In Stage 2, a feature-difference map is computed by subtracting the paired feature representations and fed into an iterative refinement module to correct residual misalignments. The final prediction combines the coarse and refined estimates, integrating global context with fine-grained corrections for accurate inference. Experiments on our dataset, and on a different instance of the same physical system from that on which the CNN was trained, suggest superior accuracy and practicality compared with manual alignment.
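The two-stage pattern (coarse regression from paired features, then iterative refinement from a feature difference) can be sketched as below. This is a simplified stand-in under stated assumptions, not the paper's network: the tiny encoder replaces the multi-scale CNN with cross-attention and correlation-difference modules, and all layer sizes are hypothetical.

import torch
import torch.nn as nn

class TwoStageAligner(nn.Module):
    """Coarse prediction from paired features, iterative refinement from their difference."""
    def __init__(self, n_params: int = 4, feat_dim: int = 128):
        super().__init__()
        # Tiny stand-in encoder; the paper's Stage-1 network is a multi-scale CNN
        # with cross-attention and correlation-difference modules.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.coarse_head = nn.Linear(2 * feat_dim, n_params)
        self.refine_head = nn.Linear(feat_dim + n_params, n_params)

    def forward(self, img_a, img_b, refine_steps: int = 3):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        coarse = self.coarse_head(torch.cat([fa, fb], dim=1))  # Stage 1: coarse estimate
        diff = fb - fa                                         # feature difference
        delta = torch.zeros_like(coarse)
        for _ in range(refine_steps):                          # Stage 2: residual correction
            delta = delta + self.refine_head(torch.cat([diff, delta], dim=1))
        return coarse + delta                                  # combine coarse and refined estimates

pred = TwoStageAligner()(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
print(pred.shape)  # torch.Size([2, 4])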
MC-MVSNet: When multi-view stereo meets monocular cues
Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113166
Xincheng Tang, Mengqi Rong, Bin Fan, Hongmin Liu, Shuhan Shen
Learning-based Multi-View Stereo (MVS) has become a key technique for reconstructing dense 3D point clouds from multiple calibrated images. However, real-world challenges such as occlusions and textureless regions often hinder accurate depth estimation. Recent advances in monocular Vision Foundation Models (VFMs) have demonstrated strong generalization capabilities in scene understanding, offering new opportunities to enhance the robustness of MVS. In this paper, we present MC-MVSNet, a novel MVS framework that integrates diverse monocular cues to improve depth estimation under challenging conditions. During feature extraction, we fuse conventional CNN features with VFM-derived representations through a hybrid feature fusion module, effectively combining local details and global context for more discriminative feature matching. We also propose a cost volume filtering module that enforces cross-view geometric consistency on monocular depth predictions, pruning redundant depth hypotheses to reduce the depth search space and mitigate matching ambiguity. Additionally, we leverage monocular surface normals to construct a curved patch cost aggregation module that aggregates costs over geometry-aligned curved patches, which improves depth estimation accuracy in curved and textureless regions. Extensive experiments on the DTU, Tanks and Temples, and ETH3D benchmarks demonstrate that MC-MVSNet achieves state-of-the-art performance and exhibits strong generalization capabilities, validating the effectiveness and robustness of the proposed method.
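As a loose illustration of the hybrid feature fusion idea (CNN features for local detail, VFM features for global context), here is a minimal PyTorch sketch; the channel counts, the gating design, and the bilinear upsampling of the VFM features are assumptions rather than the paper's exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFeatureFusion(nn.Module):
    """Fuse CNN features (local detail) with VFM features (global context)."""
    def __init__(self, cnn_ch: int = 32, vfm_ch: int = 384, out_ch: int = 32):
        super().__init__()
        self.proj = nn.Conv2d(vfm_ch, out_ch, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Conv2d(cnn_ch + out_ch, out_ch, kernel_size=1), nn.Sigmoid())
        self.out = nn.Conv2d(cnn_ch + out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, f_cnn, f_vfm):
        # Resample the (typically coarser) VFM features to the CNN resolution.
        f_vfm = F.interpolate(self.proj(f_vfm), size=f_cnn.shape[-2:],
                              mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([f_cnn, f_vfm], dim=1))  # per-pixel channel gate
        fused = torch.cat([f_cnn, g * f_vfm], dim=1)
        return self.out(fused)

fused = HybridFeatureFusion()(torch.randn(1, 32, 64, 80), torch.randn(1, 384, 16, 20))
print(fused.shape)  # torch.Size([1, 32, 64, 80])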
DuoNet: Joint optimization of representation learning and prototype classifier for unbiased scene graph generation
Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113152
Zhaodi Wang, Biao Leng, Shuo Zhang
Unbiased Scene Graph Generation (SGG) aims to parse visual scenes into highly informative graphs under the long-tail challenge. Prototype-based methods have shown promise in unbiased SGG, and they highlight the importance of learning discriminative features that are intra-class compact and inter-class separable. In this paper, we revisit prototype-based methods, analyze the critical roles of representation learning and the prototype classifier in driving unbiased SGG, and accordingly propose a novel framework, DuoNet. To enhance intra-class compactness, we introduce a Bi-Directional Representation Refinement (BiDR2) module that captures the relation-sensitive visual variability and within-relation visual consistency of entities. This module adopts relation-to-entity-to-relation refinement by integrating dual-level relation pattern modeling with a relation-specific entity constraint. Furthermore, a Knowledge-Guided Prototype Learning (KGPL) module is devised to strengthen inter-class separability by constructing an equidistributed prototypical classifier with maximum inter-class margins. The equidistributed prototype classifier is frozen during SGG training to mitigate long-tail bias, and a knowledge-driven triplet loss is developed to strengthen the learning of BiDR2 and enhance relation-prototype matching. Extensive experiments demonstrate the effectiveness of our method, which sets new state-of-the-art performance on the Visual Genome, GQA, and Open Images datasets.
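One common way to build an equidistributed prototype classifier with large inter-class margins is to spread unit-norm prototypes on the hypersphere by minimizing their worst-case pairwise cosine similarity, then freeze them and classify by cosine similarity. The sketch below follows that recipe as an assumption; it is not necessarily the construction used in KGPL.

import torch
import torch.nn.functional as F

def equidistributed_prototypes(num_classes: int, dim: int, steps: int = 1000, lr: float = 0.1):
    """Spread class prototypes on the unit hypersphere by pushing down the
    largest pairwise cosine similarity (one common recipe; details assumed)."""
    protos = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([protos], lr=lr, momentum=0.9)
    for _ in range(steps):
        p = F.normalize(protos, dim=1)
        sim = p @ p.t() - 2.0 * torch.eye(num_classes)  # mask self-similarity
        loss = sim.max(dim=1).values.mean()             # minimize worst-case similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=1)          # frozen during training

protos = equidistributed_prototypes(num_classes=50, dim=128)
feats = F.normalize(torch.randn(4, 128), dim=1)         # stand-in for relation features
logits = feats @ protos.t()                             # cosine-similarity classifier
print(logits.shape)  # torch.Size([4, 50])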
DFWe: Efficient knowledge distillation of fine-tuned Whisper encoder for speech emotion recognition
Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113161
Yujian Ma, Xianquan Jiang, Jinqiu Sang, Ruizhe Li
Despite the strong acoustic modeling capabilities of large pre-trained speech models such as Whisper, their direct application to speech emotion recognition (SER) is hindered by task mismatch, high computational cost, and limited retention of affective cues. To address these challenges, we propose DFWe (Distillation of Fine-Tuned Whisper Encoder), a two-stage knowledge distillation (KD) framework combining parameter-efficient adaptation with multi-objective supervision. In Stage 1, a subset of upper layers in the Whisper encoder is fine-tuned along with a lightweight projector and classification head, enabling the model to preserve general acoustic knowledge while adapting to emotion-specific features. In Stage 2, knowledge is distilled into a compact Whisper-Small student using a hybrid loss that integrates hard-label cross-entropy, confidence-aware soft-label KL divergence, and intermediate feature alignment via Centered Kernel Alignment (CKA). On the IEMOCAP dataset with 10-fold cross-validation (CV), DFWe achieves a 7.21× reduction in model size while retaining 99.99% of the teacher’s unweighted average recall (UAR) and reaching 79.82% weighted average recall (WAR) and 81.32% UAR, representing state-of-the-art performance among knowledge-distillation-based SER methods. Ablation studies highlight the benefits of adaptive temperature scaling, multi-level supervision, and targeted augmentation in improving both accuracy and robustness. Case analyses further show that DFWe yields more confident and stable predictions in emotionally ambiguous scenarios, underscoring its practical effectiveness. Overall, DFWe offers a scalable, generalizable solution for deploying SER systems in resource-constrained environments.
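The hybrid loss (hard-label cross-entropy, confidence-aware soft-label KL, and CKA feature alignment) can be sketched as follows. The linear CKA term is standard; the confidence weighting shown here (scaling each sample's KL term by the teacher's probability on the true class) and the loss weights are simplifying assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear Centered Kernel Alignment between two (batch, dim) feature matrices."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (x.t() @ y).norm(p="fro") ** 2
    return hsic / ((x.t() @ x).norm(p="fro") * (y.t() @ y).norm(p="fro") + 1e-8)

def hybrid_kd_loss(s_logits, t_logits, s_feat, t_feat, labels, T=2.0,
                   w_ce=1.0, w_kd=1.0, w_cka=1.0):
    ce = F.cross_entropy(s_logits, labels)                  # hard-label term
    # Soft-label KL, scaled by the teacher's confidence on the true class
    # (a simplified stand-in for the paper's confidence-aware weighting).
    t_prob = F.softmax(t_logits / T, dim=1)
    conf = F.softmax(t_logits, dim=1).gather(1, labels[:, None]).squeeze(1)
    kl = F.kl_div(F.log_softmax(s_logits / T, dim=1), t_prob,
                  reduction="none").sum(dim=1)
    kd = (conf * kl).mean() * (T ** 2)
    cka = 1.0 - linear_cka(s_feat, t_feat)                  # feature alignment term
    return w_ce * ce + w_kd * kd + w_cka * cka

loss = hybrid_kd_loss(torch.randn(8, 4), torch.randn(8, 4),
                      torch.randn(8, 64), torch.randn(8, 96),
                      torch.randint(0, 4, (8,)))
print(loss.item())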
S2I-DiT: Unlocking the semantic-to-image transferability by fine-tuning large diffusion transformer models
Pub Date: 2026-01-25 | DOI: 10.1016/j.patcog.2026.113158
Gang Li, Enze Xie, Chongjian Ge, Xiang Li, Lingyu Si, Changwen Zheng, Zhenguo Li
Denoising Diffusion Probabilistic Models (DDPMs) have made significant progress in image generation. Recent works in semantic-to-image (S2I) synthesis have also shifted from the previously de facto GAN-based methods to DDPMs, yielding better results. However, these works mostly employ a U-Net structure and a vanilla training-from-scratch scheme for S2I, overlooking the potential benefits offered by task-related pre-training. In this work, we introduce a Transformer-based architecture, namely S2I-DiT, and reconsider the merits of a pre-trained large diffusion model for cross-task adaptation (i.e., from class-conditional generation to S2I). In S2I-DiT, we propose the integration of semantic embedders within Diffusion Transformers (DiTs) to maximize the utilization of semantic information. The semantic embedder densely encodes semantic layouts to guide the adaptive normalization process. We configure semantic embedders in a layer-wise manner to learn pixel-level correspondence, enabling finer-grained semantic-to-image control. Besides, to fully unleash the cross-task transferability of DDPMs, we introduce a two-stage fine-tuning strategy, which first adapts the semantic embedders in the pixel-level space and then fine-tunes the partial/entire model for cross-task adaptation. Notably, S2I-DiT pioneers the application of large Diffusion Transformers to cross-task fine-tuning. Extensive experiments on four benchmark datasets demonstrate S2I-DiT’s effectiveness, as it achieves state-of-the-art performance in terms of quality (FID) and diversity (LPIPS) while consuming fewer training iterations. This work establishes a new state of the art for semantic-to-image generation and provides valuable insights into the cross-task transferability of large generative models.
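To make the adaLN-style conditioning concrete, the following minimal sketch shows a transformer block whose normalization scale and shift are predicted from a semantic-layout embedding. The pooling of the layout to a single vector, the module sizes, and the block layout are assumptions; the actual S2I-DiT embedders are layer-wise and preserve pixel-level correspondence.

import torch
import torch.nn as nn

class SemanticAdaLNBlock(nn.Module):
    """Transformer block whose LayerNorm scale/shift are predicted from a
    semantic-layout embedding (adaLN-style modulation); sizes are illustrative."""
    def __init__(self, dim: int = 256, num_classes: int = 20, heads: int = 4):
        super().__init__()
        # Semantic embedder: encode the layout map and pool it to one vector per image.
        self.embedder = nn.Sequential(
            nn.Conv2d(num_classes, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_mod = nn.Linear(dim, 2 * dim)       # predicts (scale, shift)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens, layout):
        scale, shift = self.to_mod(self.embedder(layout)).chunk(2, dim=1)
        h = self.norm(tokens) * (1 + scale[:, None]) + shift[:, None]
        tokens = tokens + self.attn(h, h, h)[0]
        return tokens + self.mlp(self.norm(tokens))

x = torch.randn(2, 16 * 16, 256)      # image tokens
layout = torch.randn(2, 20, 64, 64)   # stand-in for a one-hot semantic layout map
print(SemanticAdaLNBlock()(x, layout).shape)  # torch.Size([2, 256, 256])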
Learning discriminative features within forward-forward algorithm using convolutional prototype
Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113139
Qiufu Li, Zewen Li, Linlin Shen
Compared to back-propagation algorithms, the forward-forward (FF) algorithm proposed by Hinton [1] can optimize all layers of a deep network model in parallel, while requiring less storage and achieving higher computational efficiency. However, current FF methods cannot fully leverage the label information of samples, which suppresses the learning of discriminative features. In this paper, we propose prototype learning within the FF algorithm (PLFF). When optimizing each convolutional layer, PLFF first divides the convolutional kernels into groups according to the number K of classes; these groups serve as class prototypes during optimization and are referred to as convolutional prototypes. For every sample, K goodness scores are calculated from the convolutional responses between the sample data and the convolutional prototypes. Then, using multiple binary cross-entropy losses, PLFF maximizes the positive goodness score corresponding to the sample label while minimizing the other, negative goodness scores, so as to learn discriminative features. Meanwhile, PLFF maximizes the cosine distances among the K convolutional prototypes, which enhances their discrimination and, in turn, promotes feature learning. Image classification results across multiple datasets show that PLFF achieves the best results among FF methods. Finally, for the first time, we evaluate the long-tailed recognition performance of different FF methods, demonstrating that our PLFF achieves superior results.
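A minimal sketch of the layer-local objective described above, under several assumptions: the goodness for class k is taken as the mean squared activation of that class's kernel group, the threshold theta is a free hyperparameter, and the prototype-separation term simply penalizes pairwise cosine similarity rather than maximizing cosine distance exactly.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PLFFLayer(nn.Module):
    """One convolutional layer trained with a forward-forward style local loss.

    The output channels are split into K groups (one per class); a sample's
    goodness for class k is the mean squared activation of group k.
    """
    def __init__(self, in_ch: int, ch_per_class: int, num_classes: int, theta: float = 2.0):
        super().__init__()
        self.num_classes = num_classes
        self.conv = nn.Conv2d(in_ch, ch_per_class * num_classes, 3, padding=1)
        self.theta = theta                       # goodness threshold (assumed)

    def goodness(self, x):
        act = F.relu(self.conv(x))               # (B, K*C, H, W)
        act = act.view(x.size(0), self.num_classes, -1)
        return act.pow(2).mean(dim=2)            # (B, K)

    def local_loss(self, x, labels):
        g = self.goodness(x)
        # Positive goodness (true class) should exceed theta, negatives stay below.
        target = F.one_hot(labels, self.num_classes).float()
        cls_loss = F.binary_cross_entropy_with_logits(g - self.theta, target)
        # Push the K kernel groups (flattened prototypes) apart in cosine similarity.
        w = F.normalize(self.conv.weight.view(self.num_classes, -1), dim=1)
        sep_loss = (w @ w.t() - torch.eye(self.num_classes)).pow(2).mean()
        return cls_loss + sep_loss

layer = PLFFLayer(in_ch=3, ch_per_class=8, num_classes=10)
loss = layer.local_loss(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))
loss.backward()  # only this layer's parameters receive gradients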
EEnvA-Mamba: Effective and environtology-aware adaptive Mamba for road object detection in adverse weather scenes
Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113127
Yonglin Chen, Binzhi Fan, Nan Liu, Yalong Yang, Jinhui Tang
Adverse weather conditions severely degrade visual perception in autonomous driving systems, primarily due to image quality deterioration, object occlusion, and unstable illumination. Current deep learning-based detection methods exhibit limited robustness in such scenarios, as corrupted features and inefficient algorithmic adaptation impair their performance under weather variations. To overcome these challenges, we propose EEnvA-Mamba, a computationally efficient architecture that combines real-time processing with high detection accuracy. The framework features three core components: (1) AVSSBlock, a vision state-space block that incorporates environment-aware gating (dynamically adjusting feature-channel weights based on weather conditions) and weather-conditioned channel weighting (unequal channel responses under different weather types), effectively mitigating feature degradation; (2) a linear-complexity computation scheme that replaces conventional quadratic Transformer operations while preserving discriminative feature learning; (3) AStem, an attention-guided dual-branch module that strengthens local feature extraction via spatial-channel interactions while employing frequency-domain denoising to suppress noise across various frequencies, ensuring precise dependency modeling. To support rigorous validation, we collected and annotated VOC-SNOW, a dedicated snowy-road dataset comprising 2700 annotated images with diverse illumination and snowfall levels. Comparative experiments on multiple datasets verify our method's superiority, demonstrating state-of-the-art performance with 66.4% APval (4.5% higher than leading counterparts). The source code has been released at https://github.com/fbzahwy/EEnvA-Mamba.
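As an illustration of weather-conditioned channel weighting, the sketch below gates feature channels using both global channel statistics and an embedding of a weather label; the squeeze-excitation-style layout and the use of a discrete weather id are assumptions and only loosely mirror AVSSBlock's gating.

import torch
import torch.nn as nn

class EnvAwareGate(nn.Module):
    """Re-weight feature channels from (i) global image statistics and
    (ii) an embedding of the (predicted or given) weather condition.
    The design below is an illustrative stand-in, not the paper's module."""
    def __init__(self, channels: int = 64, num_weather: int = 4):
        super().__init__()
        self.weather_emb = nn.Embedding(num_weather, channels)
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())

    def forward(self, feat, weather_id):
        b, c, _, _ = feat.shape
        stats = self.squeeze(feat).view(b, c)      # global channel statistics
        cond = self.weather_emb(weather_id)        # weather-conditioned weights
        gate = self.excite(torch.cat([stats, cond], dim=1)).view(b, c, 1, 1)
        return feat * gate

out = EnvAwareGate()(torch.randn(2, 64, 40, 40), torch.tensor([0, 2]))
print(out.shape)  # torch.Size([2, 64, 40, 40])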
FeatureSORT: A robust tracker with optimized feature integration
Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113148
Hamidreza Hashempoor, Rosemary Koikara, Yu Dong Hwang
We introduce FeatureSORT, a simple yet effective online multiple object tracker that reinforces the baselines with a redesigned detector and additional feature cues, while keeping computational complexity low. In contrast to conventional detectors that only provide bounding boxes, our detector architecture is extended to output multiple appearance attributes, including clothing color, clothing style, and motion direction, alongside the bounding boxes. These feature cues, together with a ReID network, form complementary embeddings that substantially improve association accuracy. The rationale behind selecting and combining these attributes is thoroughly examined in extensive ablation studies. Furthermore, we incorporate stronger post-processing strategies, such as global linking and Gaussian Smoothing Process interpolation, to handle missing associations and detections. During online tracking, we define a measurement-to-track distance function that jointly considers IoU, direction, color, style, and ReID similarity. This design enables FeatureSORT to maintain consistent identities through longer occlusions while reducing identity switches. Extensive experiments on standard MOT benchmarks demonstrate that FeatureSORT achieves state-of-the-art (SOTA) online performance, with MOTA scores of 79.7 on MOT16, 80.6 on MOT17, 77.9 on MOT20, and 92.2 on DanceTrack, underscoring the effectiveness of feature-enriched detection in advancing multi-object tracking. Our GitHub repository includes the code implementation.
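The measurement-to-track distance that jointly considers IoU, direction, color, style, and ReID similarity could, for example, be a weighted sum of per-cue costs, as in the sketch below; the cue representations (unit direction vector, discrete color/style ids, ReID embedding) and the weights are illustrative assumptions, not the paper's exact definition.

import numpy as np

def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cosine_cost(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def association_cost(det, trk, w=(0.3, 0.1, 0.1, 0.1, 0.4)):
    """Weighted sum of per-cue costs; det/trk are dicts with keys
    'box', 'direction', 'color', 'style', 'reid' (names are illustrative)."""
    costs = [
        1.0 - iou(det["box"], trk["box"]),
        cosine_cost(det["direction"], trk["direction"]),
        float(det["color"] != trk["color"]),   # discrete attribute mismatch
        float(det["style"] != trk["style"]),
        cosine_cost(det["reid"], trk["reid"]),
    ]
    return float(np.dot(w, costs))

det = {"box": [10, 10, 50, 90], "direction": [1.0, 0.0], "color": 3, "style": 1,
       "reid": np.random.rand(128)}
trk = {"box": [12, 8, 52, 88], "direction": [1.0, 0.1], "color": 3, "style": 1,
       "reid": np.random.rand(128)}
print(association_cost(det, trk))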
PoseAdapter: Efficiently transferring 2D human pose estimator to 3D whole-body task via adapter
Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113154
Ze Feng, Sen Yang, Jiang-Jiang Liu, Wankou Yang
In this paper, we explore 3D whole-body pose estimation from a single-frame image and propose a new paradigm called PoseAdapter, which exploits a well-pretrained 2D human pose estimation model equipped with adapters. Mainstream paradigms for 3D human pose estimation typically require multiple stages, such as human box detection, 2D pose estimation, and lifting to 3D coordinates. Such a multi-stage approach can lose context information in the compression process, resulting in inferior pose results, particularly for dense prediction tasks such as 3D whole-body pose estimation. To improve accuracy, some methods even use multi-frame fusion to enhance the current pose, including input from future frames, which is inherently non-causal. Considering that end-to-end 2D human pose methods can extract human-related and keypoint-specific visual features, we employ such a model as a general vision-based human analysis backbone and enable it to predict 3D whole-body poses. By freezing most of the parameters of the 2D model and tuning the newly added adapters, PoseAdapter transfers the 2D estimator to the 3D pose task in a parameter-efficient manner, while retaining its original ability to distinguish multiple human instances. Quantitative results on H3WB demonstrate that PoseAdapter achieves 62.74 mm MPJPE with fewer trainable parameters. Qualitative results further show that PoseAdapter can predict multi-person 3D whole-body poses and generalizes to out-of-domain datasets, such as COCO.
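The parameter-efficient recipe (freeze the pretrained 2D blocks, train only small adapters) follows a standard bottleneck-adapter pattern; a minimal sketch is given below, with a generic transformer layer standing in for the 2D pose estimator's blocks and all sizes chosen arbitrarily.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wrap a frozen pretrained block (stand-in here) with a trainable adapter."""
    def __init__(self, pretrained_block: nn.Module, dim: int):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad = False       # keep the 2D weights frozen
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Stand-in for one transformer layer of a pretrained 2D pose estimator.
frozen = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
block = AdaptedBlock(frozen, dim=256)
out = block(torch.randn(2, 17, 256))      # (batch, tokens, dim)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(out.shape, trainable)               # only the adapter's parameters are trainable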