
Pattern Recognition: Latest Publications

Kernel entropy graph isomorphism network for graph classification
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113182
Lixiang Xu , Wei Ge , Feiping Nie , Enhong Chen , Bin Luo
Graph neural networks (GNNs) have been successfully applied to many graph classification tasks. However, most GNNs are built on message-passing neural network (MPNN) frameworks, which makes it difficult to exploit the structural information of the graph from multiple perspectives. To address the limitations of existing GNN methods, we incorporate structural information into the graph embedding representation in two ways. On the one hand, the subgraph information in the neighborhood of a node is injected into the GNN message-passing process through graph entropy. On the other hand, we encode path information in the graph with the help of an improved shortest-path kernel. These two sources of structural information are then fused through an attention mechanism, which captures the structure of the graph and enriches the structural expressiveness of the graph neural network. Finally, the model is evaluated experimentally on seven publicly available graph classification datasets. Extensive experiments show that, compared with existing graph representation models, our model obtains better graph representations and achieves more competitive performance.
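The abstract describes fusing two structural embeddings (graph-entropy neighborhood features and improved shortest-path-kernel features) through an attention mechanism. Below is a minimal PyTorch sketch of one way such an attention fusion could look; the module name, dimensions, and scoring function are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse two graph-level embeddings with learned attention weights.

    Hypothetical sketch: `h_entropy` and `h_kernel` stand in for the
    entropy-based and shortest-path-kernel-based representations the
    abstract mentions; the real model's fusion may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h_entropy: torch.Tensor, h_kernel: torch.Tensor) -> torch.Tensor:
        views = torch.stack([h_entropy, h_kernel], dim=1)   # (batch, 2, dim)
        alpha = torch.softmax(self.score(views), dim=1)     # one weight per view
        return (alpha * views).sum(dim=1)                   # (batch, dim)

fused = AttentionFusion(dim=64)(torch.randn(8, 64), torch.randn(8, 64))
print(fused.shape)  # torch.Size([8, 64])
```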
Citations: 0
A two-stage learning framework with a beam image dataset for automatic laser resonator alignment
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113145
Shaoxiang Guo , Donald Risbridger , David A. Robb , Xianwen Kong , M. J. Daniel Esser , Michael J. Chantler , Richard M. Carter , Mustafa Suphi Erden
Accurate alignment of a laser resonator is essential for upscaling industrial laser manufacturing and precision processing. However, traditional manual or semi-automatic methods depend heavily on operator expertise and struggle with the interdependence among multiple alignment parameters. To tackle this, we introduce the first real-world image dataset for automatic laser resonator alignment, collected on a laboratory-built resonator setup. It comprises over 6000 beam profiler images annotated with four key alignment parameters (intracavity iris aperture diameter, output coupler pitch and yaw actuator displacements, and axial position of the output coupler), with over 500,000 paired samples for data-driven alignment. Given a pair of beam profiler images exhibiting distinct beam patterns under different configurations, the system predicts the control-parameter changes required to realign the resonator. Leveraging this dataset, we propose a novel two-stage deep learning framework for automatic resonator alignment. In Stage 1, a multi-scale CNN augmented with cross-attention and correlation-difference modules extracts features and outputs an initial coarse prediction of the alignment parameters. In Stage 2, a feature-difference map is computed by subtracting the paired feature representations and fed into an iterative refinement module to correct residual misalignments. The final prediction combines the coarse and refined estimates, integrating global context with fine-grained corrections for accurate inference. Experiments on our dataset, and on a second instance of the physical system different from the one the CNN was trained on, suggest accuracy and practicality superior to manual alignment.
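As a rough illustration of the coarse-then-refine scheme in the abstract (Stage 1 regresses the four alignment parameters from a pair of beam images, Stage 2 iteratively corrects residuals from a feature difference), here is a hypothetical PyTorch sketch. The encoder, layer sizes, and number of refinement steps are placeholders; the actual multi-scale CNN with cross-attention and correlation-difference modules is not reproduced.

```python
import torch
import torch.nn as nn

class TwoStageAligner(nn.Module):
    """Hypothetical sketch of the coarse-then-refine idea in the abstract."""
    def __init__(self, feat_dim: int = 128, n_params: int = 4, steps: int = 3):
        super().__init__()
        # Placeholder encoder standing in for the multi-scale CNN.
        self.encode = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.coarse_head = nn.Linear(2 * feat_dim, n_params)
        self.refine_head = nn.Linear(feat_dim + n_params, n_params)
        self.steps = steps

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        fa, fb = self.encode(img_a), self.encode(img_b)
        # Stage 1: coarse prediction from the concatenated pair features.
        pred = self.coarse_head(torch.cat([fa, fb], dim=-1))
        # Stage 2: iteratively correct residuals from the feature difference.
        diff = fa - fb
        for _ in range(self.steps):
            pred = pred + self.refine_head(torch.cat([diff, pred], dim=-1))
        return pred  # predicted changes of the four alignment parameters

model = TwoStageAligner()
out = model(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 4])
```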
Citations: 0
MC-MVSNet: When multi-view stereo meets monocular cues
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113166
Xincheng Tang , Mengqi Rong , Bin Fan , Hongmin Liu , Shuhan Shen
Learning-based Multi-View Stereo (MVS) has become a key technique for reconstructing dense 3D point clouds from multiple calibrated images. However, real-world challenges such as occlusions and textureless regions often hinder accurate depth estimation. Recent advances in monocular Vision Foundation Models (VFMs) have demonstrated strong generalization capabilities in scene understanding, offering new opportunities to enhance the robustness of MVS. In this paper, we present MC-MVSNet, a novel MVS framework that integrates diverse monocular cues to improve depth estimation under challenging conditions. During feature extraction, we fuse conventional CNN features with VFM-derived representations through a hybrid feature fusion module, effectively combining local details and global context for more discriminative feature matching. We also propose a cost volume filtering module that enforces cross-view geometric consistency on monocular depth predictions, pruning redundant depth hypotheses to reduce the depth search space and mitigate matching ambiguity. Additionally, we leverage monocular surface normals to construct a curved patch cost aggregation module that aggregates costs over geometry-aligned curved patches, which improves depth estimation accuracy in curved and textureless regions. Extensive experiments on the DTU, Tanks and Temples, and ETH3D benchmarks demonstrate that MC-MVSNet achieves state-of-the-art performance and exhibits strong generalization capabilities, validating the effectiveness and robustness of the proposed method.
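The cost volume filtering module is described as pruning depth hypotheses that conflict with monocular depth predictions. Below is a hypothetical PyTorch sketch of that idea: hypotheses whose depth falls outside a relative band around the monocular estimate have their matching cost inflated. The band width and the assumption that the monocular depth is already scale-aligned are illustrative simplifications, not the paper's module.

```python
import torch

def filter_cost_volume(cost: torch.Tensor, hyps: torch.Tensor,
                       mono_depth: torch.Tensor, band: float = 0.1) -> torch.Tensor:
    """Monocular-guided cost-volume pruning (hypothetical sketch).

    cost:       (B, D, H, W) matching cost per depth hypothesis
    hyps:       (D,) candidate depth values
    mono_depth: (B, H, W) monocular depth prediction, assumed scale-aligned
    """
    gap = (hyps.view(1, -1, 1, 1) - mono_depth.unsqueeze(1)).abs()
    keep = (gap <= band * mono_depth.unsqueeze(1)).float()
    # Suppressed hypotheses get a large cost so they lose in the soft-argmin.
    return cost * keep + (1.0 - keep) * cost.max().detach()

B, D, H, W = 1, 8, 4, 4
filtered = filter_cost_volume(torch.rand(B, D, H, W),
                              torch.linspace(1.0, 8.0, D),
                              torch.full((B, H, W), 4.0))
print(filtered.shape)  # torch.Size([1, 8, 4, 4])
```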
Citations: 0
DuoNet: Joint optimization of representation learning and prototype classifier for unbiased scene graph generation
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113152
Zhaodi Wang , Biao Leng , Shuo Zhang
Unbiased Scene Graph Generation (SGG) aims to parse visual scenes into highly informative graphs under the long-tail challenge. While prototype-based methods have shown promise in unbiased SGG, they highlight the importance of learning discriminative features that are intra-class compact and inter-class separable. In this paper, we revisit prototype-based methods and analyze critical roles of representation learning and prototype classifier in driving unbiased SGG, and accordingly propose a novel framework DuoNet. To enhance intra-class compactness, we introduce a Bi-Directional Representation Refinement (BiDR2) module that captures relation-sensitive visual variability and within-relation visual consistency of entities. This module adopts relation-to-entity-to-relation refinement by integrating dual-level relation pattern modeling with a relation-specific entity constraint. Furthermore, a Knowledge-Guided Prototype Learning (KGPL) module is devised to strengthen inter-class separability by constructing an equidistributed prototypical classifier with maximum inter-class margins. The equidistributed prototype classifier is frozen during SGG training to mitigate long-tail bias, thus a knowledge-driven triplet loss is developed to strengthen the learning of BiDR2, enhancing relation-prototype matching. Extensive experiments demonstrate the effectiveness of our method, which sets new state-of-the-art performance on Visual Genome, GQA and Open Images datasets.
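One way to obtain an "equidistributed prototypical classifier with maximum inter-class margins" is to spread unit-norm class prototypes on the hypersphere before training and then freeze them. The sketch below does this by minimizing pairwise cosine similarities; it is only a plausible construction consistent with the abstract, not DuoNet's actual KGPL module.

```python
import torch
import torch.nn.functional as F

def equidistributed_prototypes(num_classes: int, dim: int,
                               steps: int = 500, lr: float = 0.1) -> torch.Tensor:
    """Spread unit-norm class prototypes by minimizing their pairwise cosine
    similarities, then freeze them (hypothetical stand-in for the
    maximum inter-class margin classifier in the abstract)."""
    protos = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([protos], lr=lr)
    for _ in range(steps):
        p = F.normalize(protos, dim=1)
        sim = p @ p.t()
        loss = sim.sum() - sim.diagonal().sum()   # off-diagonal similarities only
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=1)    # kept frozen during training

prototypes = equidistributed_prototypes(num_classes=5, dim=16)
print(prototypes @ prototypes.t())  # off-diagonal entries approach -1/(C-1)
```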
Citations: 0
DFWe: Efficient knowledge distillation of fine-tuned Whisper encoder for speech emotion recognition
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113161
Yujian Ma , Xianquan Jiang , Jinqiu Sang , Ruizhe Li
Despite the strong acoustic modeling capabilities of large pre-trained speech models such as Whisper, their direct application to speech emotion recognition (SER) is hindered by task mismatch, high computational cost, and limited retention of affective cues. To address these challenges, we propose DFWe (Distillation of Fine-Tuned Whisper Encoder), a two-stage knowledge distillation (KD) framework combining parameter-efficient adaptation with multi-objective supervision. In Stage 1, a subset of upper layers in the Whisper encoder is fine-tuned along with a lightweight projector and classification head, enabling the model to preserve general acoustic knowledge while adapting to emotion-specific features. In Stage 2, knowledge is distilled into a compact Whisper-Small student using a hybrid loss that integrates hard-label cross-entropy, confidence-aware soft-label KL divergence, and intermediate feature alignment via Centered Kernel Alignment (CKA). On the IEMOCAP dataset with 10-fold cross-validation (CV), DFWe achieves a 7.21 ×  reduction in model size while retaining 99.99% of the teacher’s unweighted average recall (UAR) and reaching 79.82% weighted average recall (WAR) and 81.32% UAR, representing state-of-the-art performance among knowledge-distillation-based SER methods. Ablation studies highlight the benefits of adaptive temperature scaling, multi-level supervision, and targeted augmentation in improving both accuracy and robustness. Case analyses further show that DFWe yields more confident and stable predictions in emotionally ambiguous scenarios, underscoring its practical effectiveness. Overall, DFWe offers a scalable, generalizable solution for deploying SER systems in resource-constrained environments.
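The hybrid objective (hard-label cross-entropy, confidence-aware soft-label KL, and CKA-based feature alignment) can be pieced together from standard components. The following hypothetical PyTorch sketch assumes teacher confidence is taken as the maximum soft-label probability; the loss weights and temperature are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA similarity between two feature matrices of shape (n, d)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    return (y.t() @ x).norm() ** 2 / ((x.t() @ x).norm() * (y.t() @ y).norm() + 1e-8)

def hybrid_kd_loss(s_logits, t_logits, labels, s_feat, t_feat,
                   T: float = 2.0, w_soft: float = 0.5, w_cka: float = 0.1):
    hard = F.cross_entropy(s_logits, labels)
    log_p_s = F.log_softmax(s_logits / T, dim=-1)
    p_t = F.softmax(t_logits / T, dim=-1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)   # per-sample KL
    conf = p_t.max(dim=-1).values                               # confidence proxy
    soft = (conf * kl).mean() * T * T
    cka = 1.0 - linear_cka(s_feat, t_feat)                      # feature alignment
    return hard + w_soft * soft + w_cka * cka

loss = hybrid_kd_loss(torch.randn(8, 4), torch.randn(8, 4),
                      torch.randint(0, 4, (8,)),
                      torch.randn(8, 32), torch.randn(8, 64))
print(loss.item())
```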
Citations: 0
S2I-DiT: Unlocking the semantic-to-image transferability by fine-tuning large diffusion transformer models
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-25 | DOI: 10.1016/j.patcog.2026.113158
Gang Li , Enze Xie , Chongjian Ge , Xiang Li , Lingyu Si , Changwen Zheng , Zhenguo Li
Denoising Diffusion Probabilistic Models (DDPMs) have made significant progress in image generation. Recent works in semantic-to-image (S2I) synthesis have also shifted from the previously de facto GAN-based methods to DDPMs, yielding better results. However, these works mostly employ a U-Net structure and vanilla training-from-scratch scheme for S2I, unconsciously neglecting the potential benefits offered by task-related pre-training. In this work, we introduce a Transformer-based architecture, namely S2I-DiT, and reconsider the merits of a pre-trained large diffusion model for cross-task adaptation (i.e., from the class-conditional generation to S2I). In S2I-DiT, we propose the integration of semantic embedders within Diffusion Transformers (DiTs) to maximize the utilization of semantic information. The semantic embedder densely encodes semantic layouts to guide the adaptive normalization process. We configure semantic embedders in a layer-wise manner to learn pixel-level correspondence, enabling finer-grained semantic-to-image control. Besides, to fully unleash the cross-task transferability of DDPMs, we introduce a two-stage fine-tuning strategy, which involves initially adapting the semantic embedders in the pixel-level space, followed by fine-tuning the partial/entire model for cross-task adaptation. Notably, S2I-DiT pioneers the application of Large Diffusion Transformers to cross-task fine-tuning. Extensive experiments on four benchmark datasets demonstrate S2I-DiT’s effectiveness, as it achieves state-of-the-art performance in terms of quality (FID) and diversity (LPIPS), while consuming fewer training iterations. This work establishes a new state-of-the-art for semantic-to-image generation and provides valuable insights into cross-task transferability of large generative models.
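The semantic embedder is described as densely encoding the layout to guide adaptive normalization. The sketch below illustrates that general pattern (a one-hot label map predicting per-channel, per-pixel scale and shift for a normalization layer); the module name, group count, and 1x1 embedder are assumptions, not the S2I-DiT architecture.

```python
import torch
import torch.nn as nn

class SemanticAdaNorm(nn.Module):
    """Hypothetical layer-wise semantic embedder driving adaptive normalization.

    The semantic layout is a one-hot label map; scale and shift are predicted
    per channel and per pixel so the guidance stays spatially dense.
    """
    def __init__(self, num_classes: int, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels, affine=False)
        self.embed = nn.Conv2d(num_classes, 2 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        scale, shift = self.embed(layout).chunk(2, dim=1)
        return self.norm(x) * (1 + scale) + shift

x = torch.randn(2, 64, 16, 16)
layout = torch.zeros(2, 10, 16, 16)
layout[:, 0] = 1.0  # toy one-hot layout: every pixel belongs to class 0
print(SemanticAdaNorm(10, 64)(x, layout).shape)  # torch.Size([2, 64, 16, 16])
```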
Citations: 0
Learning discriminative features within forward-Forward algorithm using convolutional prototype
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113139
Qiufu Li , Zewen Li , Linlin Shen
Compared to back-propagation algorithms, the forward-forward (FF) algorithm proposed by Hinton [1] can optimize all layers of a deep network model in parallel, while requiring less storage and achieving higher computational efficiency. However, current FF methods cannot fully leverage the label information of samples, which suppresses the learning of discriminative features. In this paper, we propose prototype learning within the FF algorithm (PLFF). When optimizing each convolutional layer, PLFF first divides the convolutional kernels into groups according to the number K of classes; these groups serve as class prototypes during optimization and are referred to as convolutional prototypes. For every sample, K goodness scores are calculated from the convolution results between the sample data and the convolutional prototypes. Then, using multiple binary cross-entropy losses, PLFF maximizes the positive goodness score corresponding to the sample label while minimizing the other, negative goodness scores, so as to learn discriminative features. Meanwhile, PLFF maximizes the cosine distances among the K convolutional prototypes, which enhances their discrimination and, in turn, promotes feature learning. Image classification results across multiple datasets show that PLFF achieves the best results among different FF methods. Finally, for the first time, we verify the long-tailed recognition performance of different FF methods, demonstrating that our PLFF achieves superior results.
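To make the per-class goodness idea concrete, here is a hypothetical PyTorch sketch: the filters of one convolutional layer are split into K class groups, a sample's goodness for class k is the mean squared activation of group k, and the layer is trained locally with binary cross-entropy so the labeled group scores high and the others low. The threshold, filter counts, and exact losses are illustrative, not PLFF's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeFFLayer(nn.Module):
    """Per-class 'goodness' from grouped conv kernels (hypothetical sketch)."""
    def __init__(self, in_ch: int, filters_per_class: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        self.conv = nn.Conv2d(in_ch, filters_per_class * num_classes, 3, padding=1)

    def goodness(self, x: torch.Tensor) -> torch.Tensor:
        act = F.relu(self.conv(x))                       # (B, K*F, H, W)
        act = act.view(x.size(0), self.num_classes, -1)  # group filters by class
        return act.pow(2).mean(dim=-1)                   # (B, K)

    def local_loss(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        g = self.goodness(x)
        targets = F.one_hot(labels, self.num_classes).float()
        # Push goodness above 1.0 for the true class, below for the rest.
        return F.binary_cross_entropy_with_logits(g - 1.0, targets)

layer = PrototypeFFLayer(in_ch=3, filters_per_class=4, num_classes=10)
loss = layer.local_loss(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
print(loss.item())
```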
Citations: 0
EEnvA-Mamba: Effective and environtology-aware adaptive Mamba for road object detection in adverse weather scenes
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113127
Yonglin Chen , Binzhi Fan , Nan Liu , Yalong Yang , Jinhui Tang
Adverse weather conditions severely degrade visual perception in autonomous driving systems, primarily due to image quality deterioration, object occlusion, and unstable illumination. Current deep learning-based detection methods exhibit limited robustness in such scenarios, as corrupted features and inefficient algorithmic adaptation impair their performance under weather variations. To overcome these challenges, we propose EEnvA-Mamba, a computationally efficient architecture that synergizes real-time processing with high detection accuracy. The framework features three core components: (1) AVSSBlock, a vision state-space block that incorporates environment-aware gating (dynamically adjusting feature-channel weights based on weather conditions) and weather-conditioned channel weighting (unequal channel responses under different weather types), effectively mitigating feature degradation; (2) a linear-complexity computation scheme that replaces conventional quadratic Transformer operations while preserving discriminative feature learning; (3) AStem, an attention-guided dual-branch module that strengthens local feature extraction via spatial-channel interactions while employing frequency-domain denoising techniques to suppress noise across various frequencies, ensuring precise dependency modeling. To support rigorous validation, we collected and annotated VOC-SNOW, a dedicated snowy-road dataset comprising 2700 annotated images with diverse illumination and snowfall levels. Comparative experiments on multiple datasets verify our method's superiority, demonstrating state-of-the-art performance with 66.4% APval (4.5% higher than leading counterparts). The source code has been released at https://github.com/fbzahwy/EEnvA-Mamba.
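The environment-aware gating in (1) amounts to rescaling feature channels from an environment descriptor. A minimal hypothetical sketch follows: a small MLP maps a weather/environment embedding to per-channel weights. Names and sizes are assumptions, not the paper's AVSSBlock.

```python
import torch
import torch.nn as nn

class EnvAwareGate(nn.Module):
    """Weather-conditioned channel gating (hypothetical sketch)."""
    def __init__(self, channels: int, env_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(env_dim, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, env: torch.Tensor) -> torch.Tensor:
        # env could be, e.g., predicted weather logits; one weight per channel.
        w = self.gate(env).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat * w

out = EnvAwareGate(channels=64, env_dim=4)(torch.randn(2, 64, 32, 32),
                                           torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```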
Citations: 0
FeatureSORT: A robust tracker with optimized feature integration
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113148
Hamidreza Hashempoor, Rosemary Koikara, Yu Dong Hwang
We introduce FeatureSORT, a simple yet effective online multiple object tracker that reinforces the baselines with a redesigned detector and additional feature cues, while keeping computational complexity low. In contrast to conventional detectors that only provide bounding boxes, our designed detector architecture is extended to output multiple appearance attributes, including clothing color, clothing style, and motion direction, alongside the bounding boxes. These feature cues, together with a ReID network, form complementary embeddings that substantially improve association accuracy. The rationale behind selecting and combining these attributes is thoroughly examined in extensive ablation studies. Furthermore, we incorporate stronger post-processing strategies, such as global linking and Gaussian Smoothing Process interpolation, to handle missing associations and detections. During online tracking, we define a measurement-to-track distance function that jointly considers IoU, direction, color, style, and ReID similarity. This design enables FeatureSORT to maintain consistent identities through longer occlusions while reducing identity switches. Extensive experiments on standard MOT benchmarks demonstrate that FeatureSORT achieves state-of-the-art (SOTA) online performance, with MOTA scores of 79.7 on MOT16, 80.6 on MOT17, 77.9 on MOT20, and 92.2 on DanceTrack, underscoring the effectiveness of feature-enriched detection in advancing multi-object tracking. Our Github repository includes code implementation.
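The fused measurement-to-track distance (IoU, direction, color, style, and ReID similarity) can be illustrated as a weighted sum of per-cue distances. The sketch below is hypothetical: the dictionary fields and the weights are placeholders, not FeatureSORT's tuned formulation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track_distance(det, trk, w=(0.4, 0.3, 0.1, 0.1, 0.1)):
    """Fused measurement-to-track distance (lower is a better match).

    `det` and `trk` are dicts holding a box, a ReID embedding, a direction
    vector, and color/style labels; weights are illustrative only.
    """
    cos = lambda u, v: float(np.dot(u, v) /
                             (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
    d_iou   = 1.0 - iou(det["box"], trk["box"])
    d_reid  = 1.0 - cos(det["reid"], trk["reid"])
    d_dir   = 1.0 - cos(det["direction"], trk["direction"])
    d_color = 0.0 if det["color"] == trk["color"] else 1.0
    d_style = 0.0 if det["style"] == trk["style"] else 1.0
    return sum(wi * di for wi, di in zip(w, (d_iou, d_reid, d_dir, d_color, d_style)))

det = {"box": (0, 0, 10, 10), "reid": np.ones(8), "direction": np.array([1.0, 0.0]),
       "color": "red", "style": "jacket"}
trk = {"box": (1, 1, 11, 11), "reid": np.ones(8), "direction": np.array([1.0, 0.1]),
       "color": "red", "style": "jacket"}
print(round(track_distance(det, trk), 3))
```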
Citations: 0
PoseAdapter: Efficiently transferring 2D human pose estimator to 3D whole-body task via adapter
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113154
Ze Feng , Sen Yang , Jiang-Jiang Liu , Wankou Yang
In this paper, we explore the task of 3D whole-body pose estimation from a single-frame image and propose a new paradigm called PoseAdapter, which exploits a well-pretrained 2D human pose estimation model equipped with an Adapter. The mainstream paradigms for 3D human pose estimation typically require multiple stages, such as human box detection, 2D pose estimation, and lifting to 3D coordinates. Such a multi-stage approach tends to lose context information in the compression process, resulting in inferior pose results, particularly for dense prediction tasks such as 3D whole-body pose estimation. To improve the accuracy of pose estimation, some methods even use multi-frame fusion to enhance the current pose, including input from future frames, which is inherently non-causal. Considering that end-to-end 2D human pose methods can extract human-related and keypoint-specific visual features, we employ such a model as a general vision-based human analysis model and enable it to predict 3D whole-body poses. By freezing most of the parameters of the 2D model and tuning the newly added adapter, PoseAdapter transfers the 2D estimator to the 3D pose task in a parameter-efficient manner, while retaining the original ability to distinguish multiple human instances. Quantitative experimental results on H3WB demonstrate that PoseAdapter achieves 62.74 mm MPJPE with fewer trainable parameters. Qualitative results also show that PoseAdapter can predict multi-person 3D whole-body poses and generalizes to out-of-domain datasets such as COCO.
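The core recipe (freeze the pretrained 2D backbone, train only a lightweight adapter and a new 3D head) is a standard parameter-efficient tuning pattern. Below is a minimal hypothetical sketch using a bottleneck adapter; the backbone, adapter placement, and the 133-keypoint output size are illustrative assumptions, not PoseAdapter's architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Toy stand-in for the pretrained 2D pose backbone.
backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():       # freeze the pretrained 2D estimator
    p.requires_grad = False
adapter = Adapter(dim=256)            # only these parameters are trained
head_3d = nn.Linear(256, 133 * 3)     # e.g., 133 whole-body keypoints in 3D

tokens = torch.randn(4, 256)
coords = head_3d(adapter(backbone(tokens)))
print(coords.shape)  # torch.Size([4, 399])
```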
Citations: 0