
Pattern Recognition: Latest Publications

Boosting the patch-based self-supervised learning through past-to-present smoothing
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.patcog.2025.112871
Hanpeng Liu, Shuoxi Zhang, Kaiyuan Gao, Kun He
Self-supervised learning (SSL) has recently achieved remarkable success in computer vision, primarily through joint embedding architectures. These models train dual networks by aligning different augmentations of the same image, as well as preventing feature space collapse. Building upon this, previous work establishes a mathematical connection between joint embedding SSL and the co-occurrences of image patches. Moreover, there have been a number of efforts to scale patch-based SSL to a vast number of image patches, demonstrating rapid convergence and notable performance. However, the efficiency of these methods is hindered by the excessive use of cropped patches. To address this issue, we propose a novel framework named Past-to-Present (P2P) smoothing that leverages the model’s previous outputs as a supervisory signal. Specifically, we divide the patch augmentations of a single image into two portions. One portion is used to update the model at iteration t−1 and retained as past information for iteration t. The other portion is used for comparison in iteration t, serving as present information complementary to the past. This design allows us to spread the patches of the same image across different batches, thereby enhancing the utilization rate of patch-based learning in our model. Through extensive experimentation and validation, our method achieves outstanding accuracy, scoring 94.2 % on CIFAR-10, 74.2 % on CIFAR-100, 49.5 % on TinyImageNet, and 78.2 % on ImageNet-100. In addition, further experiments demonstrate its enhanced transferability to out-of-domain datasets when compared to other SSL baselines.
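A minimal sketch of the past-to-present idea as this abstract describes it: one portion of an image's patch embeddings is computed and cached at iteration t−1, and the other portion, computed at iteration t, is aligned against the cache. The buffer class, the linear encoder stand-in, and the cosine alignment loss below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class P2PBuffer:
    """Caches per-image 'past' patch embeddings produced at the previous iteration."""
    def __init__(self):
        self.past = {}  # image_id -> detached embedding

    def update(self, image_ids, embeddings):
        for i, idx in enumerate(image_ids):
            self.past[int(idx)] = embeddings[i].detach()

    def lookup(self, image_ids):
        return torch.stack([self.past[int(idx)] for idx in image_ids])

def align_loss(present, past):
    # Pull the present embeddings toward the cached past embeddings of the same images.
    return -F.cosine_similarity(present, past, dim=-1).mean()

# Toy usage: iteration t-1 stores one portion of patch embeddings, iteration t compares
# the other portion against the cache, so patches of one image span two batches.
encoder = torch.nn.Linear(768, 256)                  # stand-in for the patch encoder
buffer = P2PBuffer()
ids = torch.arange(8)
buffer.update(ids, encoder(torch.randn(8, 768)))     # "past" portion at iteration t-1
loss = align_loss(encoder(torch.randn(8, 768)),      # "present" portion at iteration t
                  buffer.lookup(ids))
loss.backward()
```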
Citations: 0
Text-guided weakly supervised framework for dynamic facial expression recognition
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.patcog.2025.112910
Gunho Jung, Heejo Kong, Seong-Whan Lee
Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
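A minimal sketch of the weakly supervised relevance idea described here: frame features are scored by similarity to a text embedding of the video-level emotion description, and the video representation is a relevance-weighted pool over frames. The encoders are random stand-ins rather than the VLP model, and the softmax pooling form is an assumption.

```python
import torch
import torch.nn.functional as F

def frame_relevance_pool(frame_feats, text_emb, temperature=0.07):
    """frame_feats: (T, D) per-frame features; text_emb: (D,) emotion description embedding."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    relevance = frame_feats @ text_emb                        # (T,) frame-to-text similarity
    weights = torch.softmax(relevance / temperature, dim=0)   # frame-level relevance estimate
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0), weights

# Toy usage: 16 frames with 512-d features and one label-description embedding.
frames = torch.randn(16, 512)
text = torch.randn(512)   # e.g. an embedding of "a face gradually breaking into a smile"
video_repr, frame_weights = frame_relevance_pool(frames, text)
```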
Citations: 0
LFT-Net: A lightweight frequency-based transformer for low-light enhancement and exposure correction
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.patcog.2025.112887
Zilong Qi, Shijie Sun, Xinyi Liu, Luqian Zhou, Jingqi Qiao, Li Zhu
Low-light images often suffer from composite degradations, including underexposure, overexposure, and diminished visibility. Existing low-light enhancement and exposure correction methods demonstrate two critical limitations: computational inefficiency and neglect of frequency-specific restoration patterns. Inspired by the concept that high and low frequency components capture different aspects of images, we propose LFT-Net, a lightweight frequency-based transformer for low-light enhancement and exposure correction. The model features a dual-branch architecture with parallel detail processing and global correction pathways. The detail branch produces the base-enhanced image, while the correction branch refines it through gamma adjustment and color calibration. We propose a High-Low Frequency Enhancement (HLFE) module, which leverages two branches to independently extract and enhance high and low frequency features, and a Residual Detail Enhancement (RDE) module to improve the generation of local image components. Evaluation on benchmark datasets for low-light enhancement and exposure correction, including the LOL and ME datasets, demonstrates that our method surpasses existing models in performance. Our code is available at https://github.com/Qqsoe/low-light-enhancement.git.
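A minimal sketch of the high/low-frequency split that motivates the HLFE module: a blur gives the low-frequency component, the residual carries the high-frequency detail, and each goes through its own branch. The average-pooling blur and the 3x3 convolution branches are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLowSplit(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.low_branch = nn.Conv2d(channels, channels, 3, padding=1)   # illumination / global path
        self.high_branch = nn.Conv2d(channels, channels, 3, padding=1)  # edge / texture path

    def forward(self, x):
        low = F.avg_pool2d(x, kernel_size=5, stride=1, padding=2)  # crude low-pass filter
        high = x - low                                              # residual keeps fine detail
        return self.low_branch(low) + self.high_branch(high)

x = torch.rand(1, 3, 64, 64)          # a toy low-light image in [0, 1]
enhanced = HighLowSplit()(x)
```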
Citations: 0
Global and local Mamba network for multi-modality medical image super-resolution
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.patcog.2025.112888
Zexin Ji, Beiji Zou, Xiaoyan Kui, Sébastien Thureau, Su Ruan
Convolutional neural networks and Transformers have made significant progress in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or impose significant computational burdens for global learning, limiting super-resolution performance. To solve this problem, State Space Models, notably Mamba, are introduced to efficiently model long-range dependencies in images with linear computational complexity. Relying on Mamba and on the fact that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. To be specific, our GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use a deform block to adaptively extract features of both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both the global and local Mamba blocks. To effectively incorporate reference guidance into low-resolution image super-resolution (SR), we further develop a multi-modality feature fusion block to adaptively fuse features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed for sufficient enhancement of edge textures and contrast in medical images. Quantitative and qualitative experimental results show that our GLMamba achieves superior super-resolution performance on the BraTS2021, IXI and fastMRI datasets. We also validate the effectiveness of our approach on the downstream tumor segmentation task.
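A minimal sketch of an edge-focused objective in the spirit of the contrastive edge loss mentioned above: Sobel gradients of the super-resolved image are pulled toward those of the ground truth and pushed away from those of the upsampled low-resolution input. The margin and the exact formulation are assumptions, not the paper's definition of CELoss.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """img: (N, 1, H, W) grayscale image; returns the gradient magnitude map."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def contrastive_edge_loss(sr, hr, lr_up, margin=0.0):
    pos = F.l1_loss(sobel_edges(sr), sobel_edges(hr))      # pull toward sharp reference edges
    neg = F.l1_loss(sobel_edges(sr), sobel_edges(lr_up))   # push away from blurry input edges
    return torch.clamp(pos - neg + margin, min=0.0)

# Toy usage with a synthetic blurry input obtained by down/upsampling the reference.
sr = torch.rand(2, 1, 64, 64, requires_grad=True)
hr = torch.rand(2, 1, 64, 64)
lr_up = F.interpolate(F.avg_pool2d(hr, 2), scale_factor=2, mode="bilinear", align_corners=False)
loss = contrastive_edge_loss(sr, hr, lr_up)
loss.backward()
```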
Citations: 0
Multimodal alignment of event and text streams in spiking neural networks for human action recognition
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.patcog.2025.112897
Ziliang Ren, Jiaqi Chen, Qieshi Zhang, Weiyu Yu, Fuyong Zhang
Event cameras provide advantages for efficient Human Action Recognition (HAR) due to their sparse output. Spiking Neural Networks (SNNs) offer a promising approach for processing this sparse data with low energy consumption. However, existing SNN methods focus only on spatiotemporal patterns in event streams, creating a semantic gap that hampers distinguishing similar actions. Integrating semantic guidance from text into SNNs remains largely unexplored but is vital for linking neuromorphic sensing with high-level reasoning. Conversely, multimodal frameworks based on Artificial Neural Networks (ANNs) face challenges in computational efficiency and natively processing sparse event streams. To address these limitations, we propose the Spiking Semantic-Auxiliary Event-Variant (SSAEV) fusion framework. SSAEV integrates a dual event stream with a text stream utilizing Large Language Model (LLM)-generated action descriptions. Additionally, we introduce a novel Spiking Integrate-and-Fire Approximation (SIFA) neuron to enhance temporal modeling and develop a contrastive loss function that semantically aligns event stream and text embeddings. Experimental results demonstrate that SSAEV achieves competitive accuracy while maintaining low computational cost on the THU E-ACT-50-CHL, SeACT, UCF101-DVS, and DailyDVS-200 datasets. This performance underscores the framework’s effectiveness in bridging neuromorphic sensing with semantic reasoning for HAR.
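A minimal sketch of the event-text alignment described here: pooled event-stream embeddings are contrastively matched to embeddings of LLM-generated action descriptions with a symmetric InfoNCE objective. The spiking encoder is out of scope; both encoders below are illustrative linear stand-ins.

```python
import torch
import torch.nn.functional as F

def event_text_contrastive(event_emb, text_emb, temperature=0.07):
    """event_emb, text_emb: (B, D); row i of each modality describes the same action."""
    event_emb = F.normalize(event_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = event_emb @ text_emb.t() / temperature
    targets = torch.arange(event_emb.size(0))
    # Symmetric cross-entropy: event-to-text and text-to-event directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

event_encoder = torch.nn.Linear(128, 64)   # stand-in for the spiking event encoder
text_encoder = torch.nn.Linear(300, 64)    # stand-in for the action-description encoder
loss = event_text_contrastive(event_encoder(torch.randn(8, 128)),
                              text_encoder(torch.randn(8, 300)))
loss.backward()
```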
Citations: 0
Global context guided refinement and aggregation network for lightweight surface defect detection
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-10 | DOI: 10.1016/j.patcog.2025.112893
Feng Yan, Xiaoheng Jiang, Yang Lu, Lisha Cui, Jiale Cao, Mingliang Xu
Surface defect detection (SDD) is an important task in industrial manufacturing to ensure product quality, which is challenging due to weak defect appearances and background distractions. Despite the great advances in this task, few pixel-level defect detection methods achieve a satisfactory trade-off between accuracy and running efficiency. To this end, we develop a Global Context Guided Refinement and Aggregation Network (GCRANet) for lightweight surface defect detection, which fully utilizes global guidance to highlight defect details and suppress background noise in the lightweight network. Specifically, a lightweight Depthwise Self-Attention (DSA) module with linear complexity is introduced to capture global information based on deep features. Global information is combined with local features to capture more complete contours for weak defects. Furthermore, a Channel Cross-Attention (CCA) module is introduced to suppress background noise from multi-level features by exploiting channel dependencies between low-level features and semantic features. The experimental results on public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency than other state-of-the-art methods. Specifically, the proposed method achieves detection speed of 272.2 fps with 1.84M parameters, while yielding competitive accuracy (SD-saliency-900: WF of 91.79 %; Magnetic tile: WF of 80.64 %; DAGM 2007: WF of 86.53 %; CrackSeg9k: WF of 68.81 %; MVTec AD: WF of 80.37 % and 76.36 % on texture and object categories).
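A simplified, hedged stand-in for the channel interaction that the CCA module aims at: a channel descriptor of the deep semantic features gates the channels of the low-level features, damping background-dominated channels. The squeeze-and-gate form, shapes, and reduction ratio are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelCrossGate(nn.Module):
    def __init__(self, low_ch, high_ch, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(high_ch, high_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(high_ch // reduction, low_ch), nn.Sigmoid())

    def forward(self, low_feat, high_feat):
        # low_feat: (N, C_low, H, W) shallow features; high_feat: (N, C_high, H', W') semantics
        descriptor = high_feat.mean(dim=(2, 3))          # global channel descriptor of semantics
        gate = self.mlp(descriptor).unsqueeze(-1).unsqueeze(-1)
        return low_feat * gate                            # re-weight low-level channels

cca = ChannelCrossGate(low_ch=64, high_ch=256)
out = cca(torch.randn(1, 64, 80, 80), torch.randn(1, 256, 20, 20))
```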
Citations: 0
RVGCL: Towards robust recommendation via graph contrastive learning with variational inference
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-10 | DOI: 10.1016/j.patcog.2025.112891
Fengjie Li, Hong Zhang, Mingyu Zhang, Liqiang Wang, Miao Wang
Inspired by the widespread success of Graph Neural Networks (GNNs) and contrastive learning, graph contrastive learning (GCL) has gained traction in recommender systems, demonstrating its powerful capabilities in representation learning and debiasing. However, current GCL approaches rely on heuristic graph and feature augmentations, which inevitably distort the original data semantics. Additionally, real-world recommendation scenarios are usually filled with noisy signals, yet existing methods ignore the inherent robustness needed to address this challenge. In this paper, we propose a robust recommendation framework based on GCL with Variational inference, named RVGCL. This framework guides the construction of harder contrastive views from a generative perspective, inducing the model to learn robust perturbation boundaries. Specifically, variational graph inference (VGI) is primarily designed to estimate the posterior distribution of the original data, and its incorporation helps the model preserve the robustness of the graph structure when generating adversarial perturbations. Moreover, a global-local contrastive strategy is designed to achieve an optimal balance in contrastive learning. We conduct comprehensive experiments on five benchmark datasets to demonstrate the effectiveness and robustness of the proposed model.
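A minimal sketch of one common form of global-local contrast (DGI-style), included only to make the phrase "global-local contrastive strategy" concrete: each node embedding is pulled toward the readout of its own graph view and pushed away from the readout of a corrupted view. This is an illustrative stand-in, not the RVGCL objective or its variational graph inference.

```python
import torch
import torch.nn.functional as F

def global_local_contrast(node_emb, corrupted_emb, temperature=0.2):
    """node_emb, corrupted_emb: (N, D) node embeddings from a clean and a corrupted view."""
    node_emb = F.normalize(node_emb, dim=-1)
    corrupted_emb = F.normalize(corrupted_emb, dim=-1)
    global_pos = F.normalize(node_emb.mean(dim=0), dim=-1)        # readout of the clean view
    global_neg = F.normalize(corrupted_emb.mean(dim=0), dim=-1)   # readout of the corrupted view
    pos = torch.exp(node_emb @ global_pos / temperature)          # local vs. own global
    neg = torch.exp(node_emb @ global_neg / temperature)          # local vs. corrupted global
    return -torch.log(pos / (pos + neg)).mean()

nodes = torch.randn(100, 64, requires_grad=True)   # stand-in for GNN node embeddings
corrupted = nodes[torch.randperm(100)].detach()    # row-shuffling corruption
loss = global_local_contrast(nodes, corrupted)
loss.backward()
```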
Citations: 0
DPtSTrip: Adversarially robust learning with distance-aware point-to-set triplet loss
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-10 | DOI: 10.1016/j.patcog.2025.112840
Ran Wang, Meng Hu, Xinlei Zhou, Yuheng Jia
The vulnerability of deep neural networks (DNNs) in the face of adversarial attacks has led to serious unreliability issues. In particular, the adversarial perturbations shift the latent representations of examples, causing the perturbed examples to easily cross the classification boundary and be misclassified. In order to establish more robust classifiers, it is necessary to restrict the latent space with properly designed objective function. In this paper, a distance-aware point-to-set triplet loss is proposed for adversarial training, named as DPtSTrip. It treats each perturbed example in latent space as an anchor point, and constructs the reference sets for the anchor point based on all clean examples. By constraining the positional relationships between the anchor point and its reference sets, the perturbed example is forced to be closer to its ground-truth class while farther away from all the false classes, thereby enhancing model robustness. A distance-aware weighting strategy is designed, which enables the model to pay more attention to important examples. Moreover, the within-class distance of clean examples is used as an additional regularization term, further improving the intra-class compactness. We provide a theoretical explanation from the perspective of prediction uncertainty to illustrate why the proposed DPtSTrip can bring better robustness. Extensive experiments on datasets MNIST, CIFAR-10, CIFAR-100, SVHN and Tiny-ImageNet demonstrate the comprehensive effectiveness and superiority of the proposed method. Compared to the classic triplet loss, it achieves a robust accuracy improvement up to 7.53 % on benchmark datasets in the face of typical attacks.
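A minimal sketch of a point-to-set triplet term as the abstract describes it: the perturbed example is the anchor point, the positive set holds clean examples of its ground-truth class, a negative set holds clean examples of a false class, and the point-to-set distance is the mean distance to the set. The distance-aware weighting is omitted, and the margin and Euclidean metric are assumptions.

```python
import torch
import torch.nn.functional as F

def point_to_set_distance(anchor, ref_set):
    """anchor: (D,), ref_set: (M, D); mean Euclidean distance from the point to the set."""
    return torch.cdist(anchor.unsqueeze(0), ref_set).mean()

def point_to_set_triplet(anchor, positive_set, negative_set, margin=1.0):
    d_pos = point_to_set_distance(anchor, positive_set)   # pull toward the true class set
    d_neg = point_to_set_distance(anchor, negative_set)   # push away from a false class set
    return F.relu(d_pos - d_neg + margin)

# Toy usage: a perturbed embedding vs. clean embeddings of the true and a false class.
anchor = torch.randn(32, requires_grad=True)
loss = point_to_set_triplet(anchor, torch.randn(10, 32), torch.randn(10, 32))
loss.backward()
```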
Citations: 0
Dynamic prototype with discriminative representation for rapid adaptation in new organ segmentation
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-10 | DOI: 10.1016/j.patcog.2025.112870
Hailing Wang, Yu Chen, Xinyue Zhang, Guitao Cao, Wenming Cao
Recent work in label-efficient prototype-based learning has demonstrated significant potential for rapid adaptation in new organ segmentation. However, a prevalent challenge in prototype extraction within the medical domain is semantic bias. To address this issue, we propose a Dynamic Prototype with Discriminative Representation Network (DPDRNet) to enhance the effectiveness of semantic class prototypes for new organs. Specifically, we introduce a self-attention mechanism to generate dynamic prototypes, enhancing the efficient utilization of local information. This is accomplished by capturing interdependencies among pixel-level prototypes from limited labeled samples. Subsequently, we design a prototype contrastive learning method to maintain the discriminative representation of the dynamic prototype in the high-level feature space. This method enhances the correlation between the dynamic prototype and foreground features while simultaneously increasing the distinction from background features. By incorporating a self-attention mechanism with contrastive learning, the proposed dynamic prototype exhibits enhanced generalization capabilities, facilitating more precise segmentation of new organ structures. Experimental results demonstrate that our method achieves effective performance on cardiac and abdominal MRI segmentation tasks.
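A minimal sketch of the prototype pipeline this line of work builds on: a class prototype comes from masked average pooling over labeled support features, and query pixels are scored by cosine similarity to that prototype. The self-attention that makes the prototype dynamic and the prototype contrastive loss are omitted; this is the generic baseline, not DPDRNet itself.

```python
import torch
import torch.nn.functional as F

def masked_average_prototype(support_feat, support_mask):
    """support_feat: (C, H, W); support_mask: (H, W) binary mask of the new organ."""
    weighted = support_feat * support_mask.unsqueeze(0)
    return weighted.sum(dim=(1, 2)) / (support_mask.sum() + 1e-6)   # (C,) prototype

def prototype_similarity_map(query_feat, prototype):
    """query_feat: (C, H, W); returns a per-pixel cosine similarity map to the prototype."""
    return F.cosine_similarity(query_feat, prototype.view(-1, 1, 1), dim=0)

support = torch.randn(64, 32, 32)                  # toy support feature map
mask = (torch.rand(32, 32) > 0.7).float()          # toy organ mask
proto = masked_average_prototype(support, mask)
score = prototype_similarity_map(torch.randn(64, 32, 32), proto)   # (32, 32) map
```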
Citations: 0
DiMo: Diffusion transformers for monocular human motion estimation in the world system
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-10 | DOI: 10.1016/j.patcog.2025.112884
Xuesi Qiu, Zhao Wang, Jun Xiao, Yinyu Nie
Estimating human motion in the world system from monocular RGB videos demands precise pose predictions in both spatial and temporal space. This task is made challenging by factors such as 2D-to-3D ambiguity, multiple occlusions, and, most importantly, the movement of both the camera and the human body. Existing methods often rely on paired 3D supervision to derive global trajectories and shapes from videos, which is difficult to scale. Weakly supervised methods have also been employed, focusing on aligning with frame-wise 2D keypoint detections. However, this often leads to motion jitter in 3D space. We observe that the incoherency and sudden changes in the action recordings are mostly caused by camera movement, since the human motion itself is usually consistent and continuous in the world system. In this paper, we introduce a novel optimization method, DiMo. It ensures alignment between prediction results and 2D observations as well as conformity to the natural motion distribution. To this end, a motion prior module based on a Diffusion Transformer is designed to align our motion segments with the learned natural motion distribution. In addition, DiMo does not require paired 3D supervision for motion optimization. Furthermore, the proposed motion prior features a plug-and-play design, making it a versatile module that can enhance existing methods. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various datasets, including EgoBody, 3DPW, RICH and EMDB.
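A minimal sketch of the kind of test-time objective this line of work optimizes: a global 3D joint trajectory is fitted to 2D keypoint detections under a simple camera model while a smoothness term stands in for the learned motion prior. The orthographic projection, weights, and optimizer are assumptions; DiMo replaces the hand-crafted regularizer with a diffusion-transformer prior.

```python
import torch

def reprojection_loss(joints_3d, keypoints_2d, scale, trans):
    """joints_3d: (T, J, 3); keypoints_2d: (T, J, 2); orthographic camera (scale, trans)."""
    proj = scale * joints_3d[..., :2] + trans            # (T, J, 2) projected joints
    return ((proj - keypoints_2d) ** 2).mean()

def smoothness_prior(joints_3d):
    return ((joints_3d[1:] - joints_3d[:-1]) ** 2).mean()  # penalize frame-to-frame jitter

T, J = 30, 24
keypoints_2d = torch.rand(T, J, 2)                       # toy 2D detections
joints_3d = torch.zeros(T, J, 3, requires_grad=True)     # trajectory to be optimized
optimizer = torch.optim.Adam([joints_3d], lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    loss = reprojection_loss(joints_3d, keypoints_2d, scale=1.0, trans=0.0) \
           + 0.1 * smoothness_prior(joints_3d)
    loss.backward()
    optimizer.step()
```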
Citations: 0