
Latest Publications in Computer Science

Learning modality knowledge with proxy for RGB-Infrared object detection
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.patcog.2026.113227
You Ma, Lin Chai, Shihan Mao, Yucheng Zhang
RGB-infrared object detection aims to improve detection performance in complex environments by integrating complementary information from RGB and infrared images. While transformer-based methods have demonstrated significant advancements in this field by directly modeling dense relationships between modality tokens to enable cross-modality long-range interactions, they neglect the inherent discrepancies in feature distributions across modalities. Such discrepancies attenuate the reliability of the established relationships, thereby restricting the effective exploitation of complementary information between modalities. To alleviate this problem, we propose a framework for learning modality knowledge with a proxy. The core innovation lies in the design of a proxy-guided cross-modality feature fusion module, which realizes dual-modality interactions by using lightweight proxy tokens as intermediate representations. Specifically, self-attention is first used to help the proxy tokens learn the global information of each modality; then, the relationship between dual-modality proxy tokens is constructed to capture modality-complementary information while also mitigating the interference of modality discrepancies; and finally, the knowledge in the updated proxy tokens is fed back to each modality through cross-attention to enhance each modality’s features. Additionally, a mixture of knowledge-decoupled experts module is designed to effectively fuse the enhanced features of the two modalities. This module leverages multiple gating networks to assign modality-specific and modality-shared knowledge to separate expert groups for learning, thus highlighting the advantageous features of the different modalities. Extensive experiments on four RGB-infrared datasets demonstrate that our method outperforms existing state-of-the-art methods.
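The abstract’s three-step proxy mechanism maps naturally onto standard attention primitives. Below is a minimal PyTorch sketch of that flow — proxies absorb each modality via attention, interact across modalities, then feed knowledge back — where the module name, token counts, head count, and residual connections are illustrative assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn

class ProxyFusion(nn.Module):
    """Hypothetical proxy-token fusion block; `dim` must be divisible by `heads`."""
    def __init__(self, dim: int, num_proxy: int = 8, heads: int = 4):
        super().__init__()
        # One small set of learnable proxy tokens per modality.
        self.proxy_rgb = nn.Parameter(torch.randn(1, num_proxy, dim))
        self.proxy_ir = nn.Parameter(torch.randn(1, num_proxy, dim))
        self.absorb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.interact = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feedback = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens: torch.Tensor, ir_tokens: torch.Tensor):
        b = rgb_tokens.size(0)
        p_rgb = self.proxy_rgb.expand(b, -1, -1)
        p_ir = self.proxy_ir.expand(b, -1, -1)
        # Step 1: proxies attend to their own modality's tokens (global information).
        p_rgb, _ = self.absorb(p_rgb, rgb_tokens, rgb_tokens)
        p_ir, _ = self.absorb(p_ir, ir_tokens, ir_tokens)
        # Step 2: dual-modality proxy interaction captures complementary cues
        # without dense token-to-token attention across modalities.
        p_rgb2, _ = self.interact(p_rgb, p_ir, p_ir)
        p_ir2, _ = self.interact(p_ir, p_rgb, p_rgb)
        # Step 3: updated proxies feed knowledge back into each modality's tokens.
        rgb_up, _ = self.feedback(rgb_tokens, p_rgb2, p_rgb2)
        ir_up, _ = self.feedback(ir_tokens, p_ir2, p_ir2)
        return rgb_tokens + rgb_up, ir_tokens + ir_up
```

Because the two modalities exchange information only through a handful of proxy tokens, attention cost scales with the number of proxies rather than the full token count — which is what makes the proxies “lightweight”.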
Citations: 0
Improving episodic few-shot visual question answering via spatial and frequency domain dual-calibration
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.patcog.2026.113165
Jing Zhang, Yifan Wei, Yunzuo Hu, Zhe Wang
Considering that frequency-domain information in an image can compensate for the deficiency of spatial-domain information in representing global structure, we propose a novel Dual-domain Feature and Distribution dual-calibration Network (DFDN) for episodic few-shot visual question answering, achieving a deep and comprehensive understanding of image content and cross-modal reasoning. In DFDN, spatial and frequency-domain information are mutually calibrated to achieve complementary information advantages, and more effective cross-modal reasoning is achieved through dual calibration of both features and distributions. A dual-domain feature calibration module is proposed, which employs mutual mapping and dynamic masking techniques to extract task-relevant features and calibrate dual-domain information at the feature level. Meanwhile, a new dual-domain mutual distillation distribution calibration module is proposed to achieve mutual calibration of data distributions across spatial and frequency domains, further improving the cross-modal reasoning ability of DFDN. Experimental results across four public benchmark datasets demonstrate that DFDN achieves excellent performance and outperforms current state-of-the-art methods on episodic few-shot visual question answering. Code is available (via an anonymous account) at https://github.com/Harold1810/DFDN.
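As background for the dual-domain idea, the sketch below shows one common way to pair a spatial convolution branch with a frequency branch computed from the FFT magnitude spectrum; the mutual-mapping and dynamic-masking calibration that DFDN actually performs is not reproduced here, and all layer choices are assumptions.

```python
import torch
import torch.nn as nn

class DualDomainBlock(nn.Module):
    """Hypothetical block combining spatial features with FFT-magnitude features."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.freq = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        s = self.spatial(x)
        # fft2 preserves the (H, W) size; the magnitude spectrum captures global
        # structure (periodicity, layout) that local convolutions can miss.
        amp = torch.abs(torch.fft.fft2(x, norm="ortho"))
        f = self.freq(amp)
        return self.fuse(torch.cat([s, f], dim=1))
```

The appeal of the frequency branch is that every spectral coefficient depends on all pixels, so even a 1×1 convolution on the spectrum has a global receptive field over the original image.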
Citations: 0
DEER: Diffusion-empowered efficient restoration for underwater images
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-31 | DOI: 10.1016/j.patcog.2026.113167
Cheng Wang , Junyang Chen , Weichen Zhao , Wanhui Gao , Ge Jiao
Underwater images suffer from severe degradation caused by light attenuation and noise interference. Existing methods struggle to balance performance and inference efficiency. To address this problem, we propose a Diffusion-Empowered Efficient Restoration (DEER) framework comprising an enhancement network and a restoration network. The former incorporates two key modules: the High-Frequency Detail Enhancement (HFDE) module introduces a min-pooling channel to recover dark details suppressed by medium absorption, complementing max-pooling and average-pooling to capture comprehensive physical edge characteristics; meanwhile, the Multi-Scale Fusion (MSF) module utilizes multi-scale analysis to address the spatial non-uniformity of color casts. Collectively, they provide rich frequency-domain priors for the subsequent restoration. Regarding the latter, unlike previous approaches that directly utilized a diffusion model for generation, we employ it as a diffusion-guided learned prior. By providing dynamic gradient guidance during the training phase, the lightweight network learns the natural image manifold while avoiding the smoothing artifacts induced by pixel-wise mimicry. During inference, the diffusion model is discarded, allowing the lightweight restoration model to achieve accelerated inference. Experimental results demonstrate that DEER outperforms state-of-the-art approaches, achieving improvements of 0.7%–5.6% across nearly all metrics on the LSUI and UIEB datasets. Our code is available here.
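The min-pooling idea in HFDE is easy to illustrate: alongside the familiar average- and max-pooled channel descriptors, a min-pooled descriptor is sensitive to the darkest responses that medium absorption suppresses. The sketch below is a generic tri-pooling channel-attention layer under assumed sizes, not the paper’s HFDE module.

```python
import torch
import torch.nn as nn

class TriPoolChannelAttention(nn.Module):
    """Hypothetical channel attention from avg-, max-, and min-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        flat = x.flatten(2)                        # (B, C, H*W)
        avg_d = self.mlp(flat.mean(dim=2))         # average pooling: overall response
        max_d = self.mlp(flat.amax(dim=2))         # max pooling: bright edges
        min_d = self.mlp(flat.amin(dim=2))         # min pooling: dark details
        weights = torch.sigmoid(avg_d + max_d + min_d)
        return x * weights.unsqueeze(-1).unsqueeze(-1)
```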
Citations: 0
Adversarial training with attention-guided feature fusion and inclusive contrastive learning
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-02 | DOI: 10.1016/j.patcog.2026.113220
Xiao Sun , Song Wang , Jucheng Yang
Numerous studies show that deep neural networks (DNNs) are vulnerable to adversarial patch attacks. Many existing adversarial defense strategies present two major drawbacks. First, they cannot handle adversarial patches at random locations and of random sizes. Second, they attempt to improve defense performance by integrating information from clean and adversarial examples, but this is susceptible to salient and camouflaged features, resulting in weakened generalization and natural accuracy. To address these issues, in this paper we propose an adversarial training method equipped with a novel mechanism of attention-guided feature fusion (AttFus for short) and inclusive contrastive learning (ICL). By generating attention difference maps based on clean and adversarial examples and performing piecewise fusion of features, AttFus enables the DNN model to refocus on key areas of the image, overcoming the negative effect of adversarial patches and thereby achieving highly accurate image classification. Moreover, the proposed ICL, using both clean and adversarial examples as positives, allows for a smooth transition between similar examples in the representation space and better discriminates between signal and noise, thus heightening the model’s natural accuracy and resistance to adversarial attacks. Compared with state-of-the-art adversarial defense methods on benchmark datasets, the proposed method demonstrates competitive performance. When faced with cross-attack, cross-model, and cross-dataset challenges, the proposed method demonstrates excellent robustness and generalization. Our code is available at https://github.com/SunX81/AT-with-AttFus-and-ICL.
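The ICL idea — treating a clean example and its adversarial counterpart as positives so that the representation space transitions smoothly between them — can be sketched with a standard InfoNCE-style loss over paired embeddings. The formulation below is a generic stand-in under that assumption, not the paper’s exact loss.

```python
import torch
import torch.nn.functional as F

def clean_adv_contrastive(z_clean: torch.Tensor, z_adv: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """z_clean, z_adv: (B, D) embeddings of the same images, clean vs. attacked.
    Each sample's positive is its counterpart in the other view; all remaining
    samples in the batch act as negatives."""
    z = F.normalize(torch.cat([z_clean, z_adv], dim=0), dim=1)   # (2B, D)
    sim = z @ z.t() / temperature                                # cosine similarities
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float("-inf"))                    # exclude self-pairs
    b = z_clean.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```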
Citations: 0
DFWe: Efficient knowledge distillation of fine-tuned Whisper encoder for speech emotion recognition
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-27 | DOI: 10.1016/j.patcog.2026.113161
Yujian Ma , Xianquan Jiang , Jinqiu Sang , Ruizhe Li
Despite the strong acoustic modeling capabilities of large pre-trained speech models such as Whisper, their direct application to speech emotion recognition (SER) is hindered by task mismatch, high computational cost, and limited retention of affective cues. To address these challenges, we propose DFWe (Distillation of Fine-Tuned Whisper Encoder), a two-stage knowledge distillation (KD) framework combining parameter-efficient adaptation with multi-objective supervision. In Stage 1, a subset of upper layers in the Whisper encoder is fine-tuned along with a lightweight projector and classification head, enabling the model to preserve general acoustic knowledge while adapting to emotion-specific features. In Stage 2, knowledge is distilled into a compact Whisper-Small student using a hybrid loss that integrates hard-label cross-entropy, confidence-aware soft-label KL divergence, and intermediate feature alignment via Centered Kernel Alignment (CKA). On the IEMOCAP dataset with 10-fold cross-validation (CV), DFWe achieves a 7.21× reduction in model size while retaining 99.99% of the teacher’s unweighted average recall (UAR) and reaching 79.82% weighted average recall (WAR) and 81.32% UAR, representing state-of-the-art performance among knowledge-distillation-based SER methods. Ablation studies highlight the benefits of adaptive temperature scaling, multi-level supervision, and targeted augmentation in improving both accuracy and robustness. Case analyses further show that DFWe yields more confident and stable predictions in emotionally ambiguous scenarios, underscoring its practical effectiveness. Overall, DFWe offers a scalable, generalizable solution for deploying SER systems in resource-constrained environments.
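A minimal sketch of the Stage-2 objective as described — hard-label cross-entropy, temperature-scaled soft-label KL, and a feature-alignment term based on linear CKA — is shown below. The confidence-aware weighting of soft labels and the adaptive temperature are omitted, and all loss weights are assumptions, so this illustrates the loss structure rather than DFWe itself.

```python
import torch
import torch.nn.functional as F

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA similarity between feature matrices x: (N, Dx) and y: (N, Dy)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.t() @ x).norm(p="fro") ** 2
    return hsic / ((x.t() @ x).norm(p="fro") * (y.t() @ y).norm(p="fro") + 1e-8)

def kd_loss(student_logits, teacher_logits, labels, s_feat, t_feat,
            T: float = 4.0, w_ce: float = 1.0, w_kl: float = 1.0, w_cka: float = 1.0):
    ce = F.cross_entropy(student_logits, labels)                 # hard labels
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),      # soft labels
                  F.log_softmax(teacher_logits / T, dim=1),
                  reduction="batchmean", log_target=True) * (T * T)
    cka = 1.0 - linear_cka(s_feat, t_feat)                       # feature alignment
    return w_ce * ce + w_kl * kl + w_cka * cka
```

The T² factor keeps the soft-label gradient magnitude comparable across temperatures, a standard convention in Hinton-style distillation.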
Citations: 0
Semi-MedSAM: Adapting SAM-assisted semi-supervised multi-modality learning for medical endoscopic image segmentation
IF 7.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-31 | DOI: 10.1016/j.patcog.2026.113206
Junhao Li , Yun Li , Junhao Wu , Chaojie Ji , Zhijie Chen , Wenbin Lei , Ruxin Wang
Accurate recognition of lesions in endoscopic images is essential for effective diagnosis and treatment. Multi-modal learning effectively utilizes complementary clues derived from multiple modalities, which can promote performance in lesion area detection. However, acquiring large numbers of annotated paired images for multi-modal learning is time-consuming and costly. The Segment Anything Model (SAM) is a powerful vision foundation model that excels in natural image segmentation, but it suffers performance degradation in endoscopic scenes due to a lack of medical-specific knowledge. Besides, the simple structure of the SAM decoder fails to effectively capture fine-grained details among complex lesion structures and low-contrast tissue organs in endoscopic images. To utilize the powerful feature extraction capability of the foundation model and address the scarcity of annotations in medical imaging, we present a novel prompt-free SAM-assisted framework, Semi-MedSAM, for semi-supervised multi-modal learning. The proposed Semi-MedSAM integrates an effective SAM-based backbone, comprising a designed multi-expert-instructed encoder as well as a hierarchical prototypical decoder, into a prompt-free semi-supervised framework. Extensive experiments on three multi-modal endoscopic datasets demonstrate the superior segmentation performance of our Semi-MedSAM.
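The semi-supervised side of such a framework typically combines a supervised segmentation loss on labeled images with a confidence-filtered pseudo-label loss on unlabeled ones. The sketch below shows only that generic pattern; the multi-expert-instructed encoder, the hierarchical prototypical decoder, and the paper’s actual consistency scheme are not reproduced, and the thresholds are assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, conf: float = 0.9):
    """y_lab: float binary masks of shape (B, 1, H, W), matching model outputs."""
    sup_logits = model(x_lab)
    sup = F.binary_cross_entropy_with_logits(sup_logits, y_lab) + dice_loss(sup_logits, y_lab)
    with torch.no_grad():                              # pseudo-labels from a frozen pass
        p = torch.sigmoid(model(x_unlab))
        pseudo = (p > 0.5).float()
        keep = ((p > conf) | (p < 1 - conf)).float()   # only confident pixels
    unsup_logits = model(x_unlab)                      # in practice, a perturbed view
    per_pix = F.binary_cross_entropy_with_logits(unsup_logits, pseudo, reduction="none")
    return sup + (per_pix * keep).mean()
```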
Citations: 0
Improve geometric accuracy in robotic forming
IF 11.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-08-01 | Epub Date: 2026-01-12 | DOI: 10.1016/j.rcim.2026.103236
Yanrong Zhang , Zeran Hou , Fan Yang , Bernd Kuhlenkötter , Antonio Sánchez Egea , Junying Min
Robotic forming is a die-less manufacturing process that utilizes single or multiple industrial robots to incrementally deform metal sheets into complex, customized parts. Owing to its high flexibility, it has become an important research direction in the field of intelligent manufacturing. However, issues such as complex local deformation mechanisms and insufficient support during robotic forming often lead to low geometric accuracy in fabricated parts, which is one of the most critical factors hindering the broader application of robotic forming in high-precision manufacturing. In this context, this paper presents a comprehensive review of methods for improving geometric accuracy in robotic forming. First, to address the core issue of geometric accuracy, the categories and origins of geometric errors in robotic forming are systematically examined. Subsequently, typical precision control strategies and related research are categorized into three aspects: process innovation, optimization of process parameters, and tool path planning and compensation. Current challenges associated with each strategy are also summarized. Finally, potential future research directions are discussed, incorporating advanced technologies such as multi-robot forming, artificial intelligence, multi-source data fusion, and digital twins.
Citations: 0
CATCH: Causal attention enhanced meta-path semantic fusion for robust hyperbolic heterogeneous graph embedding
IF 15.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-05 | DOI: 10.1016/j.inffus.2026.104206
Bojia Liu , Conghui Zheng , Li Pan
Heterogeneous graph representation learning seeks to capture the complex structural and semantic properties in heterogeneous graphs. The integration of hyperbolic space, which is well-suited to modeling the intrinsic degree power-law distribution of graphs, has facilitated significant advancements in this area. Recent methods leverage hyperbolic attention mechanisms to fuse semantic information within metapath-induced subgraphs. Despite this progress, a major limitation remains: these methods leverage attention for information aggregation but fail to model the causal relationship between semantic fusion and downstream task performance, leading to spurious semantic associations that reduce robustness to noise and impair cross-task generalization. To address this challenge, we propose a Causal ATtention enhanCed Hyperbolic Heterogeneous Graph Neural Network (CATCH), intending to achieve sufficient semantic information fusion. To the best of our knowledge, CATCH is the first to integrate hyperbolic space with causal inference for heterogeneous graph representations, directly targeting spurious semantic correlations at the source. Specifically, CATCH explicitly encodes the Euclidean node attributes of different types into a shared semantic hyperbolic space. To capture the underlying semantics, context subgraphs based on one-order and high-order metapaths are constructed to facilitate hyperbolic attention-based intra-level and inter-level information aggregation, thus forming comprehensive representations. Finally, a causal attention enhancement mechanism is implemented with direct supervision on attention learning, leveraging counterfactual causal inference to generate counterfactual representations for computing direct causal effects. By jointly optimizing a task-specific objective alongside a causal loss, CATCH promotes more faithful semantic encoding, leading to improved robustness and generalization. Extensive experiments on four real-world datasets validate the superior performance of CATCH across multiple tasks. The implementation is available at https://github.com/Crystal-LiuBojia/CATCH.
[Figure: Recommendation performance on Amazon-CD and Amazon-Book.]
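For readers unfamiliar with the hyperbolic machinery, the entry point for embeddings like CATCH’s shared semantic space is the exponential map, which lifts Euclidean features onto a hyperbolic manifold; distances on the manifold then drive attention or scoring. The sketch below uses the standard Poincaré-ball formulas with an assumed curvature; the paper’s specific hyperbolic model and attention mechanism are not reproduced.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of a Poincare ball with curvature -c:
    maps a Euclidean tangent vector v onto the ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x: torch.Tensor, y: torch.Tensor, c: float = 1.0,
                  eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance between points x, y inside the Poincare ball."""
    sqrt_c = c ** 0.5
    diff2 = (x - y).pow(2).sum(dim=-1)
    denom = ((1 - c * x.pow(2).sum(dim=-1)) * (1 - c * y.pow(2).sum(dim=-1))).clamp_min(eps)
    arg = 1 + 2 * c * diff2 / denom
    return torch.acosh(arg.clamp_min(1 + eps)) / sqrt_c
```

Because volume in hyperbolic space grows exponentially with radius, tree-like (power-law) graph structure embeds with far less distortion than in Euclidean space — the property the abstract appeals to.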
Citations: 0
VL-GRiP3: A hierarchical pipeline leveraging vision-language models for autonomous robotic 3D grasping
IF 11.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-08-01 | Epub Date: 2026-01-27 | DOI: 10.1016/j.rcim.2026.103244
Mirco Polonara , Xingyu Yang , Luca Carbonari , Xuping Zhang
Autonomous grasping has long been a central topic in robotics, yet deployment in small and medium-sized enterprises (SMEs) is still hindered by low-level robot programming and the lack of natural language interaction. Recent Vision-Language-Action models (VLAs) allow robots to interpret natural language commands for intuitive interaction and control, but they still exhibit output uncertainty and are not yet well suited to directly generating reliable, precise actions in safety-critical industrial contexts. To address this gap, we present VL-GRiP3, a hierarchical Vision-Language model (VLM)-enabled pipeline for autonomous 3D robotic grasping that bridges natural language interaction and accurate, reliable manipulation in SME settings. The framework decomposes language understanding, perception, and action planning in a transparent modular architecture, improving flexibility and interpretability. Within this architecture, a single VLM backbone handles natural language interpretation, target perception, and high-level action planning. CAD-augmented point cloud registration then mitigates occlusions in single RGB-D views while keeping hardware cost low, and an M2T2-based grasp planner predicts accurate 3D grasp poses that explicitly account for complex object geometry from the augmented point cloud, enabling reliable manipulation of irregular industrial parts. Experiments show that our fine-tuned VLM modules achieve segmentation performance comparable to YOLOv8n, and VL-GRiP3 attains a 94.67% success rate over 150 randomized grasping trials. A comparative evaluation against state-of-the-art end-to-end VLAs further indicates that our modular, CAD-augmented design with explicit 3D grasp pose prediction yields more reliable and controllable behavior for SME manufacturing applications.
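Since the abstract emphasizes a transparent modular decomposition (language understanding → perception → CAD-augmented registration → grasp planning), one way to express that contract is a pipeline of typed stages. The skeleton below is purely illustrative: every stage is a placeholder callable, and none of it reflects the actual VLM or M2T2 interfaces.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class GraspPipeline:
    """Hypothetical stage contract for a hierarchical grasping pipeline."""
    parse_command: Callable[[str], str]                       # text -> target label
    segment_target: Callable[[np.ndarray, str], np.ndarray]   # image, label -> (H, W) bool mask
    register_cad: Callable[[np.ndarray], np.ndarray]          # partial cloud -> densified cloud
    plan_grasp: Callable[[np.ndarray], np.ndarray]            # cloud -> 4x4 grasp pose

    def run(self, command: str, rgb: np.ndarray, cloud: np.ndarray) -> np.ndarray:
        label = self.parse_command(command)            # language understanding
        mask = self.segment_target(rgb, label)         # perception
        target = cloud[mask.reshape(-1)]               # cloud: (H*W, 3), pixel-aligned
        dense = self.register_cad(target)              # mitigate single-view occlusion
        return self.plan_grasp(dense)                  # explicit 3D grasp pose
```

Keeping each stage behind a plain function signature is what makes such an architecture swappable and auditable — a property that matters in the safety-critical SME settings the abstract targets.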
Citations: 0
Physics-informed prediction of modal parameters and stability analysis for robotic mirror milling
IF 11.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-08-01 | Epub Date: 2026-01-20 | DOI: 10.1016/j.rcim.2026.103237
Kun Chen , Haonan Ma , Chenghao Huang , Sheng Xu , Peng Xu , Bing Li
Mirror milling technology is widely used in the aerospace industry for manufacturing thin-walled parts, yet existing machine tool-based mirror milling systems are costly and inflexible. A robotic mirror milling system is a cost-effective and flexible alternative to machine tools. However, the modal parameters of the mirror-arranged robots vary with their postures, and the robots’ low stiffness, coupled with the flexibility of thin-walled parts, leads to unstable milling processes. To address these challenges, a physics-informed framework is proposed for modal parameter measurement, prediction, and optimization, enabling analysis of robotic mirror milling stability. First, the robot’s vibration characteristics are examined through transfer matrices of dynamic models, while the robot’s modal parameters are collected at uniform configurations in joint space. Using these characteristics and measurements as physical constraints and training sets, a modified multi-task Gaussian process regression is developed to predict the modal parameters, with the results further optimized through Bayesian derivation. This two-step process forms the physics-informed modal parameter prediction method. Then, the obtained modal parameters are utilized to construct the robotic mirror milling system’s dynamic model, from which its milling stability can be analyzed. Simulations and experiments are conducted to validate these models and algorithms.
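As a reference point for the regression step, the sketch below fits a plain single-output Gaussian process from joint configurations to one modal parameter (e.g., a natural frequency). The paper’s method is a modified multi-task GP with physics-informed constraints and Bayesian refinement of the predictions; none of that is reproduced here, and the kernel, hyperparameters, and toy data are assumptions.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, length: float = 200.0, var: float = 1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X_tr, y_tr, X_te, noise: float = 1e-4):
    """Posterior mean and variance of a zero-mean GP at test inputs X_te."""
    K = rbf_kernel(X_tr, X_tr) + noise * np.eye(len(X_tr))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))   # K^{-1} y
    Ks = rbf_kernel(X_te, X_tr)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(rbf_kernel(X_te, X_te)) - (v ** 2).sum(axis=0)
    return mean, var

# Toy usage: 6 joint angles (deg) -> first natural frequency (Hz); numbers are made up.
rng = np.random.default_rng(0)
X_train = rng.uniform(-90, 90, size=(50, 6))
y_train = 20 + 0.01 * X_train.sum(axis=1) + 0.1 * rng.standard_normal(50)
y_mean = y_train.mean()                       # center targets for the zero-mean GP
mu, var = gp_predict(X_train, y_train - y_mean, np.zeros((1, 6)))
mu = mu + y_mean
```

The predictive variance is the useful byproduct here: configurations where the GP is uncertain indicate where additional modal tests would be most informative.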
Citations: 0