
Latest Literature in Computer Science

A digital-twin framework for assembling of cylindrical parts
IF 11.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-08-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.rcim.2026.103240
Yimin Song, Chen Li, Binbin Lian, Qi Li, Tao Sun
Digital twin (DT) has been recognized as a promising technology for enhanced planning, monitoring and control of automatic assembly, offering efficiency, adaptability and flexibility. While most DT-based assembly work focuses on electronic products, limited attention has been paid to the assembly of heavy, large-size products with tight tolerances. This paper presents a DT model that enables prediction and real-time adjustment for the intelligent assembly of cylindrical parts. Vision-guided feature fitting and coordinate-frame construction are presented, and assembly targets are defined in terms of eight pin-hole sets and the contacting planes, which improves the assembly success rate. To ensure efficient and robust robotic assembly, we propose a prediction model based on the DT system, in which unknown errors and the uncertainty of the physical space are modeled by small displacement torsor (SDT) theory and Monte Carlo simulation (MCS). Assembly planning and execution are adjusted efficiently under the guidance of the prediction results. A middle point is set in the robot plan so that only pure translation remains in the docking phase, and a real-time adjustment method is proposed to assemble the cylindrical parts accurately. Simulations and experiments verify the effectiveness and feasibility of the proposed DT-based assembly method: the predictions agree with the actual assembly outcomes, the assembly strategy achieves a 97.31% success rate, and with real-time adjustment the majority of axis mismatches remain below 0.1 mm/0.05 deg, with plane non-contact below 0.05 mm.
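As a rough illustration of the prediction step described above, the following Python/NumPy sketch estimates an assembly success probability by Monte Carlo sampling over small-displacement-torsor (SDT) pose errors. All noise levels and tolerances are hypothetical stand-ins, not the paper's parameters.

# Hedged sketch: Monte Carlo estimate of assembly success probability from
# SDT-style pose errors; every numeric value here is an assumed illustration.
import numpy as np

rng = np.random.default_rng(0)

def simulate_success_rate(n_samples=100_000,
                          trans_sigma=0.05,  # mm, assumed translational error std
                          rot_sigma=0.02,    # deg, assumed rotational error std
                          trans_tol=0.1,     # mm, assumed axis-mismatch tolerance
                          rot_tol=0.05):     # deg, assumed angular tolerance
    # An SDT collects six small screw components: 3 translations, 3 rotations.
    trans = rng.normal(0.0, trans_sigma, size=(n_samples, 3))
    rot = rng.normal(0.0, rot_sigma, size=(n_samples, 3))
    axis_offset = np.linalg.norm(trans[:, :2], axis=1)  # radial mismatch (mm)
    tilt = np.linalg.norm(rot[:, :2], axis=1)           # axis tilt (deg)
    ok = (axis_offset < trans_tol) & (tilt < rot_tol)
    return ok.mean()

print(f"predicted success rate: {simulate_success_rate():.2%}")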
Citations: 0
FedFusionNet: Advancing oral cancer recurrence prediction through federated fusion modeling
IF 15.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104205
Al Rafi Aurnob, Sharia Arfin Tanim, Tahmid Enam Shrestha, M.F. Mridha, Durjoy Mistry
Oral cancer is a considerable global medical problem that calls for new technologies offering reliable, advanced care. This study introduces FedFusionNet, a fusion-centric model developed to advance early oral cancer diagnosis while preserving data privacy. The primary objective is a model trained with federated learning (FL) across diverse healthcare facilities worldwide without compromising patient data confidentiality. The model fuses features from the ResNeXt101 32X8D and InceptionV3 backbones through single-level fusion via feature concatenation, which enhances its effectiveness and stability. Specifically, the federated averaging (FedAvg) technique enables collaborative model training across multiple hospitals while safeguarding sensitive patient information, so each participating hospital can contribute to model development without sharing raw data. The proposed model was trained on a dataset of 10,002 images containing both healthy and cancerous oral tissue, with rigorous training and evaluation under both Independent and Identically Distributed (IID) and Independent and Non-Identically Distributed (Non-IID) settings. FedFusionNet outperformed pre-trained and several custom models for oral cancer diagnosis. This scalable and secure framework has significant implications for healthcare analytics. The work is a proof-of-concept that uses publicly available data to establish the technical feasibility of the FedFusionNet framework; future deployment in real collaborative environments would demonstrate its security-by-design capabilities across hospitals, where patient data confidentiality is a priority.
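As a concrete illustration of the FedAvg aggregation step named above, here is a minimal NumPy sketch of the server-side weighted averaging; the two-client setup and the layer name "fc" are hypothetical.

# Hedged sketch of federated averaging (FedAvg): the server combines client
# model weights as a data-size-weighted mean; client counts are hypothetical.
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list of dicts {param_name: np.ndarray}, one per hospital.
    client_sizes: number of local training samples per client."""
    total = float(sum(client_sizes))
    avg = {}
    for name in client_weights[0]:
        avg[name] = sum(w[name] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
    return avg

# Toy round with two simulated clients and one weight matrix each.
w1 = {"fc": np.ones((2, 2))}
w2 = {"fc": np.zeros((2, 2))}
print(fedavg([w1, w2], client_sizes=[300, 100])["fc"])  # -> 0.75 everywhere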
Citations: 0
Lightweight music recommendation via multi-physiological feature fusion
IF 15.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104211
Xiaoying Huang, Haonan Cheng, Sanyi Zhang, Xiaoxuan Guo, Long Ye
Music recommendation, as a core task of smart speakers, has an important impact on user experience in terms of both recommendation speed and accuracy. However, existing music recommendation algorithms struggle to generate adaptive playlists tailored to the user's current state, primarily because high recommendation accuracy typically requires substantial computing overhead. In addition, most existing algorithms ignore smooth transitions between tracks, which further hurts recommendation quality. To tackle these issues, we propose a novel Lightweight Music Recommendation (LMR) method based on Multi-Physiological feature Fusion (MPF) that can be deployed effectively in embedded smart-speaker systems. Specifically, the proposed LMR method contains two core modules: an MPF-based music mapping module and a playlist recommendation module based on global-local similarity computation (GLSC). The lightweight MPF-based music mapping model addresses the track-user adaptation problem, while the GLSC-based playlist recommendation algorithm addresses incoherence and unsmooth transitions within track sequences. Experiments demonstrate that the proposed method produces playlist recommendations more consistent with user contextual information, enables smoother transitions between tracks, and ensures long-term content consistency across the entire sequence. Compared with other methods, our approach achieves a favorable balance between accuracy and efficiency.
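The abstract does not specify the GLSC formula, so purely as an illustrative assumption, a global-local similarity score for ranking candidate tracks might look like the NumPy sketch below, where the session-mean "global" term, the last-track "local" term, and the 0.5/0.5 weighting are all invented for the example.

# Hedged sketch of a global-local similarity score for playlist ranking:
# "local" similarity to the last played track keeps transitions smooth, while
# "global" similarity to the whole session keeps long-term consistency.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def glsc_score(candidate, session_tracks, alpha=0.5):  # alpha is assumed
    global_sim = cosine(candidate, np.mean(session_tracks, axis=0))
    local_sim = cosine(candidate, session_tracks[-1])
    return alpha * global_sim + (1 - alpha) * local_sim

rng = np.random.default_rng(1)
session = rng.normal(size=(5, 16))      # embeddings of tracks played so far
candidates = rng.normal(size=(8, 16))   # embeddings of candidate tracks
ranked = sorted(range(8), key=lambda i: glsc_score(candidates[i], session),
                reverse=True)
print("recommended order:", ranked)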
Citations: 0
Data-efficient generalization for zero-shot composed image retrieval
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-28 | DOI: 10.1016/j.patcog.2026.113187
Zining Chen, Zhicheng Zhao, Fei Su, Shijian Lu
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image from a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm, employing a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization because of the modality discrepancy and the distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework with two novel designs: a Textual Supplement (TS) module and a Semantic Sample Pool (SSP) module. The TS module exploits compositional textual semantics during training, enriching the pseudo-word token with more linguistic semantics and thus effectively mitigating the modality discrepancy. The SSP module exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating overfitting caused by the redundancy of large-scale image-text data. Extensive experiments on four ZS-CIR benchmarks show that DeG outperforms state-of-the-art (SOTA) methods with far less training data while saving substantial training and inference time in practical use.
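As background for the pseudo-word-token idea mentioned above, here is a minimal sketch of the general mapping-network paradigm (not the paper's DeG modules): project an image embedding into the word-embedding space and splice the result into a caption token sequence. All dimensions and the placeholder position are assumptions.

# Hedged sketch of the pseudo-word token in ZS-CIR: a small mapping network
# projects a CLIP-style image embedding into word-embedding space, and the
# resulting "pseudo word" is spliced into a template such as
# "a photo of [*] that <modification>". Sizes are assumed illustrations.
import numpy as np

rng = np.random.default_rng(0)
d_img, d_tok = 512, 768                       # assumed embedding sizes

W = rng.normal(0, 0.02, size=(d_img, d_tok))  # the (trainable) mapping network

def to_pseudo_token(image_embedding):
    return image_embedding @ W                # -> one token in text space

image_emb = rng.normal(size=d_img)
template_tokens = rng.normal(size=(6, d_tok))  # stand-in caption tokens
pseudo = to_pseudo_token(image_emb)
# Splice the pseudo-word token in at the placeholder position (index 3, assumed).
composed = np.vstack([template_tokens[:3], pseudo, template_tokens[4:]])
print(composed.shape)                          # (6, 768) token sequence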
Citations: 0
Prioritized scanning: Combining spatial information multiple instance learning for computational pathology
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113151
Yuqi Zhang, Jiakai Wang, Baoyu Liang, Yuancheng Yang, Siyang Wu, Chao Tong
Multiple instance learning (MIL) has emerged as a reliable paradigm that has propelled the integration of computational pathology (CPath) into clinical histopathology. Despite significant advances, however, current MIL approaches still suffer from inadequate spatial-information representation caused by the disordered patching of the original whole slide images (WSIs). To address this limitation, we first demonstrate the importance of prioritized scanning within structured state space models (SSMs). We then introduce a MIL framework that incorporates spatial information, termed Prioritized Scanning MIL (PSMIL). PSMIL comprises two branches and a fusion block. The first, the spatial branch, injects potential spatial information into the patch sequence using the original 2D positions and employs an SSM to model the spatial features of the WSI. The second, the cross-spatial branch, combines a significance scoring block with an SSM to exploit feature relationships among similar instances across spatial locations. Finally, a lightweight feature fusion block integrates the outputs of both branches for more comprehensive feature utilization. Extensive experiments on 5 popular datasets and 3 downstream tasks demonstrate that PSMIL significantly surpasses state-of-the-art MIL methods, with up to 5.26% ACC improvement for cancer sub-typing. Our code is available at https://github.com/YuqiZhang-Buaa/PSMIL.
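To make the "prioritized scanning" idea concrete, here is a toy NumPy sketch: reorder patch features by a significance score before a sequential state update. The linear scorer and the leaky-integrator stand-in for the SSM are assumptions for illustration, not the authors' architecture.

# Hedged sketch of a prioritized scan: whole-slide patch features are
# reordered by a significance score before being fed to a sequence model
# (an SSM in the paper; a simple recurrent stand-in here).
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(100, 64))   # 100 patch features, dim 64
score_w = rng.normal(size=64)          # stand-in significance scorer

scores = patches @ score_w             # one scalar score per patch
order = np.argsort(-scores)            # most significant patches first
scanned = patches[order]               # prioritized sequence

# Toy recurrent state update standing in for the SSM over the ordered sequence.
state = np.zeros(64)
for x in scanned:
    state = 0.9 * state + 0.1 * x
print(state[:4])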
Citations: 0
TranSAC: An unsupervised transferability metric based on task speciality and domain commonality
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.patcog.2026.113137
Qianshan Zhan, Xiao-Jun Zeng, Qian Wang
In transfer learning, a fundamental problem is transferability estimation, in which a metric measures transfer performance without training. Existing metrics face two issues: 1) they require target-domain labels, and 2) they focus only on task speciality while ignoring the equally important domain commonality. To overcome these limitations, we propose TranSAC, a Transferability metric based on task Speciality And domain Commonality that captures both the separation between classes and the similarity between domains. Its main advantages are: 1) unsupervised, 2) fine-tuning free, and 3) applicable to both source-dependent and source-free transfer scenarios. To achieve this, we investigate upper and lower bounds of transfer performance based on fixed representations extracted from the pre-trained model. The theoretical results reveal that unsupervised transfer performance is characterized by entropy-based quantities that naturally reflect task speciality and domain commonality. These insights motivate the design of TranSAC, which integrates both factors to enhance transferability estimation. Extensive experiments are performed across 12 target datasets with 36 pre-trained models, including supervised CNNs, self-supervised CNNs, and ViTs. The results demonstrate the importance of domain commonality and task speciality, making TranSAC superior to state-of-the-art metrics for pre-trained model ranking, target domain ranking, and source domain ranking.
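The abstract indicates entropy-based quantities; as an illustrative assumption (not the exact TranSAC formula), a score of this flavor could combine target prediction entropy (task speciality) with a source-target feature gap (domain commonality), as in this NumPy sketch with an assumed 1:1 weighting.

# Hedged sketch of an unsupervised "speciality + commonality" transfer score.
# Speciality: low entropy of the frozen model's soft predictions on target
# data. Commonality: closeness of source/target feature means.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def transfer_score(target_logits, src_feats, tgt_feats):
    p = softmax(target_logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()       # speciality
    gap = np.linalg.norm(src_feats.mean(0) - tgt_feats.mean(0))  # commonality
    return -entropy - gap    # higher is better; equal weights are assumed

rng = np.random.default_rng(0)
print(transfer_score(rng.normal(size=(200, 10)),
                     rng.normal(size=(200, 32)),
                     rng.normal(0.2, 1.0, size=(200, 32))))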
Citations: 0
Audio-visual perceptual quality measurement via multi-perspective spatio-temporal EEG analysis
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-24 | DOI: 10.1016/j.patcog.2026.113156
Shuzhan Hu, Mingyu Li, Yang Liu, Weiwei Jiang, Bingrui Geng, Wei Zhong, Long Ye
In human-centered communication systems, establishing audio-visual quality assessment methods aligned with human perception is crucial for enhancing multimedia system performance and service quality. However, conventional subjective evaluation methods based on user ratings are susceptible to biases introduced by high-level cognitive processes. To address this limitation, we propose an electroencephalography (EEG) feature fusion approach that establishes correlations between audio-visual distortions and perceptual experience. Specifically, we construct an audio-visual degradation-EEG dataset by recording the neural responses of subjects exposed to progressively degraded stimuli. Leveraging this dataset, we extract event-related potential (ERP) features to quantify variations in the subjects' perception of audio-visual quality, demonstrating the feasibility of EEG-based perceptual experience assessment. Capitalizing on EEG's sensitivity to dynamic multimodal perceptual changes, we develop a multi-perspective feature fusion framework that incorporates a spatio-temporal feature fusion architecture and a diffusion-driven EEG augmentation strategy. The framework extracts experience-related features from single-trial EEG signals and establishes an EEG-based classifier that detects whether a distortion induces an alteration of the perceptual experience. Experimental results confirm that EEG signals effectively reflect perception changes induced by quality degradation, while the proposed model achieves efficient, dynamic detection of perception alterations from single-trial EEG data.
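As a minimal sketch of the classical ERP extraction step mentioned above: epoch the EEG around stimulus onsets, baseline-correct, and average over trials, so stimulus-locked components survive while uncorrelated background activity averages out. Sampling rate, window lengths, and onset times below are assumed toy values.

# Hedged sketch of ERP feature extraction by epoch averaging (NumPy).
import numpy as np

fs = 250                                 # Hz (assumed sampling rate)
rng = np.random.default_rng(0)
eeg = rng.normal(size=(8, 60 * fs))      # 8 channels, 60 s of signal
onsets = np.arange(fs, 55 * fs, 2 * fs)  # assumed stimulus onsets (samples)

def erp(eeg, onsets, pre=int(0.2 * fs), post=int(0.8 * fs)):
    # Epochs time-locked to each onset: 200 ms pre-stimulus, 800 ms post.
    epochs = np.stack([eeg[:, t - pre:t + post] for t in onsets])
    baseline = epochs[:, :, :pre].mean(axis=2, keepdims=True)
    return (epochs - baseline).mean(axis=0)  # channels x time ERP

print(erp(eeg, onsets).shape)            # (8, 250)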
Citations: 0
Toward generalizable robotic assembly: A prior-guided deep reinforcement learning approach with multi-sensor information
IF 11.4 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-08-01 | Epub Date: 2026-01-22 | DOI: 10.1016/j.rcim.2026.103242
Zilu Zhu, Yongkui Liu, Qianji Wang, Zinan Wang, Lihui Wang, Sichao Liu, Bin Zi, Lin Zhang
The rise of personalized manufacturing presents significant challenges for robotic assembly. While learning-based methods offer promising solutions, they often suffer from low training efficiency and poor generalization. To address these limitations, this paper proposes an efficient prior-guided (PG) deep reinforcement learning (DRL) approach for generalizable robotic assembly using multi-sensor information. First, a phased multi-sensor information fusion method is introduced. Then, a visual feature extraction method combining MobileNetV3-Lite with conventional digital image processing, together with a rule-based force feature extraction method, is designed to extract lower-dimensional features as prior-guided knowledge. On this basis, a Soft Actor-Critic (SAC) algorithm that integrates a Gated Recurrent Unit (GRU) network architecture with PG is proposed, enabling efficient assembly skill learning. Simulations and physical experiments are conducted for three typical assembly skills: search, alignment, and insertion. The results indicate that, compared with the baseline SAC algorithm, our feature extraction method reduces the visual feature dimensionality by 93.75% and provides accurate prior-guided knowledge for DRL. The proposed assembly skill learning algorithm achieves a 30.16% reduction in average training time and a 16.82% decrease in average completion steps. Furthermore, all learned skills transfer rapidly across different objects, and all assembly tasks are completed efficiently and compliantly with an average success rate of 96.86%.
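As a sketch of the recurrent component named above (not the authors' network), here is a single GRU cell in NumPy that summarizes a history of fused multi-sensor features into a hidden state a policy head could consume; all sizes and weights are assumptions.

# Hedged sketch: one GRU step over fused multi-sensor features.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 16, 32                       # assumed fused-feature and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on [input, hidden] concatenations.
Wz, Wr, Wh = (rng.normal(0, 0.1, size=(d_in + d_h, d_h)) for _ in range(3))

def gru_step(x, h):
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                 # update gate
    r = sigmoid(xh @ Wr)                 # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
    return (1 - z) * h + z * h_tilde

h = np.zeros(d_h)
for t in range(10):                      # 10 steps of simulated sensor features
    h = gru_step(rng.normal(size=d_in), h)
print(h[:4])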
Citations: 0
One-step multi-view graph clustering via bottom-up structural learning
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.patcog.2026.113175
Wenzhe Liu, Li Jiang, Huibing Wang, Yong Zhang
In recent years, tensor-based methods have seen considerable success in multi-view clustering. However, current approaches have several limitations: 1) insufficient exploration of the underlying similarity information (i.e., the latent representation); 2) insufficient exploration of the higher-order structural information both between and within views; and 3) treating clustering learning independently of tensor learning and the overall learning framework. To address these issues, we propose a unified framework called Bottom-up Structural Exploration for One-step Multi-view Graph Clustering (BSE_OMGC). Specifically, we first employ an anchor strategy to build similarity graphs, reducing the complexity of graph learning. To deeply represent the underlying similarity information of the data and mitigate the influence of noise on similar structures in the original space, BSE_OMGC adaptively separates a noise matrix from the similarity graphs to learn high-quality enhanced graphs. Subsequently, from the bottom up, the enhanced graphs serve as the foundation for constructing high-order tensors. We rotate the constructed tensors and apply the t-TNN to preserve their low-rank properties and better capture higher-order structural information both between and within views. Finally, we introduce a graph partitioning technique based on symmetric non-negative matrix factorization, which learns non-negative embeddings during dynamic optimization to reveal the clustering results, unifying clustering learning within the entire learning framework. Extensive experiments on multiple real-world multi-view datasets, along with comparisons to state-of-the-art methods, demonstrate the effectiveness and robustness of the proposed approach.
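The t-TNN has a standard t-SVD-based definition, sketched below in NumPy: FFT along the third mode, then the (n3-normalized) sum of singular values of each frontal slice in the Fourier domain. The tensor sizes are arbitrary illustrations, not the paper's setup.

# Hedged sketch of the tensor nuclear norm (t-TNN) used as a low-rank
# surrogate, following the standard t-SVD definition.
import numpy as np

def tensor_nuclear_norm(T):
    n3 = T.shape[2]
    Tf = np.fft.fft(T, axis=2)        # frontal slices in the Fourier domain
    total = 0.0
    for k in range(n3):
        s = np.linalg.svd(Tf[:, :, k], compute_uv=False)
        total += s.sum()
    return total / n3

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 20, 4))      # e.g. a stack of 4 views of 50x20 graphs
print(tensor_nuclear_norm(G))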
Citations: 0
Learning generalizable visual representations with causal diffusion model for controllable editing
IF 7.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-08-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.patcog.2026.113162
Shanshan Huang, Lei Wang, Haoxuan Chen, Yuxuan Liang, Li Liu
Representation learning has been widely employed to learn low-dimensional representations composed of multiple independent and interpretable generative factors, such as visual attributes in images, enabling controllable image editing by manipulating specific attributes in the learned representation space. In real-world scenarios, however, semantically meaningful generative factors are often causally related rather than independent. Previous methods built on the independence assumption fail to capture such causal relationships, even in supervised settings. To this end, we propose a diffusion-model-based causal representation learning framework, named CausalDiffuser, which models causal prior distributions with structural causal models (SCMs) to explicitly characterize the causal relations among the underlying generative factors. This modeling scheme encourages the framework to learn latent representations that capture the causality of the generative factors. Furthermore, a composite loss function is introduced to ensure causal disentanglement of the latent representations by incorporating supervision from ground-truth factors (i.e., image labels). Empirical evaluations on one synthetic dataset and two real-world benchmark datasets show that our approach significantly outperforms state-of-the-art methods. CausalDiffuser effectively edits image attributes by restoring the causal relationships among generative factors and generates counterfactual images through intervention operations.
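To illustrate the SCM-as-prior idea in a hedged way, here is a toy linear SCM over three latent factors, with a do-intervention that severs incoming edges (the mechanism underlying counterfactual generation); the adjacency matrix is an invented example, not the paper's learned graph.

# Hedged sketch of a linear SCM prior: z = A^T z + eps with a DAG adjacency A,
# so z = (I - A^T)^(-1) eps. A do-intervention on factor i cuts its incoming
# edges and clamps its value.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],              # factor 0 -> factor 1
              [0., 0., 1.],              # factor 1 -> factor 2
              [0., 0., 0.]])

def sample_z(eps, A):
    return np.linalg.inv(np.eye(len(A)) - A.T) @ eps

def intervene(eps, A, index, value):
    A_cut = A.copy()
    A_cut[:, index] = 0.0                # cut edges into the chosen factor
    eps_cut = eps.copy()
    eps_cut[index] = value               # clamp the factor's value
    return sample_z(eps_cut, A_cut)

eps = rng.normal(size=3)
print("observed z:     ", sample_z(eps, A))
print("do(z1 = 2.0) -> ", intervene(eps, A, index=1, value=2.0))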
Citations: 0