
Journal of Biomedical Informatics: Latest Articles

Lattice-based privacy-preserving multimodal retrieval for healthcare
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-03-01 | Epub Date: 2026-01-20 | DOI: 10.1016/j.jbi.2026.104990
Yingying Hou, Wenbin Yao, Xikang Zhu, Zeyu Li
Multimodal data plays a vital role in advancing personalized diagnosis and precision medicine. However, during cross-institutional sharing and collaborative analysis, the protection of patient privacy becomes increasingly critical, particularly in terms of the secure storage and fine-grained retrieval of sensitive medical data. Existing privacy-preserving technologies fail to meet the demands of secure and efficient retrieval over multimodal medical data. To address this challenge, we propose a generic multi-user multimodal searchable encryption framework for healthcare applications, which supports cross-modal retrieval based on trapdoors generated from ciphertexts corresponding to arbitrary modalities. We further design a distributed-decryption searchable encryption scheme, which is the first to combine AudioCLIP and multi-key fully homomorphic encryption for efficient retrieval of encrypted multimodal data. Additionally, we construct an attribute-based multimodal searchable encryption scheme as a complementary solution for implementing fine-grained access control. This enables flexible and controllable management of retrieval permissions over multimodal ciphertexts. Experimental results on MedMNIST and AudioSet demonstrate that our schemes achieve high retrieval efficiency and quantum-resistant security, meeting the requirements of real-world medical applications.
Citations: 0
DMDGRN: A data augmentation-based multilayer directed graph convolutional network for gene regulatory network inference
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-03-01 | Epub Date: 2026-01-14 | DOI: 10.1016/j.jbi.2026.104985
Pi-Jing Wei, Mingzhu Sun, Zheng Ding, Rui-Fen Cao, Zhen Gao, Chun-Hou Zheng

Objective

Gene regulatory networks (GRNs) provide a graphical representation of the regulatory interactions between transcription factors (TFs) and their target genes, governing transcriptional states that define cell identity and function. Deciphering GRNs is fundamental to understanding disease pathogenesis and remains a central challenge in systems biology. Graph neural network-based methods have made significant progress in GRN inference in recent years due to their exceptional ability to model graph-structured biological data. However, the inherent characteristics of GRNs, including their directionality, sparsity, and abundant high-order regulatory interactions, have usually been ignored.

Methods

In this study, we propose DMDGRN, a data augmentation-based multilayer directed graph convolutional network for GRN inference. To capture the direction of GRNs, DMDGRN employs a phase matrix to construct the Laplacian operator, which can track message propagation pathways. Considering the inherent sparsity of known GRNs, DMDGRN incorporates data augmentation techniques to overcome the network sparsity. Moreover, DMDGRN adopts a multilayer directed network architecture with residual connections to extract higher-order neighborhood information.
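
One common way to encode edge direction with a phase matrix is the magnetic Laplacian, in which direction becomes a complex phase on a symmetrized adjacency matrix. The abstract does not give DMDGRN's exact operator, so the following is only a minimal sketch under that assumption; the function name `magnetic_laplacian` and the charge parameter `q` are hypothetical and chosen for illustration.

```python
import cmath
import math


def magnetic_laplacian(adj, q=0.25):
    """Return the unnormalized magnetic Laplacian L = D - H of a directed graph.

    adj[i][j] = 1 means an edge i -> j (for example, TF i regulates gene j).
    q is an assumed "charge" parameter that scales the phase; it is not taken
    from the paper.
    """
    n = len(adj)
    # Symmetrized adjacency: edge strength regardless of direction.
    sym = [[0.5 * (adj[i][j] + adj[j][i]) for j in range(n)] for i in range(n)]
    # Degrees are computed from the symmetrized adjacency.
    deg = [sum(row) for row in sym]
    lap = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Phase matrix entry: theta_ij = 2*pi*q*(A_ij - A_ji).
            theta = 2.0 * math.pi * q * (adj[i][j] - adj[j][i])
            hermitian = sym[i][j] * cmath.exp(1j * theta)
            lap[i][j] = (deg[i] if i == j else 0.0) - hermitian
    return lap
```

Because the phase of entry (j, i) is the negative of the phase of entry (i, j), the resulting matrix is Hermitian, so it has real eigenvalues while still distinguishing i -> j from j -> i.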

Results

Comprehensive evaluations on benchmark datasets demonstrate that DMDGRN significantly improves GRN inference accuracy. Notably, an application to breast cancer data shows that our framework successfully identifies relevant therapeutic candidates for human breast cancer.

Conclusions

The findings demonstrate that the strategies we adopted are effective for inferring GRNs. The successful application to breast cancer data further highlights the potential of DMDGRN for uncovering disease-relevant regulatory mechanisms and identifying therapeutic targets, making it a promising tool for advancing both computational biology and translational medicine.
Citations: 0
Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-03-01 | Epub Date: 2026-01-17 | DOI: 10.1016/j.jbi.2026.104983
Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Dina Demner-Fushman

Objective:

Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability and high potential for harm in this domain mean that rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems.

Methods:

We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of 400 abstracts related to 40 consumer questions (Task 1) as well as identifying and replacing difficult terms in 300 abstracts spanning 30 consumer questions (Task 2). For automatic evaluation of Task 1, we developed a four-fold professionally-written reference set. Submissions for both tasks also received extensive manual evaluation by biomedical experts.

Results:

Twelve teams spanning twelve countries participated, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity.
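
Agreement between an automatic metric and manual judgments is often summarized with a rank correlation. The abstract does not say which statistic the track used, so the sketch below assumes Spearman's rho purely for illustration; the function names are hypothetical and the inputs are per-system scores supplied by the caller.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(metric_scores, manual_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(manual_scores))
```

A value near 1 would mean the automatic metric orders systems the same way human judges do; values near 0 correspond to the weak agreement the abstract describes.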

Conclusion:

The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
Citations: 0
Augmented intelligence for multimodal virtual biopsy in breast cancer using generative artificial intelligence
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-02-01 | Epub Date: 2025-12-26 | DOI: 10.1016/j.jbi.2025.104971
Aurora Rofena, Claudia Lucia Piccolo, Bruno Beomonte Zobel, Paolo Soda, Valerio Guarrasi

Objective:

This study aims to propose a multimodal, multi-view deep learning approach for breast cancer virtual biopsy, a non-invasive classification of breast lesions as malignant or benign, by integrating Full-Field Digital Mammography (FFDM) and Contrast-Enhanced Spectral Mammography (CESM). The work addresses the critical challenge of missing CESM data by introducing generative artificial intelligence (AI) to synthesize CESM images when unavailable, ensuring the continuity of diagnostic workflows.

Methods:

The proposed method uses FFDM and CESM images in both craniocaudal (CC) and mediolateral oblique (MLO) views. When CESM is missing, a CycleGAN-based generative model produces synthetic CESM images from FFDM inputs. For classification, three convolutional neural networks (ResNet18, ResNet50, and VGG16) are employed, and a two-stage late fusion strategy integrates view-specific and modality-specific malignancy probabilities, weighted by Matthews Correlation Coefficient (MCC), into a final malignancy score. The system’s robustness under varying degrees of missing CESM data is tested by incrementally replacing real CESM inputs with synthetic ones and evaluating classification performance using AUC, G-mean, and MCC.
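
The abstract does not give the exact fusion formula, so the following is a minimal sketch of one plausible MCC-weighted two-stage late fusion. It assumes that stage one fuses the CC and MLO views within each modality, that stage two fuses the modality-level scores, and that negative MCC values are clipped to zero; all function and argument names are hypothetical.

```python
def weighted_fusion(probs, mccs):
    """Fuse probabilities with weights proportional to validation MCC."""
    # Clip negative MCC (worse than chance) to zero: an assumption of this sketch.
    weights = [max(m, 0.0) for m in mccs]
    total = sum(weights)
    if total == 0.0:
        # Fall back to a plain average when no model is better than chance.
        return sum(probs) / len(probs)
    return sum(w * p for w, p in zip(weights, probs)) / total


def two_stage_fusion(view_probs, view_mccs, modality_mccs):
    """Combine view-level malignancy probabilities into one score.

    view_probs and view_mccs map modality -> view -> value;
    modality_mccs maps modality -> value.
    """
    modalities = sorted(view_probs)
    per_modality = []
    for m in modalities:
        views = sorted(view_probs[m])
        # Stage 1: fuse the views (e.g. CC and MLO) within one modality.
        per_modality.append(weighted_fusion(
            [view_probs[m][v] for v in views],
            [view_mccs[m][v] for v in views],
        ))
    # Stage 2: fuse the modality-level scores (e.g. FFDM and CESM).
    return weighted_fusion(per_modality, [modality_mccs[m] for m in modalities])
```

Weighting by MCC rather than accuracy gives more influence to branches that perform well on both classes, which matters when benign and malignant cases are imbalanced.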

Results:

CycleGAN achieved high-fidelity CESM synthesis, with Peak Signal-to-Noise Ratio (PSNR) exceeding 24 dB and Structural Similarity Index (SSIM) above 0.8 across both CC and MLO views. For lesion classification, the multimodal configuration combining FFDM and CESM consistently outperformed the unimodal FFDM-only setup. Notably, even when CESM was entirely replaced by synthetic images, the multimodal approach still improved virtual biopsy performance compared to FFDM alone. Although classification performance declined as the proportion of synthetic CESM increased, the use of synthetic data remained beneficial.
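
For reference, the peak signal-to-noise ratio is derived from the mean squared error between a real image and its synthetic counterpart. A minimal sketch, assuming images are passed as flat lists of pixel intensities and that `max_value` (a parameter of this illustration, not something the abstract states) is the largest possible intensity:

```python
import math


def psnr(reference, synthetic, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if not reference or len(reference) != len(synthetic):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

On this scale, higher values mean the synthetic image is closer to the real one; each 10 dB corresponds to a tenfold reduction in mean squared error.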

Conclusion:

This work demonstrates that generative AI can effectively address missing-modality challenges in breast cancer diagnostics by synthesizing CESM images to enhance FFDM-based virtual biopsy pipelines. In the absence of real CESM data, incorporating synthetic images improves lesion classification compared to using FFDM alone, offering a non-invasive alternative to support clinical decision-making. Moreover, by releasing the extended CESM@UCBM dataset, this study contributes a valuable resource for advancing research and innovation in breast multimodal diagnostic systems.
Citations: 0
A computational framework for predicting drug-target interactions by fusing gene ontology information with cross attention
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-02-01 | Epub Date: 2026-01-02 | DOI: 10.1016/j.jbi.2025.104976
Wenchao Cui, Pingjian Ding, Lingyun Luo, Shunheng Zhou, Hui Jiang

Motivation

Identifying drug–target interactions (DTIs) is a critical step in both drug discovery and drug repurposing. Accurate in silico prediction of DTIs can substantially reduce development time and costs. Recent advances in sequence-based methods have leveraged attention mechanisms to improve prediction accuracy. However, these approaches typically rely solely on the molecular structures of drugs and proteins, overlooking higher-level semantic information that reflects functional and biological relationships.

Results

In this work, we propose GODTI, a novel Gene Ontology-guided Drug-Target Interaction prediction model that enhances the performance through multimodal feature integration. GODTI comprises three major components: a feature extraction module, a multimodal fusion module, and an intermolecular interaction modeling module. In the protein feature extractor, both functional descriptors derived from Gene Ontology and sequence-based embeddings from amino acid sequences are obtained and combined. These protein representations are then integrated with drug molecular features via the multimodal fusion module and subsequently processed by the interaction modeling module to predict potential interactions. We evaluated GODTI under four realistic experimental settings, demonstrating consistent improvements over state-of-the-art baselines. Furthermore, case studies validated the practical utility of GODTI in accurately identifying novel, low-cost DTIs, underscoring its potential to accelerate drug discovery workflows.
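
The title mentions cross attention as the fusion mechanism. The abstract does not describe GODTI's layers, so the following is only a generic sketch of scaled dot-product cross attention, assuming queries come from one modality (say, drug features) and keys and values from another (say, protein features). Learned projection matrices are omitted, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and mixes the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out
```

In a fusion setting, the output has one row per query, so each drug token ends up with a summary of the protein features most relevant to it.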
Citations: 0
Beyond Fine-Tuning: Leveraging Domain-Aware In-Context learning with large language models for clinical named entity recognition
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2026-02-01 | Epub Date: 2026-01-08 | DOI: 10.1016/j.jbi.2026.104982
Siun Kim, David Seung U Lee, Yujin Kim, Hyung-Jin Yoon, Howard Lee

Background

Clinical named entity recognition (NER) is essential for structuring clinical narratives. While large language model (LLM)-based in-context learning (ICL) enables parameter-free adaptation, encoder-based fine-tuning has generally achieved superior performance in practical biomedical NER settings.

Objective

To systematically compare ICL and encoder-based fine-tuning for clinical NER under realistic constraints, and to determine whether optimizing ICL demonstration selection can close the performance gap.

Methods

We manually annotated 2,113 clinical notes from hematologic malignancy patients at Seoul National University Hospital and 400 MIMIC-IV notes. ICL configurations were optimized across task instructions, output formats, demonstration selection methods, sorting strategies, and pool sizes, using LLaMA-3.3-70B (open-source) via Ollama. Encoder fine-tuning was performed on both domain-specific and general-domain models, with RoBERTa-large representing the best encoder baseline. All models were evaluated as token-level classification tasks using macro and weighted F1, across in-domain, cross-domain, and cross-institutional scenarios.
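
The abstract does not specify which demonstration selection methods were compared. As one illustration, similarity-based selection picks the annotated notes closest to the query note. The sketch below assumes a bag-of-words cosine similarity as a stand-in for whatever representation the study used, and places the most similar example last; the function names and example notes are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_demonstrations(query, pool, k=2):
    """Return the k pool notes most similar to the query, most similar last."""
    q = Counter(query.lower().split())
    ranked = sorted(pool, key=lambda note: cosine(q, Counter(note.lower().split())))
    # Putting the closest example nearest the query is one possible sorting strategy.
    return ranked[-k:]
```

The selected notes, together with their gold annotations, would then be placed in the prompt ahead of the query note. Growing the pool changes what can be retrieved without any retraining.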

Results

Demonstration selection played a major role in determining ICL performance, improving macro F1 by up to 9.4 points over random selection under our experimental settings. In moderate-resource settings (500-sample pool), ICL exceeded RoBERTa-large fine-tuning by 4.7 macro F1 points and remained competitive up to 900 samples. Both ICL and fine-tuning experienced performance degradation in cross-domain evaluations, yet ICL demonstrated superior data efficiency, achieving competitive accuracy with substantially fewer labeled examples. ICL achieved in-domain macro F1 > 0.8 in several domains, outperforming full-data fine-tuned encoders, and delivered 6.3- to 11.6-point gains in cross-institutional transfer without parameter updates. At the largest pool size (∼1,900 samples), encoder-based fine-tuning regained the lead.

Conclusion

With optimized domain-aware demonstration selection, open-source LLM-based ICL can match or surpass encoder fine-tuning for clinical NER. Its ease of adaptation and ability to update knowledge via demonstration pools—without retraining—enable continuous improvement in dynamic, resource-limited healthcare settings.
Citations: 0
A multidimensional hierarchical framework for sources of bias in real-world healthcare evidence: a scoping review
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2026-02-01 Epub Date: 2026-01-20 DOI: 10.1016/j.jbi.2026.104989
Haeun Lee, Christelle Xiong, Derek Baughman, Chen Dun, Jiayi Tong, Benjamin Martin, Harold Lehmann, Paul Nagy

Objective

This study identifies and categorizes bias sources throughout the real-world evidence (RWE) generation process from electronic health records (EHRs), and we develop a multi-dimensional conceptual framework to characterize how bias arises in large-scale multinational federated network studies.

Methods

A three-phase bias framework spanning healthcare delivery, data management, and research was developed through the synthesis of existing frameworks, a structured literature review, and iterative assessment by multidisciplinary expert panels. A scoping review was conducted following PRISMA-ScR guidelines, analyzing studies between 2016 and 2025 in PubMed and Web of Science and focusing on bias in observational studies using real-world data. Bias sources were classified using directed content analysis based on their occurrence stage in the RWE generation process.

Results

Analysis of 220 papers within this framework identified 209 distinct bias sources categorized into seven specific levels: access to medical care (n = 40), provision of care (n = 29), data acquisition and measurement (n = 39), clinical documentation and coding practices (n = 32), data extraction (n = 22), data modeling (n = 11), and data analytics (n = 36). Healthcare-phase biases were most prevalent (n = 108), followed by the data management (n = 54) and research (n = 47) phases.

Conclusion

This multi-dimensional framework reveals that bias sources in RWE generation are interconnected across patient, provider, administrative, information technology, informatics, and analytical domains, and provides a structural foundation for understanding where and how bias may arise across the RWE process in large-scale observational research.
Citations: 0
A federated learning framework for ethical dynamic treatment allocation across heterogeneous hospitals
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2026-02-01 Epub Date: 2026-01-16 DOI: 10.1016/j.jbi.2026.104987
Xenia Konti, Nicoleta J. Economou-Zavlanos, Yi Shen, Giorgos Stamou, Armando Bedoya, Michael J. Pencina, Chuan Hong, Michael M. Zavlanos

Objective

In this paper, we propose an adaptive federated learning framework to learn optimal treatments for individual hospitals that possibly serve different patient populations. The proposed framework can enable the design of more efficient treatment allocation problems.

Methods

We propose a federated treatment recommendation strategy that is formulated, for each hospital, as a Multi-Armed Bandit (MAB) problem. The process is coordinated by a lead hospital that adaptively learns and transfers Upper Confidence Bounds (UCB) across similar hospitals and Personalized Upper Bounds across heterogeneous hospitals. We test our proposed method in a simulated clinical trial environment created using real COVID-19 data from the Duke University Health System.
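
For readers unfamiliar with the bandit formulation, the following is a minimal sketch of the textbook UCB1 rule for a single hospital choosing among treatments with binary outcomes. It is not the authors' federated method: there is no lead hospital, no bound transfer, and no personalization here, and the success probabilities are invented for illustration.

```python
import math
import random


def ucb1(success_probs, n_rounds, seed=0):
    """Run textbook UCB1 on Bernoulli arms; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    pulls = [0] * n_arms    # times each treatment was given
    means = [0.0] * n_arms  # running mean outcome per treatment

    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # give every treatment once before comparing
        else:
            # Optimism under uncertainty: empirical mean plus a bonus
            # that shrinks as a treatment accumulates observations.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / pulls[a]),
            )
        # Simulated binary outcome (hypothetical success probabilities).
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]
    return pulls


# Three hypothetical treatments; the third is the best on average.
pulls = ucb1([0.3, 0.5, 0.8], n_rounds=2000)
print(pulls)
```

Every pull of a sub-optimal arm corresponds to a patient who did not receive the best treatment, which is why tighter confidence bounds (for example, bounds informed by other hospitals) would reduce that count.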

Results

Our method relies on collaboration among hospitals, which reduces the number of data samples needed per institution while protecting the privacy of individual patient data. At the same time, it ensures fairness of the learned treatments by mitigating possible biases due to differences in the patient populations treated across different hospitals. Finally, our method improves the safety of the learning procedure by reducing the number of patients administered sub-optimal treatments at each hospital. In the experiments, we show that our proposed method outperforms other state-of-the-art approaches in that it requires up to 36%–75% less patient data to learn the optimal treatment for each hospital and administers the optimal treatment to 0.95%–48.6% more patients.

Conclusion

In this paper, we propose an adaptive federated learning strategy for treatment recommendation tasks that learns optimal treatments for individual hospitals that possibly serve different patient populations, while satisfying privacy, fairness, and safety considerations.
Citations: 0
RINet: synthetic data training for indirect estimation of clinical reference distributions
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2026-02-01 Epub Date: 2026-01-08 DOI: 10.1016/j.jbi.2026.104980
Jack LeBien, Julian Velev, Abiel Roche-Lima

Background

Indirect methods for estimating clinical reference intervals (RIs) use statistical analysis to identify non-pathological sub-distributions within large datasets acquired from routine clinical testing. This approach has the potential to accelerate the estimation of precise RIs, accounting for influential variables such as age, gender, and ethnicity. Most existing methods are based on traditional statistics and hand-crafted algorithms. The investigation of supervised learning, which often outperforms traditional approaches, has been impeded by the limitations of real-world data. However, previous studies have widely used synthetic data for evaluating and benchmarking indirect methods due to several advantages over real-world data, including greater control, variability, accessibility, and the availability of exact ground-truth RIs. Synthetic data may also provide a pathway for developing data-driven solutions for indirect RI estimation.

Methods

In this study, we leveraged synthetic data to train two convolutional neural networks (CNNs) to predict the parameters of underlying reference distributions (RDs) in diverse real-world clinical datasets. While one model was trained for standard univariate data, the other was extended to bivariate data, enabling the prediction of covariance between clinical analytes. Trained models were evaluated using both real-world and synthetic test datasets and compared with four alternative algorithms.
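
To make the setup concrete, here is a minimal sketch of how a synthetic training example with an exact ground-truth RI might be generated. It assumes a Gaussian reference distribution contaminated by a Gaussian pathological component; the abstract does not state the distribution families used, so the function names, parameters, and values below are illustrative assumptions only.

```python
import random
from statistics import NormalDist


def reference_interval(mu, sigma, coverage=0.95):
    """Central interval of a Gaussian reference distribution."""
    tail = (1 - coverage) / 2
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


def synthetic_sample(n, mu, sigma, path_mu, path_sigma, path_frac, seed=0):
    """Draw n results from a healthy/pathological Gaussian mixture."""
    rng = random.Random(seed)
    return [
        rng.gauss(path_mu, path_sigma)   # pathological sub-population
        if rng.random() < path_frac
        else rng.gauss(mu, sigma)        # reference (healthy) population
        for _ in range(n)
    ]


# Hypothetical analyte: healthy values around 100, pathological around 140.
low, high = reference_interval(100.0, 10.0)
sample = synthetic_sample(1000, 100.0, 10.0, 140.0, 15.0, path_frac=0.2)
print(round(low, 1), round(high, 1))  # 80.4 119.6
```

The mixed `sample` would play the role of the model input, while `(mu, sigma)` and the interval derived from them serve as labels that are known exactly, which is the advantage of synthetic data that the abstract points to.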

Results

Model predictions closely matched directly estimated RIs and RDs in real-world data and known RDs in synthetic data, outperforming four alternative indirect methods: GMM, refineR, reflimR, and RINetv1. Using labeled healthy and HCV-positive groups in real data, we compared established univariate RIs with predicted multivariate reference regions (MRRs). On average, the MRRs showed 1) higher coverage of healthy patients (closer to the desired 95%) and 2) smaller regions, which reduce the likelihood of including abnormal values.

Conclusions

Synthetic data training is a viable approach for developing accurate indirect RI estimation models for both univariate and bivariate clinical data. This strategy could help address some limitations of real-world data, direct analyses, and univariate RIs.
Citations: 0
FG-DDI: Functional group-aware graph neural networks for drug–drug interaction prediction
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2026-02-01 Epub Date: 2026-01-12 DOI: 10.1016/j.jbi.2026.104981
Fangyu Zhou, Shahadat Uddin

Objective:

We aim to improve drug–drug interaction (DDI) prediction by explicitly injecting medicinal-chemistry knowledge of functional groups (FGs) into graph neural network (GNN) message passing, in both transductive and inductive settings. Our goal is to (i) encode FG priors in a trainable way that enhances representation quality without handcrafting features, and (ii) yield interpretable attributions that align learned weights with pharmacologically meaningful FG patterns.

Methods:

We introduce FG-DDI, a dual-view GNN that augments both intra- and inter-molecular reasoning. At the intra-molecular level, atom/bond messages are scaled by FG enrichment weights derived from detected FG motifs within each drug graph. At the inter-molecular level, a bipartite message-passing layer between a drug pair is modulated by FG–FG enrichment scores that reflect empirical co-occurrence in known DDIs. Enrichment is computed as odds ratios from corpus statistics and injected via learnable gates, ensuring differentiability and allowing data to override noisy priors. We couple this with standard supervision on interaction labels and report accuracy (ACC), AUROC, average precision (AP), and F1. Experiments use DrugBank (1706 drugs; 86 interaction types) and TwoSides (filtered triplets) under transductive and inductive splits (one unseen; both unseen). We perform ablations removing each FG term to isolate contributions and assess stability across splits.
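
The enrichment step described above can be sketched roughly as follows: an odds ratio is computed from a 2x2 contingency table of corpus counts, and a gate decides how strongly that prior scales a message. The abstract does not give the exact formula or gate design, so the table layout, the 0.5 pseudo-count (Haldane–Anscombe correction), and the `gated_weight` form are assumptions made for illustration.

```python
import math


def odds_ratio(both, fg_only, ddi_only, neither, pseudo=0.5):
    """Odds ratio of interaction given presence of an FG-FG pair.

    both:     drug pairs with the FG pair that interact
    fg_only:  drug pairs with the FG pair that do not interact
    ddi_only: drug pairs without the FG pair that interact
    neither:  drug pairs without the FG pair that do not interact
    A pseudo-count keeps the ratio finite when a cell is zero.
    """
    a, b, c, d = (x + pseudo for x in (both, fg_only, ddi_only, neither))
    return (a * d) / (b * c)


def gated_weight(log_or, gate_logit):
    """Hypothetical gate: scale a message by 1 + g * tanh(log OR).

    g = sigmoid(gate_logit) would be learned in training; when g is
    near 0 the prior is ignored and the weight falls back to 1.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return 1.0 + gate * math.tanh(log_or)


# Hypothetical counts for one FG-FG pair.
enrichment = odds_ratio(both=30, fg_only=10, ddi_only=20, neither=40)
weight = gated_weight(math.log(enrichment), gate_logit=0.0)
print(enrichment > 1.0, weight > 1.0)  # True True
```

Keeping the gate as a smooth function of a learnable logit is one way to satisfy the two properties the abstract names: the weight stays differentiable, and training can shrink the gate to override a noisy prior.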

Results:

Comprehensive experiments on the DrugBank and TwoSides datasets demonstrate that FG-DDI achieves superior performance compared to state-of-the-art methods. For DrugBank, accuracy improves by 0.36% in the transductive setting and by 0.46% and 1.42% in the inductive setting for the S1 and S2 partitions, respectively.

Conclusion:

By systematically integrating chemical domain knowledge into deep learning architectures, this approach enables better generalization to unseen drug combinations while maintaining computational efficiency, making it particularly valuable for real-world pharmaceutical applications where new drugs continuously enter the market.
Citations: 0