
Journal of Biomedical Informatics: Latest Publications

Sleep apnea test prediction based on Electronic Health Records.
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-11-01 · DOI: 10.1016/j.jbi.2024.104737
Lama Abu Tahoun, Amit Shay Green, Tal Patalon, Yaron Dagan, Robert Moskovitch

Obstructive Sleep Apnea (OSA) is identified by a polysomnography test, which is often performed only later in life. Being able to notify potentially affected insured members at earlier ages is therefore desirable. To that end, we develop predictive models that rely on Electronic Health Records (EHR) and predict whether a person will undergo a sleep apnea test after the age of 50. A major challenge is the variability of EHR records across insured members over the years, which this study also investigates in the context of control matching and prediction. Since there are many temporal variables, the RankLi method was introduced for temporal variable selection. This approach employs the t-test to calculate a divergence score for each temporal variable between the target classes. We also investigate the need to consider the number of EHR records as part of control matching, and whether modeling separately for subgroups defined by the number of EHR records is more effective. For each prediction task, we trained four different classifiers (1-CNN, LSTM, Random Forest, and Logistic Regression) on data up to the age of 40 or 50, and on several numbers of temporal variables. Using the number of EHR records for control matching proved crucial, and training separate models for subsets of the population according to their number of EHR records was more effective. The deep learning models, particularly the 1-CNN, achieved the highest balanced accuracy and AUC scores in both the male and female groups. In the male group, the highest results were observed at age 50 with 100 temporal variables: a balanced accuracy of 90% and an AUC of 93%.
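The RankLi selection step, as described, ranks temporal variables by a t-test divergence score between the target classes. A minimal sketch of that idea follows; the variable names and toy data are illustrative, and the paper's exact scoring details may differ:

```python
import math

def t_statistic(a, b):
    """Welch's t-statistic between two samples (stdlib only, no SciPy)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_temporal_variables(cases, controls, top_k):
    """Rank temporal variables by absolute t-statistic between the target
    classes and keep the top_k most divergent ones.
    cases / controls: dicts mapping variable name -> list of values."""
    scores = {var: abs(t_statistic(cases[var], controls[var])) for var in cases}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example: BMI diverges between classes far more than heart rate.
cases = {"bmi": [31.0, 33.5, 29.8, 34.2], "hr": [72, 75, 71, 74]}
controls = {"bmi": [24.1, 23.0, 25.2, 22.8], "hr": [73, 70, 74, 72]}
print(rank_temporal_variables(cases, controls, top_k=1))  # → ['bmi']
```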

Citations: 0
PLRTE: Progressive learning for biomedical relation triplet extraction using large language models
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-11-01 · DOI: 10.1016/j.jbi.2024.104738
Yi-Kai Zheng, Bi Zeng, Yi-Chun Feng, Lu Zhou, Yi-Xue Li
Document-level relation triplet extraction is crucial in biomedical text mining, aiding in drug discovery and the construction of biomedical knowledge graphs. Current language models face challenges in generalizing to unseen datasets and relation types in biomedical relation triplet extraction, which limits their effectiveness in these crucial tasks. To address this challenge, our study optimizes models from two critical dimensions: data-task relevance and granularity of relations, aiming to enhance their generalization capabilities significantly. We introduce a novel progressive learning strategy to obtain the PLRTE model. This strategy not only enhances the model’s capability to comprehend diverse relation types in the biomedical domain but also implements a structured four-level progressive learning process through semantic relation augmentation, compositional instruction, and dual-axis level learning. Our experiments on the DDI and BC5CDR document-level biomedical relation triplet datasets demonstrate a significant performance improvement of 5% to 20% over the current state-of-the-art baselines. Furthermore, our model exhibits exceptional generalization capabilities on the unseen Chemprot and GDA datasets, further validating the effectiveness of optimizing data-task association and relation granularity for enhancing model generalizability.
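The four-level progressive process can be pictured as a curriculum that feeds the model easier views of the triplet task before harder ones. The level names and scheduler below are illustrative assumptions, not the paper's exact scheme:

```python
# Hypothetical level names; the paper defines its own four levels via semantic
# relation augmentation, compositional instruction, and dual-axis level learning.
LEVELS = [
    "entity mentions",         # easiest: recognize candidate entities
    "coarse relation types",   # coarse-grained relation classification
    "fine relation types",     # fine-grained relation classification
    "document triplets",       # hardest: full (head, relation, tail) extraction
]

def progressive_schedule(samples_by_level, epochs_per_level=1):
    """Yield (level, sample) pairs so training sees easier sub-tasks first."""
    for level in LEVELS:
        for _ in range(epochs_per_level):
            for sample in samples_by_level.get(level, []):
                yield level, sample
```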
Citations: 0
Adapting the open-source Gen3 platform and kubernetes for the NIH HEAL IMPOWR and MIRHIQL clinical trial data commons: Customization, cloud transition, and optimization
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-11-01 · DOI: 10.1016/j.jbi.2024.104749
Meredith C.B. Adams, Colin Griffin, Hunter Adams, Stephen Bryant, Robert W. Hurley, Umit Topaloglu

Objective

This study aims to provide the decision-making framework, strategies, and software used to successfully deploy the first combined chronic pain and opioid use clinical trial data commons using the Gen3 platform.

Materials and Methods

The approach involved adapting the open-source Gen3 platform and Kubernetes for the needs of the NIH HEAL IMPOWR and MIRHIQL networks. Key steps included customizing the Gen3 architecture, transitioning from Amazon to Google Cloud, adapting data ingestion and harmonization processes, ensuring security and compliance for the Kubernetes environment, and optimizing performance and user experience.

Results

The primary result was a fully operational IMPOWR data commons built on Gen3. Key features include a modular architecture supporting diverse clinical trial data types, automated processes for data management, fine-grained access control and auditing, and researcher-friendly interfaces for data exploration and analysis.
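As a concept sketch of the fine-grained access control and auditing listed among the key features, consider the following; the permission table and function are illustrative stand-ins and do not reflect Gen3's actual authorization API:

```python
from datetime import datetime, timezone

# Illustrative role/dataset permission table and audit trail (not Gen3's real API).
PERMISSIONS = {
    ("researcher", "impowr_trial"): {"read"},
    ("data_admin", "impowr_trial"): {"read", "write"},
}
AUDIT_LOG = []

def access(role, dataset, action):
    """Check a permission and record the attempt, whether allowed or denied."""
    allowed = action in PERMISSIONS.get((role, dataset), set())
    AUDIT_LOG.append((datetime.now(timezone.utc).isoformat(), role, dataset, action, allowed))
    return allowed
```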

Discussion

The successful development of the Wake Forest IDEA-CC data commons represents a significant milestone for chronic pain and addiction research. Harmonized, FAIR data from diverse studies can be discovered in a secure, scalable repository. Challenges remain in long-term maintenance and governance, but the commons provides a foundation for accelerating scientific progress. Key lessons learned include the importance of engaging both technical and domain experts, the need for flexible yet robust infrastructure, and the value of building on established open-source platforms.

Conclusion

The WF IDEA-CC Gen3 data commons demonstrates the feasibility and value of developing a shared data infrastructure for chronic pain and opioid use research. The lessons learned can inform similar efforts in other clinical domains.
Citations: 0
A mother-child data linkage approach using data from the information system for the development of research in primary care (SIDIAP) in Catalonia
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-11-01 · DOI: 10.1016/j.jbi.2024.104747
E. Segundo, M. Far, C.I. Rodríguez-Casado, J.M. Elorza, J. Carrere-Molina, R. Mallol-Parera, M. Aragón

Background

Large-scale clinical databases containing routinely collected electronic health record (EHR) data are a valuable source of information for research studies. For example, they can be used in pharmacoepidemiology studies to evaluate the effects of maternal medication exposure on neonatal and pediatric outcomes. Yet, this type of study is infeasible without proper mother–child linkage.

Methods

We leveraged all eligible active records (N = 8,553,321) of the Information System for Research in Primary Care (SIDIAP) database. Mothers and infants were linked using a deterministic approach and linkage accuracy was evaluated in terms of the number of records from candidate mothers that failed to link. We validated the mother–child links identified by comparison of linked and unlinked records for both candidate mothers and descendants. Differences across these two groups were evaluated by means of effect size calculations instead of p-values. Overall, we described our data linkage process following the GUidance for Information about Linking Data sets (GUILD) principles.
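A minimal sketch of the two ingredients named above: deterministic linkage and effect-size comparison. Here a child is linked only when exactly one candidate mother shares a household key (an illustrative key choice; SIDIAP's actual linkage variables differ), and groups are compared with Cohen's d rather than p-values:

```python
import math
from dataclasses import dataclass

@dataclass
class Record:
    person_id: str
    household_id: str   # illustrative linkage key, not SIDIAP's actual one

def deterministic_link(mothers, children):
    """Link each child to the unique candidate mother sharing its household
    key; ambiguous households are left unlinked (deterministic matching)."""
    by_household = {}
    for m in mothers:
        by_household.setdefault(m.household_id, []).append(m)
    links = {}
    for c in children:
        candidates = by_household.get(c.household_id, [])
        if len(candidates) == 1:
            links[c.person_id] = candidates[0].person_id
    return links

def cohens_d(a, b):
    """Effect size (Cohen's d) used instead of p-values to compare groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

mothers = [Record("m1", "h1"), Record("m2", "h2"), Record("m3", "h2")]
children = [Record("c1", "h1"), Record("c2", "h2")]
print(deterministic_link(mothers, children))  # → {'c1': 'm1'} (h2 is ambiguous)
```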

Results

We were able to identify 744,763 unique mother–child relationships, linking 83.8% of candidate mothers with delivery dates within a period of 15 years. Of note, we provide a record-level category label used to derive a global confidence metric for the presented linkage process. Our validation analysis showed that the two groups were similar in terms of a number of aggregated attributes.

Conclusions

Complementing the SIDIAP database with mother–child links will allow clinical researchers to expand their epidemiologic studies with the ultimate goal of improving outcomes for pregnant women and their children. Importantly, the reported information at each step of the data linkage process will contribute to the validity of analyses and interpretation of results in future studies using this resource.
Citations: 0
Triple and quadruple optimization for feature selection in cancer biomarker discovery
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-10-11 · DOI: 10.1016/j.jbi.2024.104736
L. Cattelani, V. Fortino
The proliferation of omics data has advanced cancer biomarker discovery but often falls short in external validation, mainly due to a narrow focus on prediction accuracy that neglects clinical utility and validation feasibility. We introduce three- and four-objective optimization strategies based on genetic algorithms to identify clinically actionable biomarkers in omics studies, addressing classification tasks aimed at distinguishing hard-to-differentiate cancer subtypes beyond histological analysis alone. Our hypothesis is that by optimizing more than one characteristic of cancer biomarkers, we may identify biomarkers that will enhance their success in external validation. Our objectives are to: (i) assess the biomarker panel’s accuracy using a machine learning (ML) framework; (ii) ensure the biomarkers exhibit significant fold-changes across subtypes, thereby boosting the success rate of PCR or immunohistochemistry validations; (iii) select a concise set of biomarkers to simplify the validation process and reduce clinical costs; and (iv) identify biomarkers crucial for predicting overall survival, which plays a significant role in determining the prognostic value of cancer subtypes. We implemented and applied triple and quadruple optimization algorithms to renal carcinoma gene expression data from TCGA. The study targets kidney cancer subtypes that are difficult to distinguish through histopathology methods. Selected RNA-seq biomarkers were assessed against the gold standard method, which relies solely on clinical information, and in external microarray-based validation datasets. Notably, these biomarkers achieved an accuracy above 0.8 in external validations and added significant value to survival predictions, outperforming the use of clinical data alone with a superior c-index. The provided tool also helps explore the trade-off between objectives, offering multiple solutions for clinical evaluation before proceeding to costly validation or clinical trials.
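The four objectives can be collected into a fitness vector over which a genetic algorithm keeps Pareto-optimal panels. A hedged sketch of that scoring and dominance check, where the callables stand in for the paper's ML framework and survival model:

```python
def panel_fitness(panel, accuracy, fold_change, survival_cindex):
    """Objective vector for a candidate biomarker panel, maximization form:
    (i) classification accuracy, (ii) smallest absolute log2 fold-change in
    the panel (validation feasibility), (iii) negated panel size (smaller is
    cheaper to validate), (iv) survival c-index. `accuracy` and
    `survival_cindex` are stand-in callables for the real evaluations."""
    return (accuracy(panel),
            min(abs(fold_change[g]) for g in panel),
            -len(panel),
            survival_cindex(panel))

def dominates(f1, f2):
    """Pareto dominance: f1 at least as good everywhere, strictly better once."""
    return all(a >= b for a, b in zip(f1, f2)) and any(a > b for a, b in zip(f1, f2))

fc = {"g1": 2.5, "g2": -1.2}   # hypothetical log2 fold-changes per gene
f = panel_fitness(["g1", "g2"], lambda p: 0.85, fc, lambda p: 0.70)
print(f)  # → (0.85, 1.2, -2, 0.7)
```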
Citations: 0
Improving tabular data extraction in scanned laboratory reports using deep learning models
IF 4.0 · CAS Tier 2 (Medicine) · Q2 Computer Science, Interdisciplinary Applications · Pub Date: 2024-10-10 · DOI: 10.1016/j.jbi.2024.104735
Yiming Li, Qiang Wei, Xinghan Chen, Jianfu Li, Cui Tao, Hua Xu

Objective

Medical laboratory testing is essential in healthcare, providing crucial data for diagnosis and treatment. Nevertheless, patients’ lab testing results are often transferred via fax across healthcare organizations and are not immediately available for timely clinical decision making. Thus, it is important to develop new technologies to accurately extract lab testing information from scanned laboratory reports. This study aims to develop an advanced deep learning-based Optical Character Recognition (OCR) method to identify tables containing lab testing results in scanned laboratory reports.

Methods

Extracting tabular data from scanned lab reports involves two stages: table detection (i.e., identifying the area of a table object) and table recognition (i.e., identifying and extracting tabular structures and contents). The DETR R18 algorithm and YOLOv8s were used for table detection, and we compared the performance of PaddleOCR and the encoder-dual-decoder (EDD) model for table recognition. 650 tables from 632 randomly selected laboratory test reports were annotated and used to train and evaluate those models. For table detection evaluation, we used metrics such as Average Precision (AP), Average Recall (AR), AP50, and AP75. For table recognition evaluation, we employed Tree-Edit Distance (TEDS).
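For the detection metrics above, predictions are matched to ground-truth tables at an IoU threshold (0.5 for AP50, 0.75 for AP75). A simplified sketch of that matching step, assuming axis-aligned (x1, y1, x2, y2) boxes; the full AP computation also sweeps confidence scores, omitted here:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) table bounding boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_at_threshold(preds, gts, thr):
    """Greedily match predicted to ground-truth tables one-to-one at an IoU
    threshold; returns the number of true positives."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    return tp

preds, gts = [(0, 0, 10, 10)], [(1, 1, 10, 10)]
print(match_at_threshold(preds, gts, 0.5))  # → 1 (IoU 0.81 passes the AP50 threshold)
```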

Results

For table detection, fine-tuned DETR R18 demonstrated superior performance (AP50: 0.774; AP75: 0.644; AP: 0.601; AR: 0.766). In terms of table recognition, fine-tuned EDD outperformed other models with a TEDS score of 0.815. The proposed OCR pipeline (fine-tuned DETR R18 and fine-tuned EDD), demonstrated impressive results, achieving a TEDS score of 0.699 and a TEDS structure score of 0.764.

Conclusions

Our study presents a dedicated OCR pipeline for scanned clinical documents, utilizing state-of-the-art deep learning models for region-of-interest detection and table recognition. The high TEDS scores demonstrate the effectiveness of our approach, which has significant implications for clinical data analysis and decision-making.
{"title":"Improving tabular data extraction in scanned laboratory reports using deep learning models","authors":"Yiming Li ,&nbsp;Qiang Wei ,&nbsp;Xinghan Chen ,&nbsp;Jianfu Li ,&nbsp;Cui Tao ,&nbsp;Hua Xu","doi":"10.1016/j.jbi.2024.104735","DOIUrl":"10.1016/j.jbi.2024.104735","url":null,"abstract":"<div><h3>Objective</h3><div>Medical laboratory testing is essential in healthcare, providing crucial data for diagnosis and treatment. Nevertheless, patients’ lab testing results are often transferred via fax across healthcare organizations and are not immediately available for timely clinical decision making. Thus, it is important to develop new technologies to accurately extract lab testing information from scanned laboratory reports. This study aims to develop an advanced deep learning-based Optical Character Recognition (OCR) method to identify tables containing lab testing results in scanned laboratory reports.</div></div><div><h3>Methods</h3><div>Extracting tabular data from scanned lab reports involves two stages: table detection (i.e., identifying the area of a table object) and table recognition (i.e., identifying and extracting tabular structures and contents). DETR R18 algorithm as well as YOLOv8s were involved for table detection, and we compared the performance of PaddleOCR and the encoder-dual-decoder (EDD) model for table recognition. 650 tables from 632 randomly selected laboratory test reports were annotated and used to train and evaluate those models. For table detection evaluation, we used metrics such as Average Precision (AP), Average Recall (AR), AP50, and AP75. For table recognition evaluation, we employed Tree-Edit Distance (TEDS).</div></div><div><h3>Results</h3><div>For table detection, fine-tuned DETR R18 demonstrated superior performance (AP50: 0.774; AP75: 0.644; AP: 0.601; AR: 0.766). In terms of table recognition, fine-tuned EDD outperformed other models with a TEDS score of 0.815. 
The proposed OCR pipeline (fine-tuned DETR R18 and fine-tuned EDD), demonstrated impressive results, achieving a TEDS score of 0.699 and a TEDS structure score of 0.764.</div></div><div><h3>Conclusions</h3><div>Our study presents a dedicated OCR pipeline for scanned clinical documents, utilizing state-of-the-art deep learning models for region-of-interest detection and table recognition. The high TEDS scores demonstrate the effectiveness of our approach, which has significant implications for clinical data analysis and decision-making.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"159 ","pages":"Article 104735"},"PeriodicalIF":4.0,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142406431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
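TEDS, the table-recognition measure reported above, normalizes a tree edit distance between the predicted and ground-truth table trees. A minimal sketch of the idea, assuming unit costs for insert, delete, and relabel (the full metric as published also weighs cell-content similarity, which this sketch omits):

```python
from functools import lru_cache

# A table is modeled as an ordered tree: (label, (child, child, ...)) — hashable tuples.

def size(tree):
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)
    if not f2:
        return sum(size(t) for t in f1)
    (l1, c1), rest1 = f1[-1], f1[:-1]
    (l2, c2), rest2 = f2[-1], f2[:-1]
    return min(
        forest_dist(rest1 + c1, f2) + 1,               # delete the last root of f1
        forest_dist(f1, rest2 + c2) + 1,               # insert the last root of f2
        forest_dist(rest1, rest2) + forest_dist(c1, c2)
        + (0 if l1 == l2 else 1),                      # match the two roots
    )

def teds(t1, t2):
    """Tree-Edit-Distance-based Similarity in [0, 1]; 1.0 means identical trees."""
    return 1.0 - forest_dist((t1,), (t2,)) / max(size(t1), size(t2))
```

For example, two 4-node tables that differ in one cell tag (`td` vs `th`) are one relabel apart, giving a TEDS of 1 − 1/4 = 0.75.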
Learning to match patients to clinical trials using large language models
IF 4 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2024-10-09 DOI: 10.1016/j.jbi.2024.104734
Maciej Rybinski , Wojciech Kusa , Sarvnaz Karimi , Allan Hanbury

Objective:

This study investigates the use of Large Language Models (LLMs) for matching patients to clinical trials (CTs) within an information retrieval pipeline. Our objective is to enhance the process of patient-trial matching by leveraging the semantic processing capabilities of LLMs, thereby improving the effectiveness of patient recruitment for clinical trials.

Methods:

We employed a multi-stage retrieval pipeline integrating various methodologies, including BM25 and Transformer-based rankers, along with LLM-based methods. Our primary datasets were the TREC Clinical Trials 2021–23 track collections. We compared LLM-based approaches, focusing on methods that leverage LLMs in query formulation, filtering, relevance ranking, and re-ranking of CTs.
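As an illustration of the first retrieval stage, a self-contained BM25 ranker over tokenized documents — a generic sketch under the usual k1/b defaults, not the authors' implementation:

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank document indices by BM25 score against the query (best first).

    docs: list of token lists, e.g. tokenized clinical-trial descriptions."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])
```

In a multi-stage pipeline like the one described, the top-k indices from this stage would then be filtered and re-ranked by the LLM-based components.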

Results:

Our results indicate that LLM-based systems, particularly those involving re-ranking with a fine-tuned LLM, outperform traditional methods in terms of nDCG and Precision measures. The study demonstrates that fine-tuning LLMs enhances their ability to find eligible trials. Moreover, our LLM-based approach is competitive with state-of-the-art systems in the TREC challenges.
The study shows the effectiveness of LLMs in CT matching, highlighting their potential in handling complex semantic analysis and improving patient-trial matching. However, the use of LLMs increases the computational cost and reduces efficiency. We provide a detailed analysis of effectiveness-efficiency trade-offs.
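nDCG, the ranking measure reported in these results, discounts graded relevance by rank position and normalizes by the ideal ordering; a standard implementation:

```python
import math

def dcg(rels):
    """Discounted cumulative gain over graded relevance labels in ranked order."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels, k=None):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    top = ranked_rels[:k] if k is not None else ranked_rels
    ideal = sorted(ranked_rels, reverse=True)[:len(top)]
    idcg = dcg(ideal)
    return dcg(top) / idcg if idcg > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; any swap of a more-relevant item below a less-relevant one lowers the score.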

Conclusion:

This research demonstrates the promising role of LLMs in enhancing the patient-to-clinical trial matching process, offering a significant advancement in the automation of patient recruitment. Future work should explore optimising the balance between computational cost and retrieval effectiveness in practical applications.
Journal of Biomedical Informatics, Volume 159, Article 104734
Augmenting biomedical named entity recognition with general-domain resources
IF 4 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2024-10-04 DOI: 10.1016/j.jbi.2024.104731
Yu Yin , Hyunjae Kim , Xiao Xiao , Chih Hsuan Wei , Jaewoo Kang , Zhiyong Lu , Hua Xu , Meng Fang , Qingyu Chen

Objective

Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.

Methods

We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.
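The two-stage recipe above can be sketched at the batch-scheduling level — a hypothetical simplification in which stage 1 shuffles batches from the target BioNER corpus and the general-domain corpus into one multi-task stream, and stage 2 fine-tunes on the target corpus alone (corpus names and mixing ratio are illustrative, not the paper's):

```python
import random

def gerbera_schedule(n_bioner_batches, n_general_batches, n_finetune_batches, seed=0):
    """Return the sequence of corpus names a shared-encoder trainer would draw from.

    Stage 1: shuffled mix of target-BioNER and general-domain NER batches.
    Stage 2: target-BioNER batches only (fine-tuning)."""
    rng = random.Random(seed)
    stage1 = ["bioner"] * n_bioner_batches + ["general"] * n_general_batches
    rng.shuffle(stage1)
    stage2 = ["bioner"] * n_finetune_batches
    return stage1 + stage2
```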

Results

We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
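The F1 scores reported here are entity-level: a predicted entity counts only on an exact match of span boundaries and type. A minimal micro-averaged version:

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level F1.

    gold/pred: one set of (start, end, type) spans per sentence;
    a prediction is correct only on exact span-and-type match."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```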

Conclusion

This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, code, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Journal of Biomedical Informatics, Volume 159, Article 104731
Clinical outcome-guided deep temporal clustering for disease progression subtyping
IF 4 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2024-10-01 DOI: 10.1016/j.jbi.2024.104732
Dulin Wang , Xiaotian Ma , Paul E. Schulz , Xiaoqian Jiang , Yejin Kim

Objective

Complex diseases exhibit heterogeneous progression patterns, necessitating effective capture and clustering of longitudinal changes to identify disease subtypes for personalized treatments. However, existing studies often fail to design clustering-specific representations or neglect clinical outcomes, thereby limiting the interpretability and clinical utility.

Method

We design a unified framework for subtyping longitudinal progressive diseases. We focus on effectively integrating all data from disease progressions and improving patient representation for downstream clustering. Specifically, we propose a clinical Outcome-Guided Deep Temporal Clustering (OG-DTC) that generates representations informed by clustering and clinical outcomes. A GRU-based seq2seq architecture captures the temporal dynamics, and the model integrates k-means clustering and outcome regression to facilitate the formation of clustering structures and the integration of clinical outcomes. The learned representations are clustered using a Gaussian mixture model to identify distinct subtypes. The clustering results are extensively validated through reproducibility, stability, and significance tests.
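The joint objective combines the three ingredients described above; a scalar sketch with assumed weighting hyperparameters alpha and beta (the paper's actual loss forms and weights may differ):

```python
def ogdtc_loss(recon_err, centroid_dists, assignments,
               outcome_pred, outcome_true, alpha=0.1, beta=1.0):
    """Composite objective sketch: seq2seq reconstruction error
    + k-means term (distance of each patient representation to its assigned centroid)
    + outcome-regression MSE."""
    n = len(assignments)
    kmeans_term = sum(centroid_dists[i][assignments[i]] for i in range(n)) / n
    outcome_mse = sum((p - t) ** 2 for p, t in zip(outcome_pred, outcome_true)) / n
    return recon_err + alpha * kmeans_term + beta * outcome_mse
```

Minimizing the k-means term pulls representations toward cluster centroids while the outcome term keeps them predictive of clinical decline, which is what makes the resulting clusters outcome-guided.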

Results

We demonstrated the efficacy of our framework by applying it to three Alzheimer’s Disease (AD) clinical trials. Through the AD case study, we identified three distinct subtypes with unique patterns associated with differentiated clinical declines across multiple measures. The ablation study revealed the contributions of each component in the model and showed that jointly optimizing the full model improved patient representations for clustering. Extensive validations showed that the derived clustering is reproducible, stable, and significant.

Conclusion

Our temporal clustering framework can derive robust clustering applicable for subtyping longitudinal progressive diseases and has the potential to account for subtype variability in clinical outcomes.
Journal of Biomedical Informatics, Volume 158, Article 104732
FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs
IF 4 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2024-10-01 DOI: 10.1016/j.jbi.2024.104730
Yongkang Xiao , Sinian Zhang , Huixue Zhou , Mingchen Li , Han Yang , Rui Zhang

Objective

To develop the FuseLinker, a novel link prediction framework for biomedical knowledge graphs (BKGs), which fully exploits the graph’s structural, textual and domain knowledge information. We evaluated the utility of FuseLinker in the graph-based drug repurposing task through detailed case studies.

Methods

FuseLinker leverages fused pre-trained text embedding and domain knowledge embedding to enhance the graph neural network (GNN)-based link prediction model tailored for BKGs. This framework includes three parts: a) obtain text embeddings for BKGs using embedding-visible large language models (LLMs), b) learn the representations of medical ontology as domain knowledge information by employing the Poincaré graph embedding method, and c) fuse these embeddings and further learn the graph structure representations of BKGs by applying a GNN-based link prediction model. We evaluated FuseLinker against traditional knowledge graph embedding models and a conventional GNN-based link prediction model across four public BKG datasets. Additionally, we examined the impact of using different embedding-visible LLMs on FuseLinker’s performance. Finally, we investigated FuseLinker’s ability to generate medical hypotheses through two drug repurposing case studies for Sorafenib and Parkinson’s disease.
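The Poincaré embedding in part (b) places ontology concepts inside the unit ball, where distances grow rapidly near the boundary and thus naturally encode hierarchy. The distance function itself is standard:

```python
import math

def poincare_dist(u, v):
    """Distance between two points in the Poincaré ball (each with norm < 1)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1 + 2 * sq_diff / ((1 - nu) * (1 - nv)))
```

For instance, the origin and (0.5, 0) lie at distance acosh(5/3) = ln 3 ≈ 1.099, already more than twice their Euclidean distance.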

Results

By comparing FuseLinker with baseline models on four BKGs, our method demonstrates superior performance. The Mean Reciprocal Rank (MRR) and Area Under receiver operating characteristic Curve (AUROC) for KEGG50k, Hetionet, SuppKG and ADInt are 0.969 and 0.987, 0.548 and 0.903, 0.739 and 0.928, and 0.831 and 0.890, respectively.
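MRR, reported alongside AUROC above, averages the reciprocal of the rank at which the correct entity appears in each test triple's candidate list:

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct entity for each test triple."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```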

Conclusion

Our study demonstrates that FuseLinker is an effective novel link prediction framework that integrates multiple graph information and shows significant potential for practical applications in biomedical and clinical tasks. Source code and data are available at https://github.com/YKXia0/FuseLinker.
Journal of Biomedical Informatics, Volume 158, Article 104730