Journal of Biomedical Informatics: Latest Articles

BAMRE: Joint extraction model of Chinese medical entities and relations based on Biaffine transformation with relation attention
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-01; DOI: 10.1016/j.jbi.2024.104733
Jiaqi Sun, Chen Zhang, Linlin Xing, Longbo Zhang, Hongzhen Cai, Maozu Guo
Electronic Health Records (EHRs) contain various valuable medical entities and their relationships. Although the extraction of biomedical relationships has achieved good results in the mining of electronic health records and the construction of biomedical knowledge bases, there are still some problems. There may be implied complex associations between entities and relationships in overlapping triplets, and ignoring these interactions may lead to a decrease in the accuracy of entity extraction. To address this issue, a joint extraction model for medical entity relations based on a relation attention mechanism is proposed. The relation extraction module identifies candidate relationships within a sentence. The attention mechanism based on these relationships assigns weights to contextual words in the sentence that are associated with different relationships. Additionally, it extracts the subject and object entities. Under a specific relationship, entity vector representations are utilized to construct a global entity matching matrix based on Biaffine transformations. This matrix is designed to enhance the semantic dependencies and relational representations between entities, enabling triplet extraction. This allows the two subtasks of named entity recognition and relation extraction to be interrelated, fully utilizing contextual information within the sentence, and effectively addresses the issue of overlapping triplets.
Experiments on the CMeIE Chinese medical relation extraction dataset and the Baidu2019 Chinese dataset show that our approach achieves a higher F1 score than all state-of-the-art baselines. It also improves performance substantially in difficult cases involving diverse overlapping patterns, large numbers of triplets, and cross-sentence triplets.
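The entity matching step described in the abstract can be illustrated with a generic biaffine scorer. The sketch below is not the BAMRE implementation: it assumes the common biaffine form score(s, o) = s^T U o + w^T [s; o] + b, and every vector, weight, and function name in it is hypothetical.

```python
# Generic biaffine scorer in plain Python (illustrative only; all values are made up).

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def biaffine_score(s, o, U, w, b):
    """Score one (subject, object) pair as s^T U o + w^T [s; o] + b."""
    Uo = [dot(row, o) for row in U]   # matrix-vector product U @ o
    bilinear = dot(s, Uo)             # pairwise interaction term
    linear = dot(w, s + o)            # s + o concatenates the two lists
    return bilinear + linear + b

def matching_matrix(subjects, objects, U, w, b):
    """Score every subject/object pair under one relation."""
    return [[biaffine_score(s, o, U, w, b) for o in objects] for s in subjects]

# Toy 2-dimensional representations for two candidate subjects and objects.
subjects = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0, 0.0], [0.0, -1.0]]   # relation-specific bilinear weights (assumed)
w = [0.0, 0.0, 0.0, 0.0]        # linear weights over [s; o] (assumed)
b = 0.0

scores = matching_matrix(subjects, objects, U, w, b)
print(scores)
```

In a trained model, one such matrix would be produced per candidate relation, and pairs whose score passes a threshold would be emitted as triplets; the threshold and training procedure are outside this sketch.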
Citations: 0
Fairness and inclusion methods for biomedical informatics research
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-01; DOI: 10.1016/j.jbi.2024.104713
Shyam Visweswaran, Yuan Luo, Mor Peleg
Citations: 0
Cross-domain visual prompting with spatial proximity knowledge distillation for histological image classification
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-09-21; DOI: 10.1016/j.jbi.2024.104728
Xiaohong Li, Guoheng Huang, Lianglun Cheng, Guo Zhong, Weihuang Liu, Xuhang Chen, Muyan Cai

Objective:

Histological classification is a challenging task due to the diverse appearances, unpredictable variations, and blurry edges of histological tissues. Recently, many approaches based on large networks have achieved satisfactory performance. However, most of these methods rely heavily on substantial computational resources and large high-quality datasets, limiting their practical application. Knowledge Distillation (KD) offers a promising solution by enabling smaller networks to achieve performance comparable to that of larger networks. Nonetheless, KD is hindered by the problem of high-dimensional characteristics, which makes it difficult to capture tiny scattered features and often leads to the loss of edge feature relationships.

Methods:

A novel cross-domain visual prompting distillation approach is proposed, compelling the teacher network to facilitate the extraction of significant high-dimensional features into low-dimensional feature maps, thereby aiding the student network in achieving superior performance. Additionally, a dynamic learnable temperature module based on novel vector-based spatial proximity is introduced to further encourage the student to imitate the teacher.

Results:

Experiments on the widely used histological datasets NCT-CRC-HE-100K and LC25000 demonstrate the effectiveness of the proposed method, and experiments on the popular dermoscopic dataset ISIC-2019 validate its robustness. Compared to state-of-the-art knowledge distillation methods, the proposed method achieves better performance and greater robustness with optimal domain adaptation.

Conclusion:

A novel distillation architecture, termed VPSP, tailored for histological classification, is proposed. This architecture achieves superior performance with optimal domain adaptation, enhancing the clinical application of histological classification. The source code will be released at https://github.com/xiaohongji/VPSP.
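For readers unfamiliar with the role of temperature in knowledge distillation, the sketch below shows the standard fixed-temperature distillation loss (KL divergence between softened teacher and student distributions, scaled by T squared). It is not the VPSP method: the abstract describes a dynamic, learnable temperature driven by spatial proximity, which is not reproduced here, and the logits are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]

for T in (1.0, 4.0):
    print(T, kd_loss(student, teacher, T))
```

A learnable temperature would replace the constant T with a parameter updated during training; how VPSP computes it is specified only in the paper.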
Citations: 0
Advancing cancer driver gene identification through an integrative network and pathway approach
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-09-19; DOI: 10.1016/j.jbi.2024.104729
Junrong Song, Zhiming Song, Yuanli Gong, Lichang Ge, Wenlu Lou

Objective

Cancer is a complex genetic disease characterized by the accumulation of various mutations, with driver genes playing a crucial role in cancer initiation and progression. Distinguishing driver genes from passenger mutations is essential for understanding cancer biology and discovering therapeutic targets. However, most existing methods ignore the mutational heterogeneity and commonalities among patients, which hinders more effective identification of driver genes.

Methods

This study introduces MCSdriver, a novel computational model that integrates network and pathway information to prioritize the identification of cancer driver genes. MCSdriver employs a bidirectional random walk algorithm to quantify the mutual exclusivity and functional relationships between mutated genes within patient cohorts. It calculates similarity scores based on a mutual exclusivity-weighted network and pathway coverage patterns, accounting for patient-specific heterogeneity and molecular profile similarity.

Results

This approach enhances the accuracy and quality of driver gene identification. MCSdriver demonstrates superior performance in identifying cancer driver genes across four cancer types from The Cancer Genome Atlas, showing a higher F-score, Recall and Precision compared to existing ranking list-based and module-based models.

Conclusion

The MCSdriver model not only outperforms other models in identifying known cancer driver genes but also effectively identifies novel driver genes involved in cancer-related biological processes. The model’s consideration of patient-specific heterogeneity and similarity in molecular profiles significantly enhances the accuracy and quality of driver gene identification. Validation through Gene Ontology enrichment analysis and literature mining further underscores its potential application value in personalized cancer therapy, offering a promising tool for advancing our understanding and treatment of cancer.
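Network-based gene prioritization of the kind described above typically propagates scores over a weighted gene graph. The sketch below shows a plain random walk with restart, a simpler relative of the bidirectional random walk named in the abstract, and it is not the MCSdriver algorithm. The gene labels, edge weights, and restart probability are all hypothetical.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Propagate scores from seed nodes over a weighted undirected graph.

    adj maps each node to a dict of {neighbor: weight}. Every node is assumed
    to have at least one edge; isolated nodes would leak probability mass.
    """
    nodes = sorted(adj)
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}   # jump back to the seeds
        for u in nodes:
            for v, w in adj[u].items():             # spread the rest along edges
                nxt[v] += (1.0 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy chain graph geneA - geneB - geneC - geneD with unit weights (made up).
graph = {
    "geneA": {"geneB": 1.0},
    "geneB": {"geneA": 1.0, "geneC": 1.0},
    "geneC": {"geneB": 1.0, "geneD": 1.0},
    "geneD": {"geneC": 1.0},
}
scores = random_walk_with_restart(graph, seeds={"geneA"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph, genes closer to the seed receive higher scores. A mutual exclusivity-weighted network, as the abstract describes, would supply the edge weights instead of the unit values used here.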
Citations: 0
Proxy endpoints — bridging clinical trials and real world data
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-09-17; DOI: 10.1016/j.jbi.2024.104723
Maxim Kryukov, Kathleen P. Moriarty, Macarena Villamea, Ingrid O’Dwyer, Ohn Chow, Flavio Dormont, Ramon Hernandez, Ziv Bar-Joseph, Brandon Rufino

Objective:

Disease severity scores, or endpoints, are routinely measured during Randomized Controlled Trials (RCTs) to closely monitor the effect of treatment. In real-world clinical practice, although a larger set of patients is observed, the specific RCT endpoints are often not captured, which makes it hard to utilize real-world data (RWD) to evaluate drug efficacy in larger populations.

Methods:

To overcome this challenge, we developed an ensemble technique which learns proxy models of disease endpoints in RWD. Using a multi-stage learning framework applied to RCT data, we first identify features considered significant drivers of disease available within RWD. To create endpoint proxy models, we use Explainable Boosting Machines (EBMs) which allow for both end-user interpretability and modeling of non-linear relationships.

Results:

We demonstrate our approach on two diseases, rheumatoid arthritis (RA) and atopic dermatitis (AD). As we show, our combined feature selection and prediction method achieves good results for both disease areas, improving upon prior methods proposed for predictive disease severity scoring.

Conclusion:

Having disease severity over time for a patient is important to further disease understanding and management. Our results open the door to more use cases in the space of RA and AD such as treatment effect estimates or prognostic scoring on RWD. Our framework may be extended beyond RA and AD to other diseases where the severity score is not well measured in electronic health records.

Citations: 0
Health text simplification: An annotated corpus for digestive cancer education and novel strategies for reinforcement learning
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-09-16; DOI: 10.1016/j.jbi.2024.104727
Md Mushfiqur Rahman, Mohammad Sabik Irbaz, Kai North, Michelle S. Williams, Marcos Zampieri, Kevin Lybarger

Objective:

The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass widely accepted standards for reading level and complexity. There is a critical need for high-performing text simplification models for health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality.

Methods:

We introduce Simplified Digestive Cancer (SimpleDC), a parallel corpus of cancer education materials tailored for health text simplification research, comprising educational content from the American Cancer Society, Centers for Disease Control and Prevention, and National Cancer Institute. The corpus includes 31 web pages with the corresponding manually simplified versions. It consists of 1183 annotated sentence pairs (361 train, 294 development, and 528 test). Utilizing SimpleDC and the existing Med-EASi corpus, we explore Large Language Model (LLM)-based simplification methods, including fine-tuning, reinforcement learning (RL), reinforcement learning with human feedback (RLHF), domain adaptation, and prompt-based approaches. Our experimentation encompasses Llama 2, Llama 3, and GPT-4. We introduce a novel RLHF reward function featuring a lightweight model adept at distinguishing between original and simplified texts, which enables training on unlabeled data.

Results:

Fine-tuned Llama models demonstrated high performance across various metrics. Our RLHF reward function outperformed existing RL text simplification reward functions. The results underscore that RL/RLHF can achieve performance comparable to fine-tuning and improve the performance of fine-tuned models. Additionally, these methods effectively adapt out-of-domain text simplification models to a target domain. The best-performing RL-enhanced Llama models outperformed GPT-4 in both automatic metrics and manual evaluation by subject matter experts.

Conclusion:

The newly developed SimpleDC corpus will serve as a valuable asset to the research community, particularly in patient education simplification. The RL/RLHF methodologies presented herein enable effective training of simplification models on unlabeled text and the utilization of out-of-domain simplification corpora.
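The idea of a classifier-based reward can be sketched as a function that maps a candidate text to the probability that it reads as "simplified", so that no reference simplification is needed at training time. The code below is a toy stand-in, not the paper's reward model: the two surface features, the weights, and the example sentences are all invented for illustration, whereas the paper's lightweight model is learned from data.

```python
import math

def text_features(text):
    """Two crude surface features: average word length and word count."""
    words = text.split()
    if not words:
        return [0.0, 0.0]
    avg_word_len = sum(len(w.strip(".,;:!?")) for w in words) / len(words)
    return [avg_word_len, float(len(words))]

# Hypothetical logistic-regression parameters; a real reward model would be
# fitted to distinguish original from simplified sentences.
WEIGHTS = [-0.8, -0.05]
BIAS = 5.0

def simplicity_reward(text):
    """Return a value in (0, 1): the stand-in probability that text is simplified."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, text_features(text)))
    return 1.0 / (1.0 + math.exp(-z))

# Invented example sentences, not taken from SimpleDC.
original = "Colonoscopy facilitates identification of precancerous colorectal polyps."
simplified = "A colonoscopy helps doctors find growths before they turn into cancer."

print(simplicity_reward(original), simplicity_reward(simplified))
```

In an RL setup, a score like this would be computed for each generated output and used as (part of) the reward; in practice it is usually combined with a meaning-preservation term so the policy cannot score well by discarding content.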

Citations: 0
Modeling engagement with a digital behavior change intervention (HeartSteps II): An exploratory system identification approach
IF 4; Zone 2 (Medicine); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-09-13; DOI: 10.1016/j.jbi.2024.104721
Steven A. De La Torre, Mohamed El Mistiri, Eric Hekler, Predrag Klasnja, Benjamin Marlin, Misha Pavel, Donna Spruijt-Metz, Daniel E. Rivera

Objective

Digital behavior change interventions (DBCIs) are feasibly effective tools for addressing physical activity. However, in-depth understanding of participants’ long-term engagement with DBCIs remains sparse. Since the effectiveness of DBCIs to impact behavior change depends, in part, upon participant engagement, there is a need to better understand engagement as a dynamic process in response to an individual’s ever-changing biological, psychological, social, and environmental context.

Methods

The year-long micro-randomized trial (MRT) HeartSteps II provides an unprecedented opportunity to investigate DBCI engagement among ethnically diverse participants. We combined data streams from wearable sensors (Fitbit Versa, i.e., walking behavior), the HeartSteps II app (i.e., page views), and ecological momentary assessments (EMAs, i.e., perceived intrinsic and extrinsic motivation) to build the idiographic models. A system identification approach and a fluid analogy model were used to conduct autoregressive with exogenous input (ARX) analyses that tested hypothesized relationships, inspired by Self-Determination Theory (SDT), between these variables and DBCI engagement over time.

Results

Data from 11 HeartSteps II participants were used to test aspects of the hypothesized SDT dynamic model. The average age was 46.33 years (SD=7.4), and the average baseline step count was 5,507 steps per day (SD=6,239). The hypothesized 5-input SDT-inspired ARX model for app engagement resulted in a 31.75% weighted RMSEA (31.50% on validation and 31.91% on estimation), indicating that the model predicted app page views almost 32% better relative to the mean of the data. Among Hispanic/Latino participants, the average overall model fit across inventories of the SDT fluid analogy was 34.22% (SD=10.53) compared to 22.39% (SD=6.36) among non-Hispanic/Latino Whites, a difference of 11.83 percentage points. Across individuals, the number of daily notification prompts received by the participant was positively associated with increased app page views. The weekend/weekday indicator and perceived daily busyness were also found to be key predictors of the number of daily application page views.
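
A fit expressed as "percent better relative to the mean of the data" can be illustrated with the normalized root-mean-square error fit that is common in system identification. The sketch below assumes that definition; the abstract does not state the exact formula or weighting the authors used, and the function name is hypothetical.

```python
import math


def nrmse_fit_percent(y_true, y_pred):
    """Return 100 * (1 - ||y - y_hat|| / ||y - mean(y)||).

    100 means a perfect prediction; 0 means no better than always
    predicting the mean of the observed data.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("series must have the same length")
    mean_y = sum(y_true) / len(y_true)
    err = math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)))
    base = math.sqrt(sum((yt - mean_y) ** 2 for yt in y_true))
    if base == 0:
        raise ValueError("y_true is constant; the fit is undefined")
    return 100.0 * (1.0 - err / base)
```

Under this assumed definition, a value near 32 would mean the model's prediction error is roughly 32% smaller than the error of a constant predictor at the data mean.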

Conclusions

This novel approach has significant implications for both personalized and adaptive DBCIs because it identifies factors that foster or undermine engagement in each individual’s context. Once identified, these factors can be used to tailor interventions that promote engagement and support sustained behavior change over time.

Citations: 0
Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
IF 4 Zone 2 Medicine Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-09-12 DOI: 10.1016/j.jbi.2024.104724
Magdalena Wysocka , Oskar Wysocki , Maxime Delmas , Vincent Mutel , André Freitas

Objective:

The paper introduces a framework for the evaluation of the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially define a step change in biomedical discovery, reducing the barriers for accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery.

Methods:

The framework involves three evaluation steps, each assessing different aspects sequentially: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks: chemical compound definition generation and chemical compound–fungus relation determination.

Results:

Although recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted.

Conclusion:

While LLMs are currently not fit for use as biomedical factual knowledge bases in a zero-shot setting, factuality shows promise as an emerging property as the models become domain specialised and scale up in size and in the level of human feedback.

Citations: 0
Community knowledge graph abstraction for enhanced link prediction: A study on PubMed knowledge graph
IF 4 Zone 2 Medicine Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-09-10 DOI: 10.1016/j.jbi.2024.104725
Yang Zhao , Danushka Bollegala , Shunsuke Hirose , Yingzi Jin , Tomotake Kozu

Objective:

As new knowledge is produced at a rapid pace in the biomedical field, existing biomedical Knowledge Graphs (KGs) cannot be manually updated in a timely manner. Previous work in Natural Language Processing (NLP) has leveraged link prediction to infer the missing knowledge in general-purpose KGs. Inspired by this, we propose to apply link prediction to existing biomedical KGs to infer missing knowledge. Although Knowledge Graph Embedding (KGE) methods are effective in link prediction tasks, they are less capable of capturing relations between communities of entities with specific attributes (Fanourakis et al., 2023).

Methods:

To address this challenge, we proposed an entity distance-based method for abstracting a Community Knowledge Graph (CKG) from a simplified version of the pre-existing PubMed Knowledge Graph (PKG) (Xu et al., 2020). For link prediction on the abstracted CKG, we proposed an extension of existing KGE models that links the information in the PKG to the abstracted CKG. The applicability of this extension was demonstrated with six well-known KGE models: TransE, TransH, DistMult, ComplEx, SimplE, and RotatE. Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@k were used to assess link prediction performance. In addition, we presented a backtracking process that traces the results of CKG link prediction back to the PKG scale for further comparison.
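
The three evaluation metrics named above follow standard definitions, sketched below. The function name and the input format (a list holding the 1-based rank of the correct entity for each test triple) are illustrative assumptions, not the authors' evaluation code.

```python
def ranking_metrics(ranks, k=10):
    """Compute MR, MRR and Hits@k from 1-based ranks of the correct entity."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,                       # lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,     # higher is better
        "Hits@k": sum(1 for r in ranks if r <= k) / n,  # share ranked in top k
    }
```

With `k=10`, the `Hits@k` value corresponds to a top-10 accuracy: the fraction of test triples whose correct entity appears among the ten highest-scored candidates.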

Results:

Six different CKGs were abstracted from the PKG by using embeddings of the six KGE methods. The results of link prediction in these abstracted CKGs indicate that our proposed extension can improve the existing KGE methods, achieving a top-10 accuracy of 0.69 compared to 0.5 for TransE, 0.7 compared to 0.54 for TransH, 0.67 compared to 0.6 for DistMult, 0.73 compared to 0.57 for ComplEx, 0.73 compared to 0.63 for SimplE, and 0.85 compared to 0.76 for RotatE on their CKGs, respectively. These improved performances also highlight the wide applicability of the extension approach.

Conclusion:

This study offered novel insights into abstracting CKGs from the PKG. The extension approach improved the performance of existing KGE methods and proved broadly applicable. As an interesting future extension, we plan to conduct link prediction for entities that are newly introduced to the PKG.

Citations: 0
Extracting lung cancer staging descriptors from pathology reports: A generative language model approach
IF 4 Zone 2 Medicine Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-09-01 DOI: 10.1016/j.jbi.2024.104720
Hyeongmin Cho , Sooyoung Yoo , Borham Kim , Sowon Jang , Leonard Sunwoo , Sanghwan Kim , Donghyoung Lee , Seok Kim , Sejin Nam , Jin-Haeng Chung

Background

In oncology, electronic health records contain key textual information for the diagnosis, staging, and treatment planning of patients with cancer. However, processing text data requires considerable time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. In particular, extracting the information required for pathological staging from surgical pathology reports makes it possible to update cancer staging according to the latest cancer staging guidelines.

Objectives

This study has two main objectives. The first objective is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second objective is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment.

Methods

Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard validated by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs, which were then used to extract the information required for staging from pathology reports.
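
As an illustration of how a report and its gold-standard annotation might be turned into a prompt-response pair, here is a minimal sketch. The prompt wording, the JSON response format, and the descriptor key `tumor_size_cm` are hypothetical; the abstract does not describe the authors' actual template or descriptor names.

```python
import json


def make_pair(report_text, gold_descriptors):
    """Build one training example from a report and its gold descriptors.

    report_text: raw pathology report text.
    gold_descriptors: dict mapping descriptor name to its validated value.
    """
    prompt = (
        "Extract the staging descriptors from the pathology report below "
        "and answer in JSON.\n\nReport:\n" + report_text
    )
    # Sorted keys keep the target text stable across examples.
    response = json.dumps(gold_descriptors, sort_keys=True, ensure_ascii=False)
    return {"prompt": prompt, "response": response}
```

Serializing the gold standard in a fixed key order is one design choice that makes string-level comparison of model outputs against the gold standard more reliable.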

Results

We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best performance overall, with an exact match ratio of 92.24% in the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification.
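
An exact match ratio can be sketched as follows, assuming it counts a report as correct only when every extracted descriptor equals the gold standard. The abstract does not say whether matching was done per report or per descriptor, so this granularity and the function name are assumptions.

```python
def exact_match_ratio(predicted, gold):
    """Fraction of reports whose predicted descriptors equal the gold ones.

    predicted, gold: equally long lists of dicts (descriptor name -> value).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one report is required")
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)
```

A per-report criterion like this is strict: a single wrong descriptor makes the whole report count as a miss.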

Conclusion

This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and that GLMs with a relatively small number of parameters (approximately seven billion) can achieve high performance on this problem. The proposed GLM-based information extraction method is expected to be useful in clinical decision-making support, lung cancer staging, and research.

Citations: 0