A study on large-scale disease causality discovery from biomedical literature.

IF 3.8; Zone 3 (Medicine); JCR Q2, Medical Informatics; BMC Medical Informatics and Decision Making; Pub Date: 2025-03-18; DOI: 10.1186/s12911-025-02893-0

Shirui Yu, Peng Dong, Junlian Li, Xiaoli Tang, Xiaoying Li

BMC Medical Informatics and Decision Making, vol. 25(1), 136. Journal Article. Published 2025-03-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11916938/pdf/

Citations: 0

Abstract

Background: Biomedical semantic relationship extraction can reveal important biomedical entities and the semantic relationships between them, providing a crucial foundation for biomedical knowledge discovery, clinical decision making, and other artificial intelligence applications. Identifying causal relationships between diseases is a significant research field, since it expedites the identification of underlying disease pathogenesis mechanisms and promotes better disease prevention and treatment. SemRep is an effective tool for semantic relationship extraction in the biomedical field, but it is not accurate enough for disease causality extraction, which poses challenges for downstream tasks. In this study, we propose an optimization strategy for SemRep to enhance its accuracy in disease causality extraction.

Methods: This study aims to optimize the disease causality extraction of the SemRep tool by constructing a semantic predicate vocabulary that precisely expresses disease causality, in order to support the automatic extraction of disease causality knowledge from biomedical literature. The proposed method involves four steps. First, we obtained a collection of semantic feature words expressing disease causality, based on current causality predicate studies and the disease causality pairs extracted from SemMedDB. Second, we constructed a disease causality semantic predicate vocabulary by filtering and evaluating the clue words using quantitative comparisons. Third, we extracted disease causality pairs from the biomedical literature using 36 semantic predicates with an accuracy greater than 80% for more meaningful knowledge discovery. Finally, we conducted knowledge discovery based on the extracted disease causality triples, which primarily include unidirectional disease causality, bidirectional disease causality, and two specific types of disease causality: primary disease causality and rare disease causality.
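The second and third steps of the Methods (scoring candidate predicates, then keeping those above an accuracy cutoff) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the manual-judgement input format, and the toy predicates are all assumptions made for illustration, and the abstract does not say how accuracy was measured.

```python
from collections import defaultdict

def predicate_accuracy(judgements):
    """Compute per-predicate accuracy.

    `judgements` is a hypothetical input format: an iterable of
    (predicate, is_correct) pairs, one per manually reviewed extraction.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for predicate, is_correct in judgements:
        totals[predicate] += 1
        if is_correct:
            correct[predicate] += 1
    return {p: correct[p] / totals[p] for p in totals}

def select_predicates(accuracy, threshold):
    """Keep predicates whose accuracy is strictly greater than `threshold`."""
    return sorted(p for p, acc in accuracy.items() if acc > threshold)

# Toy data, invented for illustration (not taken from the paper).
judgements = [
    ("cause", True), ("cause", True), ("cause", True), ("cause", False),
    ("lead to", True), ("lead to", True),
    ("associated with", True), ("associated with", False),
    ("associated with", False),
]

accuracy = predicate_accuracy(judgements)
strict = select_predicates(accuracy, 0.8)   # favors precision
lenient = select_predicates(accuracy, 0.4)  # favors coverage
print(strict, lenient)
```

Raising the threshold keeps fewer predicates and extracts fewer but more reliable pairs; lowering it does the opposite. This is the accuracy versus comprehensiveness trade-off that a quantified vocabulary makes adjustable.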

Results: We obtained a disease causality semantic predicate vocabulary containing 50 textual predicates with an accuracy above 40%. The 36 semantic predicates from the 60% accuracy group were used for disease causality extraction, yielding 259,434 disease causality pairs for subsequent knowledge discovery. Among them, we found 92,557 types comprising 176,010 unidirectional disease causality triples and 6,084 types comprising 83,424 bidirectional disease causality triples. Two other types of disease causality, primary disease causality and rare disease causality, were also discovered.
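The unidirectional and bidirectional counts reported in the Results sum to the total number of extracted pairs (176,010 + 83,424 = 259,434), which suggests each extracted triple falls into exactly one of the two groups. The sketch below shows one way such a split could be computed. It rests on assumptions not stated in the abstract: that a "type" is a distinct disease pair, that a pair is bidirectional when both directions occur in the extracted data, and that a bidirectional type is counted once regardless of direction. The disease names are toy data.

```python
from collections import Counter

def split_by_direction(triples):
    """Split (cause, effect) mentions into unidirectional and bidirectional.

    Returns two Counters mapping each type to its number of mentions.
    Unidirectional types are ordered (cause, effect) tuples; bidirectional
    types are unordered frozensets, so A->B and B->A share one type
    (an assumption of this sketch).
    """
    counts = Counter(triples)
    uni, bi = Counter(), Counter()
    for (cause, effect), n in counts.items():
        if cause != effect and (effect, cause) in counts:
            bi[frozenset((cause, effect))] += n
        else:
            uni[(cause, effect)] += n
    return uni, bi

# Toy extracted mentions, invented for illustration.
triples = [
    ("diabetes", "retinopathy"),
    ("diabetes", "retinopathy"),
    ("hypertension", "stroke"),
    ("depression", "insomnia"),
    ("insomnia", "depression"),
    ("insomnia", "depression"),
]

uni, bi = split_by_direction(triples)
print(len(uni), sum(uni.values()))  # unidirectional types, triples
print(len(bi), sum(bi.values()))    # bidirectional types, triples
```

In this sketch the two groups always partition the input, so their triple counts add up to the total, mirroring the arithmetic in the reported figures.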

Conclusions: The novelty of this research is that the proposed method enhances the disease causality extraction of the SemRep tool, resulting in more accurate and comprehensive disease causality extraction. It also facilitates automatic disease causality extraction from large-scale biomedical literature. Additionally, the quantified causality predicate vocabulary makes it possible to customize extraction for accuracy and comprehensiveness, allowing flexible extraction of disease causality according to actual circumstances.




Source journal: BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20

Self-citation rate: 5.70%

Articles published: 297

Review time: 1 month

Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.