
Latest publications from the Journal of Biomedical Semantics

Enriched knowledge representation in biological fields: a case study of literature-based discovery in Alzheimer's disease.
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2025-03-20 · DOI: 10.1186/s13326-025-00328-3
Yiyuan Pu, Daniel Beck, Karin Verspoor

Background: In Literature-based Discovery (LBD), Swanson's original ABC model brought together isolated public knowledge statements and assembled them to infer putative hypotheses via logical connections. Modern LBD studies that scale up this approach through automation typically rely on a simple entity-based knowledge graph with co-occurrences and/or semantic triples as basic building blocks. However, our analysis of a knowledge graph constructed for a recent LBD system reveals limitations arising from such pairwise representations, which in turn negatively impact knowledge inference. Using LBD as the context and motivation for this work, we explore the limitations of using only pairwise relationships as the knowledge representation in knowledge graphs, and we identify the impacts of these limitations on knowledge inference. We argue that enhanced knowledge representation benefits biological knowledge representation in general, as well as both the quality and the specificity of hypotheses proposed with LBD.

Results: Based on a systematic analysis of one co-occurrence-based LBD system focusing on Alzheimer's disease, we identify 7 types of limitations arising from the exclusive use of pairwise relationships in a standard knowledge graph (including the need to capture more than two entities interacting together in a single event) and 3 types of negative impacts on knowledge inferred with the graph: experimentally infeasible hypotheses, literature-inconsistent hypotheses, and oversimplified hypothesis explanations. We also present an indicative distribution of the different types of relationships. Pairwise relationships are an essential component of representation frameworks for knowledge discovery. However, only 20% of discoveries are perfectly represented with pairwise relationships alone; 73% require a combination of pairwise and nested relationships, and the remaining 7% require pairwise relationships, nested relationships, and hypergraphs.
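The contrast between pairwise edges and hypergraph-style events can be sketched as a toy data structure. This is a minimal illustration, not the paper's code; the entity and relation names (e.g. `drug_X`, `modulates_interaction`) are hypothetical:

```python
# Pairwise representation: each fact is a (subject, relation, object) triple.
pairwise_edges = [
    ("amyloid_beta", "interacts_with", "tau"),
    ("amyloid_beta", "associated_with", "alzheimers_disease"),
]

# Hypergraph-style representation: one event can link more than two entities
# at once, e.g. a (hypothetical) drug modulating a protein-protein interaction,
# which no single triple can express without losing the collective context.
hyperedges = [
    {"event": "modulates_interaction",
     "participants": ["drug_X", "amyloid_beta", "tau"]},
]

def entities_in(edges, hedges):
    """Collect all entities mentioned across both representations."""
    ents = {e for s, _, o in edges for e in (s, o)}
    for h in hedges:
        ents.update(h["participants"])
    return ents
```

The point of the sketch is that the three-way event is a single unit in the hyperedge list, whereas flattening it into triples would force a choice of which pair to keep.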

Conclusion: We argue that the standard entity pair-based knowledge graph, while essential for representing basic binary relations, results in important limitations for comprehensive biological knowledge representation and impacts downstream tasks such as proposing meaningful discoveries in LBD. These limitations can be mitigated by integrating more semantically complex knowledge representation strategies, including capturing collective interactions and allowing for nested entities. The use of more sophisticated knowledge representation will benefit biological fields with more expressive knowledge graphs. Downstream tasks, such as LBD, can benefit from richer representations as well, allowing for the generation of implicit knowledge discoveries and explanations for disease diagnosis, treatment, and mechanisms that are more biologically meaningful.

Citations: 0
Gene expression knowledge graph for patient representation and diabetes prediction.
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2025-03-08 · DOI: 10.1186/s13326-025-00325-6
Rita T Sousa, Heiko Paulheim

Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of gene expression data. While gene expression data can provide valuable insights, challenges arise from the fact that the number of patients in expression datasets is usually limited, and data from different datasets with different gene expressions cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs, a unique tool for biomedical data integration, and by learning uniform patient representations for subjects contained in different, incompatible datasets. Different strategies and KG embedding methods are explored to generate vector representations, serving as inputs for a classifier. Extensive experiments demonstrate the efficacy of our approach, revealing weighted F1-score improvements in diabetes prediction of up to 13% when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.
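The core idea of uniform patient representations can be sketched in a few lines: if every gene has a KG embedding, a patient measured on any gene panel can be mapped into the same vector space, e.g. by averaging the embeddings of their genes. This is an illustrative simplification under assumed toy numbers, not the paper's method or data:

```python
# Hypothetical 2-dimensional embeddings, as if produced by a KG embedding
# method; real embeddings would be learned and much higher-dimensional.
gene_embeddings = {
    "TP53": [0.9, 0.1],
    "INS":  [0.2, 0.8],
    "GCK":  [0.3, 0.7],
}

# Two patients from datasets covering different gene panels.
patients = {
    "patient_A": ["TP53", "INS"],
    "patient_B": ["INS", "GCK"],
}

def patient_vector(genes):
    """Average the embeddings of a patient's genes into one fixed-size vector."""
    vecs = [gene_embeddings[g] for g in genes]
    n = len(vecs)
    return [sum(dim) / n for dim in zip(*vecs)]

# Both patients land in the same space despite incompatible source datasets,
# so one classifier can consume both.
reps = {p: patient_vector(gs) for p, gs in patients.items()}
```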

Citations: 0
Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features.
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2025-01-08 · DOI: 10.1186/s13326-024-00322-1
Shuya Ikeda, Kiyoko F Aoki-Kinoshita, Hirokazu Chiba, Susumu Goto, Masae Hosoda, Shuichi Kawashima, Jin-Dong Kim, Yuki Moriya, Tazro Ohta, Hiromasa Ono, Terue Takatsuki, Yasunori Yamamoto, Toshiaki Katayama

Background: TogoID ( https://togoid.dbcls.jp/ ) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to obtain IDs related through different semantic relationships, a user-friendly web interface, and a regular automatic data-update system, TogoID has been a valuable tool for bioinformatics.

Results: We have recently expanded TogoID's ability to represent semantics between datasets, enabling it to handle multiple semantic relationships within dataset pairs. This enhancement enables TogoID to distinguish relationships such as "glycans bind to proteins" or "glycans are processed by proteins" between glycans and proteins. Additional new features include the ability to display labels corresponding to database IDs, making it easier to interpret the relationships between the various IDs available in TogoID, and the ability to convert labels to IDs, extending the entry point for ID conversion. URL parameters that reproduce the state of TogoID's web application allow users to share complex search results through a simple URL.
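The "multi-semantic" idea above amounts to keying the conversion table by relation type as well as by dataset pair, so one source ID can map to different targets under different semantics. The sketch below is a generic illustration with made-up IDs and relation names; it is not TogoID's actual data model or API:

```python
# Conversion table keyed by (source dataset, target dataset, relation).
# Under a single-semantic design these two entries would collapse into one
# glycan->protein mapping and the distinction would be lost.
conversion = {
    ("glycan", "protein", "binds_to"):     {"G00001": ["P12345"]},
    ("glycan", "protein", "processed_by"): {"G00001": ["P67890"]},
}

def convert(source_id, source_ds, target_ds, relation):
    """Return target IDs for source_id under the requested semantic relation."""
    table = conversion.get((source_ds, target_ds, relation), {})
    return table.get(source_id, [])
```

The same glycan ID resolves to different proteins depending on whether the caller asks for binding partners or processing enzymes.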

Conclusions: These advancements improve TogoID's utility in bioinformatics, allowing researchers to explore complex ID relationships. By introducing the tool's multi-semantic and label features, TogoID expands the concept of ID conversion and supports more comprehensive and efficient data integration across life science databases.

Citations: 0
FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis.
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-12-28 · DOI: 10.1186/s13326-024-00321-2
Xiaofeng Liao, Thomas H A Ederveen, Anna Niehues, Casper de Visser, Junda Huang, Firdaws Badmus, Cenna Doornbos, Yuliia Orlova, Purva Kulkarni, K Joeri van der Velde, Morris A Swertz, Martin Brandt, Alain J van Gool, Peter A C 't Hoen

Motivation: We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data is kept in secure and protected silos. It therefore remains a challenge to re-use these data without infringing the privacy of the individuals from whom the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.

Results: The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles and helps researchers to create FAIR data and metadata, to facilitate re-use of their data, and to make their data analysis workflows transparent, while ensuring data security and privacy.

Citations: 0
Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI).
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-10-17 · DOI: 10.1186/s13326-024-00320-3
Sabrina Toro, Anna V Anagnostopoulos, Susan M Bello, Kai Blumberg, Rhiannon Cameron, Leigh Carmody, Alexander D Diehl, Damion M Dooley, William D Duncan, Petra Fey, Pascale Gaudet, Nomi L Harris, Marcin P Joachimiak, Leila Kiani, Tiago Lubiana, Monica C Munoz-Torres, Shawn O'Neil, David Osumi-Sutherland, Aleix Puig-Barbe, Justin T Reese, Leonore Reiser, Sofia Mc Robb, Troy Ruemping, James Seager, Eric Sid, Ray Stefancsik, Magalie Weber, Valerie Wood, Melissa A Haendel, Christopher J Mungall

Background: Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate substantial collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources.
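The retrieval-augmented pattern described above can be sketched generically: retrieve the existing ontology terms most similar to a new term's label, then pack them into a prompt as in-context examples for an LLM. This is a deliberately minimal stand-in, not DRAGON-AI's implementation; the term labels, the token-overlap scoring, and the prompt format are all illustrative assumptions:

```python
# Toy "ontology" of existing terms and definitions (hypothetical examples).
existing_terms = {
    "neuron apoptosis": "The programmed cell death of a neuron.",
    "hepatocyte apoptosis": "The programmed cell death of a hepatocyte.",
    "cell migration": "The movement of a cell from one site to another.",
}

def retrieve(query, k=2):
    """Rank existing terms by shared tokens with the query label (a crude
    stand-in for embedding-based retrieval) and return the top k."""
    q = set(query.lower().split())
    scored = sorted(
        existing_terms.items(),
        key=lambda kv: len(q & set(kv[0].split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(new_label):
    """Assemble an LLM prompt from the retrieved nearest terms."""
    examples = "\n".join(f"{t}: {d}" for t, d in retrieve(new_label))
    return f"Write a definition for '{new_label}' in the style of:\n{examples}"

prompt = build_prompt("microglia apoptosis")
```

In a real pipeline the retrieval step would use vector similarity over term embeddings and the prompt would go to an LLM; here the output is just the assembled prompt string.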

Results: We assessed the performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of results. Our method achieves high precision for relationship generation, though slightly lower than that of logic-based reasoning. Our method is also able to generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues.

Conclusions: These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.

Citations: 0
MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed.
IF 2.0 · CAS Zone 3 (Engineering & Technology) · Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY · Pub Date: 2024-10-02 · DOI: 10.1186/s13326-024-00319-w
Houcemeddine Turki, Bonaventure F P Dossou, Chris Chinenye Emezue, Abraham Toluwase Owodunni, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Hanen Ben Hassen, Afif Masmoudi

Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques to the raw texts of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. The use of the distinctive characteristics of bibliographic metadata can prove effective in achieving better performance for this challenging task. In this research paper, we introduce an approach for biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First, we introduce MeSH2Matrix, our dataset consisting of 46,469 biomedical relations curated from PubMed publications using our approach. Our dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords. It also specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach for biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide a confusion matrix and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. Our results will hopefully shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility purposes, MeSH2Matrix, as well as all our source codes, are made publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix .
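The qualifier-association matrix described above can be illustrated with a toy construction: count how often each subject-keyword qualifier co-occurs with each object-keyword qualifier across papers, yielding the kind of feature matrix a relation classifier can consume. The qualifiers and counts below are hypothetical, not drawn from MeSH2Matrix itself:

```python
# Each "paper" contributes the qualifiers attached to its subject and object
# MeSH keywords (hypothetical examples).
papers = [
    (["drug effects"], ["metabolism"]),
    (["drug effects"], ["metabolism"]),
    (["genetics"],     ["pathology"]),
]

# Fixed row/column orderings over the observed qualifiers.
subj_quals = sorted({q for subj, _ in papers for q in subj})
obj_quals = sorted({q for _, obj in papers for q in obj})

# Co-occurrence counts: rows = subject qualifiers, columns = object qualifiers.
matrix = [[0] * len(obj_quals) for _ in subj_quals]
for subj, obj in papers:
    for sq in subj:
        for oq in obj:
            matrix[subj_quals.index(sq)][obj_quals.index(oq)] += 1
```

Flattened, such a matrix becomes one feature vector per relation instance, which is the form a model like an SVM expects.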

引用次数: 0
Annotation of epilepsy clinic letters for natural language processing
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-09-15 DOI: 10.1186/s13326-024-00316-z
Beata Fonferko-Shadrach, Huw Strafford, Carys Jones, Russell A. Khan, Sharon Brown, Jenny Edwards, Jonathan Hawken, Luke E. Shrimpton, Catharine P. White, Robert Powell, Inder M. S. Sawhney, William O. Pickrell, Arron S. Lacey
Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this, and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool Markup with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per-item F1 score of 0.73. Validating ExECTv2 against the gold standard gave an overall F1 score of 0.87 per item and 0.90 per letter. The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used by NLP researchers with minimal epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard agreed by researcher consensus. ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters more accurately than human annotators, further confirming the utility of NLP for clinical and research applications.
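The per-item F1 used above as the inter-annotator agreement measure can be sketched as follows, treating one annotator's set as predictions and the other's as reference (the span tuples and concept labels are hypothetical):

```python
def f1_agreement(annotations_a, annotations_b):
    """Per-item F1 between two annotators; an item counts as a match
    only when its character span and concept label agree exactly."""
    a, b = set(annotations_a), set(annotations_b)
    if not a or not b:
        return 0.0
    tp = len(a & b)                      # items both annotators marked
    precision = tp / len(a)
    recall = tp / len(b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (start, end, concept) tuples over one letter.
ann1 = {(0, 7, "seizure"), (12, 25, "carbamazepine"), (30, 41, "focal onset")}
ann2 = {(0, 7, "seizure"), (12, 25, "carbamazepine"), (50, 60, "aura")}
score = f1_agreement(ann1, ann2)  # 2 matches out of 3 each -> F1 = 2/3
```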
An extensible and unifying approach to retrospective clinical data modeling: the BrainTeaser Ontology.
IF 2 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-08-30 DOI: 10.1186/s13326-024-00317-y
Guglielmo Faggioli, Laura Menotti, Stefano Marchesin, Adriano Chió, Arianna Dagliati, Mamede de Carvalho, Marta Gromicho, Umberto Manera, Eleonora Tavazzi, Giorgio Maria Di Nunzio, Gianmaria Silvello, Nicola Ferro

Automatic disease progression prediction models require large amounts of training data, which are seldom available, especially when it comes to rare diseases. A possible solution is to integrate data from different medical centres. Nevertheless, various centres often follow diverse data collection procedures and assign different semantics to collected data. Ontologies, used as schemas for interoperable knowledge bases, represent a state-of-the-art solution to homologate the semantics and foster data integration from various sources. This work presents the BrainTeaser Ontology (BTO), an ontology that models the clinical data associated with two brain-related rare diseases (ALS and MS) in a comprehensive and modular manner. BTO assists in organizing and standardizing the data collected during patient follow-up. It was created by harmonizing schemas currently used by multiple medical centers into a common ontology, following a bottom-up approach. As a result, BTO effectively addresses the practical data collection needs of various real-world situations and promotes data portability and interoperability. BTO captures various clinical occurrences, such as disease onset, symptoms, diagnostic and therapeutic procedures, and relapses, using an event-based approach. Developed in collaboration with medical partners and domain experts, BTO offers a holistic view of ALS and MS for supporting the representation of retrospective and prospective data. Furthermore, BTO adheres to Open Science and FAIR (Findable, Accessible, Interoperable, and Reusable) principles, making it a reliable framework for developing predictive tools to aid in medical decision-making and patient care. Although BTO is designed for ALS and MS, its modular structure makes it easily extendable to other brain-related diseases, showcasing its potential for broader applicability.Database URL  https://zenodo.org/records/7886998 .
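BTO captures clinical occurrences such as onset, procedures, and relapses using an event-based approach. A minimal sketch of that event-based view of a patient's follow-up (the class and field names are illustrative, not BTO's actual terms):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ClinicalEvent:
    """One clinical occurrence (onset, relapse, diagnostic procedure,
    ...) tied to a patient; field names are illustrative only."""
    patient_id: str
    event_type: str          # e.g. "onset", "relapse", "diagnostic_procedure"
    occurred_on: date
    attributes: dict = field(default_factory=dict)

# A hypothetical ALS follow-up timeline built from two events.
timeline = [
    ClinicalEvent("pt-001", "onset", date(2019, 3, 2), {"site": "bulbar"}),
    ClinicalEvent("pt-001", "diagnostic_procedure", date(2019, 5, 10),
                  {"name": "EMG"}),
]
relapses = [e for e in timeline if e.event_type == "relapse"]
```

Storing each occurrence as a discrete, typed event is what lets the same schema cover both retrospective records and prospective data collection.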

Concretizing plan specifications as realizables within the OBO foundry.
IF 2 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-08-20 DOI: 10.1186/s13326-024-00315-0
William D Duncan, Matthew Diller, Damion Dooley, William R Hogan, John Beverley

Background: Within the Open Biological and Biomedical Ontology (OBO) Foundry, many ontologies represent the execution of a plan specification as a process in which a realizable entity that concretizes the plan specification, a "realizable concretization" (RC), is realized. This representation, which we call the "RC-account", provides a straightforward way to relate a plan specification to the entity that bears the realizable concretization and the process that realizes the realizable concretization. However, the adequacy of the RC-account has not been evaluated in the scientific literature. In this manuscript, we provide this evaluation and, thereby, give ontology developers sound reasons to use or not use the RC-account pattern.

Results: Analysis of the RC-account reveals that it is not adequate for representing failed plans. If the realizable concretization is flawed in some way, it is unclear what (if any) relation holds between the realizable entity and the plan specification. If the execution (i.e., realization) of the realizable concretization fails to carry out the actions given in the plan specification, it is unclear under the RC-account how to directly relate the failed execution to the entity carrying out the instructions given in the plan specification. These issues are exacerbated in the presence of changing plans.

Conclusions: We propose two solutions for representing failed plans. The first uses the Common Core Ontologies 'prescribed by' relation to connect a plan specification to the entity or process that utilizes the plan specification as a guide. The second, more complex, solution incorporates the process of creating a plan (in the sense of an intention to execute a plan specification) into the representation of executing plan specifications. We hypothesize that the first solution (i.e., use of 'prescribed by') is adequate for most situations. However, more research is needed to test this hypothesis as well as explore the other solutions presented in this manuscript.
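The first proposed solution connects a process to the plan specification that guides it via a 'prescribed by' relation, which holds whether or not the execution succeeds. A minimal triple-style sketch (the entity and predicate names are illustrative, not the Common Core Ontologies' actual identifiers):

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
triples = set()

def assert_triple(subject, predicate, obj):
    triples.add((subject, predicate, obj))

# A process is guided by a plan specification even when it fails,
# so 'prescribed_by' can be asserted independently of the outcome.
assert_triple("surgery_process_42", "prescribed_by", "surgical_plan_spec_7")
assert_triple("surgery_process_42", "has_outcome", "failed")

prescribed = {s for (s, p, o) in triples
              if p == "prescribed_by" and o == "surgical_plan_spec_7"}
```

The point of the sketch is that the plan-to-process link survives failure, which is exactly where the RC-account's realization-based link becomes unclear.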

Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models.
IF 2 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2024-08-10 DOI: 10.1186/s13326-024-00318-x
Jianfu Li, Yiming Li, Yuanyi Pan, Jinjing Guo, Zenan Sun, Fang Li, Yongqun He, Cui Tao

Background: Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects.

Clinicaltrials: gov is a valuable repository of clinical trial information, but the vaccine data in them lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.

Results: In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of vaccine names in clinical trials to the VO. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model.

Conclusion: This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
