
Latest articles from Journal of Biomedical Semantics

BASIL DB: bioactive semantic integration and linking database.
IF 2.0 | CAS Tier 3 (Engineering & Technology) | JCR Q3 (Mathematical & Computational Biology) | Pub Date: 2025-08-13 | DOI: 10.1186/s13326-025-00336-3
David Jackson, Paul Groth, Hazar Harmouch

Background: Bioactive compounds found in foods and plants can provide health benefits, including antioxidant and anti-inflammatory effects. Research into their role in disease prevention and personalized nutrition is expanding, but challenges such as data complexity, inconsistent methods, and the rapid growth of scientific literature can hinder progress. To address these issues, we developed BASIL DB (BioActive Semantic Integration and Linking Database), a knowledge graph (KG) database that leverages natural language processing (NLP) techniques to streamline data organization and analysis. This automated approach offers greater scalability and comprehensiveness than traditional methods such as manual data curation and entry.

Construction and content: The process of constructing the BASIL DB is divided into four fundamental steps: data collection, data preprocessing, data extraction, and data integration. Data on bioactives and foods are sourced from structured databases, and relevant randomized controlled trials (RCTs) are retrieved from PubMed. The data are then prepared by cleaning inconsistencies and structuring them for analysis. In the data extraction phase, NLP tools, including a large language model (LLM), are used to analyze clinical trials and extract data on bioactive compounds and their health impacts. The integration phase compiles these data into a knowledge graph, with the entities Foods, Bioactives, and Health Conditions as nodes and their interactions as edges. To quantify the relationships between these entities, we generate a weight for each edge on the basis of empirical evidence and methodological rigor.
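The evidence-weighted edge scheme described above can be sketched in plain Python. The node names, evidence fields, and weighting formula below are invented for illustration and are not taken from BASIL DB:

```python
# Minimal sketch of a BASIL-style knowledge graph: Foods, Bioactives, and
# Health Conditions as nodes, evidence-weighted interactions as edges.
# The weighting formula (trial count scaled by rigor) is hypothetical.
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str      # e.g. a bioactive
    target: str      # e.g. a health condition
    n_trials: int    # number of supporting RCTs
    rigor: float     # methodological-rigor score in [0, 1]

    @property
    def weight(self) -> float:
        # Hypothetical weight: evidence volume scaled by rigor.
        return self.n_trials * self.rigor


@dataclass
class KnowledgeGraph:
    nodes: set[str] = field(default_factory=set)
    edges: list[Edge] = field(default_factory=list)

    def link(self, source: str, target: str, n_trials: int, rigor: float) -> None:
        self.nodes.update({source, target})
        self.edges.append(Edge(source, target, n_trials, rigor))

    def neighbours(self, node: str) -> list[tuple[str, float]]:
        """Edges incident to `node`, strongest evidence first."""
        out = [(e.target if e.source == node else e.source, e.weight)
               for e in self.edges if node in (e.source, e.target)]
        return sorted(out, key=lambda t: -t[1])


kg = KnowledgeGraph()
kg.link("green tea", "EGCG", n_trials=12, rigor=0.9)        # Food -> Bioactive
kg.link("EGCG", "oxidative stress", n_trials=8, rigor=0.7)  # Bioactive -> Condition
print(kg.neighbours("EGCG"))
```

A real pipeline would derive `n_trials` and `rigor` from the extracted RCT data rather than hard-coding them.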

Utility and discussion: The BASIL DB incorporates 433 compounds, 40,296 research papers, 7,256 health effects, and 4,197 food items. The database features query and visualization capabilities, including interactive graphs and custom filtering options, that showcase different aspects of the data. Users can explore the relationships between bioactives and health effects, enhancing both research efficiency and insight discovery.

Conclusion: The BASIL DB is a knowledge graph database of bioactive compounds. This study provides a structured resource for exploring the relationships among bioactives, foods, and health outcomes, representing a step toward a more systematic and data-driven approach to understanding the health effects of bioactive compounds. Future work will focus on expanding the database and refining the utilized methods. Extending the BASIL DB will help bridge the gap between traditional and conventional approaches to nutrition, guiding future research in bioactive compound discovery and health optimization.

Availability: Users can access and explore the data via https://basil-db.github.io/info.html or fork and run the respective script via https://github.com/basil-db/script.

Cited by: 0
Semantic classification of Indonesian consumer health questions.
IF 2.0 | CAS Tier 3 (Engineering & Technology) | JCR Q3 (Mathematical & Computational Biology) | Pub Date: 2025-07-28 | DOI: 10.1186/s13326-025-00334-5
Raniah Nur Hanami, Rahmad Mahendra, Alfan Farizki Wicaksono

Purpose: Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question, which would allow the system to understand the patient's intent and route them toward the relevant information.

Methods: This paper proposes a novel two-step approach to semantic type classification of Indonesian consumer health questions. Indonesian health-domain data are scarce, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model's predictions, we employ an explainable-model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model's decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of "semantic bias", where words with no inherent association with a specific semantic type disproportionately influence the model's predictions.

Results: The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type, suggesting a potential bias in both the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words "kanker" (cancer) and "depresi" (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.

Conclusion: We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance; these words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in performance between the two settings, suggesting that the models may be able to mitigate the influence of such bias during learning.
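The word-removal intuition behind LIME-style attribution can be illustrated with a toy keyword model: score each word by how much dropping it changes the predicted probability for one class. The cue list and probability function below are invented for this sketch and are not the paper's classifiers:

```python
# Toy LIME-style word attribution: perturb the input by removing one word at
# a time and measure the change in a (hypothetical) class probability.
DIAGNOSIS_CUES = {"cancer", "depression", "symptoms", "diagnosed"}


def p_diagnosis(tokens: list[str]) -> float:
    """Hypothetical probability that a question is a DIAGNOSIS question."""
    hits = sum(t in DIAGNOSIS_CUES for t in tokens)
    return hits / (hits + 1)  # squashes cue counts into [0, 1)


def word_importance(question: str) -> dict[str, float]:
    tokens = question.lower().split()
    base = p_diagnosis(tokens)
    scores = {}
    for i, tok in enumerate(tokens):
        perturbed = tokens[:i] + tokens[i + 1:]   # drop one word
        scores[tok] = base - p_diagnosis(perturbed)
    return scores


scores = word_importance("could these symptoms mean cancer")
# Words whose removal lowers p(DIAGNOSIS) the most are the strongest cues.
top = max(scores, key=scores.get)
```

In the real framework the perturbed samples are scored by the trained model and a local linear surrogate is fitted over them; a cue word that dominates attributions without a true semantic link is exactly the "semantic bias" the paper probes.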
Cited by: 0
A fourfold pathogen reference ontology suite.
IF 2.0 | CAS Tier 3 (Engineering & Technology) | JCR Q3 (Mathematical & Computational Biology) | Pub Date: 2025-07-09 | DOI: 10.1186/s13326-025-00333-6
John Beverley, Shane Babcock, Carter Benson, Giacomo De Colle, Sydney Cohen, Alexander D Diehl, Ram A N R Challa, Rachel A Mavrovich, Joshua Billig, Anthony Huffman, Yongqun He

Background: Infectious diseases remain a critical global health challenge, and the integration of standardized ontologies plays a vital role in managing related data. The Infectious Disease Ontology (IDO) and its extensions, such as the Coronavirus Infectious Disease Ontology (CIDO), are essential for organizing and disseminating information related to infectious diseases. The COVID-19 pandemic highlighted the need to update IDO and its virus-specific extensions. There is an additional need to update IDO extensions specific to bacterial, fungal, and parasitic infectious diseases.

Methods: The "hub-and-spoke" methodology is adopted to generate pathogen-specific extensions of IDO: Virus Infectious Disease Ontology (VIDO), Bacteria Infectious Disease Ontology (BIDO), Mycosis Infectious Disease Ontology (MIDO), and Parasite Infectious Disease Ontology (PIDO).
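The hub-and-spoke layout can be pictured as a shared core vocabulary that every pathogen-specific extension imports and then specializes. The term names below are invented placeholders, not actual IDO identifiers:

```python
# Illustrative "hub-and-spoke" layout: IDO is the hub; each extension
# (VIDO, BIDO, MIDO, PIDO) imports the hub's general terms and adds its own.
# All term names here are placeholders, not real IDO classes.
HUB = {"infectious disease", "pathogen", "host"}

SPOKES = {
    "VIDO": {"virus", "virion"},
    "BIDO": {"bacterium"},
    "MIDO": {"fungus", "mycosis"},
    "PIDO": {"parasite"},
}


def terms(spoke: str) -> set[str]:
    """A spoke's full vocabulary = imported hub terms + its own additions."""
    return HUB | SPOKES[spoke]


# Every extension shares the hub vocabulary, which is what lets queries and
# application ontologies work across all four pathogen families.
shared = set.intersection(*(terms(s) for s in SPOKES))
assert shared == HUB
```

The design payoff named in the abstract (modularization and reusability) corresponds here to the fact that the hub is defined once and merely imported by each spoke.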

Results: IDO is introduced before reporting on the scopes, major classes and relations, applications and extensions of IDO to VIDO, BIDO, MIDO, and PIDO.

Conclusions: The creation of pathogen-specific reference ontologies advances modularization and reusability of infectious disease ontologies within the IDO ecosystem. Future work will focus on further refining these ontologies, creating new extensions, and developing application ontologies based on them, in line with ongoing efforts to standardize biological and biomedical terminologies for improved data sharing, quality, and analysis.

Cited by: 0
medicX-KG: a knowledge graph for pharmacists' drug information needs.
IF 2.0 | CAS Tier 3 (Engineering & Technology) | JCR Q3 (Mathematical & Computational Biology) | Pub Date: 2025-07-01 | DOI: 10.1186/s13326-025-00332-7
Lizzy Farrugia, Lilian M Azzopardi, Jeremy Debattista, Charlie Abela

The role of pharmacists is evolving from medicine dispensing to delivering comprehensive pharmaceutical services within multidisciplinary healthcare teams. Central to this shift is access to accurate, up-to-date medicinal product information supported by robust data integration. Leveraging artificial intelligence and semantic technologies, Knowledge Graphs (KGs) uncover hidden relationships and enable data-driven decision-making. This paper presents medicX-KG, a pharmacist-oriented knowledge graph supporting clinical and regulatory decisions. It forms the semantic layer of the broader medicX platform, powering predictive and explainable pharmacy services. medicX-KG integrates data from three sources: the British National Formulary (BNF), DrugBank, and the Malta Medicines Authority (MMA). This combination addresses Malta's regulatory landscape, which combines European Medicines Agency alignment with partial dependence on UK supply. The KG tackles the absence of a unified national drug repository, reducing pharmacists' reliance on fragmented sources. Its design was informed by interviews with practising pharmacists to ensure real-world applicability. We detail the KG's construction, including data extraction, ontology design, and semantic mapping. Evaluation demonstrates that medicX-KG effectively supports queries about drug availability, interactions, adverse reactions, and therapeutic classes.
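A minimal sketch of the kind of interaction query such a knowledge graph supports, "which of a patient's current drugs interact with a new prescription?", reduced to a lookup over interaction edges. The drug pairs and interaction notes below are illustrative examples, not medicX-KG content:

```python
# Sketch of a pharmacist-facing interaction check over a tiny KG fragment:
# interaction edges are keyed by unordered drug pairs (frozensets).
# The pairs and notes below are illustrative, not medicX-KG data.
INTERACTS = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"simvastatin", "clarithromycin"}): "raised statin exposure",
}


def check_new_drug(current: list[str], new_drug: str) -> list[tuple[str, str]]:
    """Return (existing drug, interaction note) pairs triggered by a new prescription."""
    return [(drug, INTERACTS[frozenset({drug, new_drug})])
            for drug in current
            if frozenset({drug, new_drug}) in INTERACTS]


alerts = check_new_drug(["warfarin", "metformin"], "aspirin")
```

In the actual system such lookups would be expressed as graph queries over the integrated BNF/DrugBank/MMA data rather than an in-memory dictionary.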

Cited by: 0
Unveiling differential adverse event profiles in vaccines via LLM text embeddings and ontology semantic analysis.
IF 2.0 | CAS Tier 3 (Engineering & Technology) | JCR Q3 (Mathematical & Computational Biology) | Pub Date: 2025-05-23 | DOI: 10.1186/s13326-025-00331-8
Zhigang Wang, Xingxian Li, Jie Zheng, Yongqun He

Background: Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and assignment of AEs to terms in a terminology or ontology, a time-consuming process that is constrained in scope. This study explores the potential of using Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis.

Results: We used the Llama-3 LLM to extract AE information from FDA-approved package inserts for 111 licensed vaccines, including 15 influenza vaccines. Text embeddings were then generated for each vaccine's AEs using the nomic-embed-text and mxbai-embed-large models. Llama-3 achieved over 80% accuracy in extracting AE text from the package inserts. To further evaluate the embeddings, the vaccines were clustered using two methods: (1) LLM text-embedding-based clustering and (2) ontology-based semantic similarity analysis. The ontology-based method mapped AEs to the Human Phenotype Ontology (HPO) and the Ontology of Adverse Events (OAE), with semantic similarity analyzed using Lin's method. Compared with the semantic similarity analysis, the LLM approach captured more differential AE profiles. Furthermore, the LLM-derived text embeddings were used to develop a Lasso logistic regression model to predict whether a vaccine is "Live" or "Non-Live", where "Non-Live" refers to all vaccines that do not contain live organisms, including inactivated and mRNA vaccines. A comparative analysis showed that, despite similar clustering patterns, the nomic-embed-text model outperformed mxbai-embed-large, achieving 80.00% sensitivity, 83.06% specificity, and 81.89% accuracy in 10-fold cross-validation. Our analysis of the AE LLM embeddings also identified many AE patterns, which we illustrate with examples.
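The embedding-based comparison of AE profiles can be illustrated with cosine similarity over toy vectors; the three-dimensional "embeddings" below are made-up stand-ins for real LLM embedding vectors:

```python
# Sketch of comparing vaccines by cosine similarity of their AE-text
# embeddings. Real embeddings have hundreds of dimensions; these tiny
# vectors are fabricated purely to show the mechanics.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


embeddings = {
    "vaccine_A": [0.9, 0.1, 0.0],  # e.g. mostly injection-site AEs
    "vaccine_B": [0.8, 0.2, 0.1],  # a similar AE profile
    "vaccine_C": [0.0, 0.1, 0.9],  # a very different AE profile
}


def nearest(name: str) -> str:
    """The vaccine whose AE profile is most similar to `name`'s."""
    others = (o for o in embeddings if o != name)
    return max(others, key=lambda o: cosine(embeddings[name], embeddings[o]))
```

Clustering on such pairwise similarities is what groups vaccines with shared AE profiles; the same vectors could also feed a downstream classifier such as the Live/Non-Live model.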

Conclusion: This study demonstrates the effectiveness of LLMs for automated AE extraction and analysis, and LLM text embeddings capture latent information about AEs, enabling more comprehensive knowledge discovery. Our findings suggest that LLMs demonstrate substantial potential for improving vaccine safety and public health research.

引用次数: 0
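The embedding-comparison step described above can be sketched in plain Python: given one AE-text embedding vector per vaccine, pairwise cosine similarity indicates which vaccines share similar AE profiles. The vectors and vaccine names below are synthetic stand-ins, not actual nomic-embed-text output.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Synthetic stand-ins for per-vaccine AE text embeddings; a real
# pipeline would obtain these from a model such as nomic-embed-text.
embeddings = {
    "inactivated_flu_A": [0.90, 0.10, 0.20],
    "inactivated_flu_B": [0.88, 0.15, 0.18],
    "live_attenuated_C": [0.10, 0.85, 0.40],
}

names = list(embeddings)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} ~ {b}: {cosine(embeddings[a], embeddings[b]):.3f}")
```

Vaccines whose AE descriptions embed close together (high cosine similarity) land in the same cluster; the study's Lasso classifier operates on the same vectors.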
The SPHN Schema Forge - transform healthcare semantics from human-readable to machine-readable by leveraging semantic web technologies.
IF 2, CAS Zone 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-05-08, DOI: 10.1186/s13326-025-00330-9
Vasundra Touré, Deepak Unni, Philip Krauss, Abdelhamid Abdelwahed, Jascha Buchhorn, Leon Hinderling, Thomas R Geiger, Sabine Österle

Background: The Swiss Personalized Health Network (SPHN) adopted the Resource Description Framework (RDF), a core component of the Semantic Web technology stack, for the formal encoding and exchange of healthcare data in a medical knowledge graph. The SPHN RDF Schema defines the semantics of how data elements should be represented. While RDF is machine-readable and machine-interpretable, reading and understanding knowledge represented in RDF can be challenging for individuals without a specialized background. For this reason, the semantics described in the SPHN RDF Schema are primarily defined in a user-accessible tabular format, the SPHN Dataset, before being translated into their RDF representation. However, this translation process was previously manual, time-consuming, and labor-intensive.

Result: To automate and streamline the translation from tabular to RDF representation, the SPHN Schema Forge web service was developed. With a few clicks, this tool automatically converts an SPHN-compliant Dataset spreadsheet into an RDF schema. It also generates SHACL rules for data validation, an HTML visualization of the schema, and SPARQL queries for basic data analysis.

Conclusion: The SPHN Schema Forge significantly reduces the manual effort and time required for schema generation, enabling researchers to focus on more meaningful tasks such as data interpretation and analysis within the SPHN framework.
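The spreadsheet-to-RDF conversion can be illustrated with a minimal sketch: tabular rows describing concepts are serialized as RDF/Turtle class definitions. The prefix, row layout, and concept names below are invented for illustration and do not reflect the actual SPHN Schema Forge implementation.

```python
# Minimal sketch: turn tabular concept definitions into an RDF/Turtle
# schema fragment. Prefix and rows are illustrative placeholders.
rows = [
    # (concept name, parent class, human-readable description)
    ("BloodPressure", "Measurement", "Arterial blood pressure reading"),
    ("BodyTemperature", "Measurement", "Core body temperature"),
]

def to_turtle(rows, prefix="https://example.org/sphn#"):
    lines = [
        f"@prefix ex: <{prefix}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
    ]
    for name, parent, desc in rows:
        lines.append(f"ex:{name} a rdfs:Class ;")
        lines.append(f"    rdfs:subClassOf ex:{parent} ;")
        lines.append(f'    rdfs:comment "{desc}" .')
    return "\n".join(lines)

print(to_turtle(rows))
```

A real implementation would additionally emit SHACL shapes for each class so that instance data can be validated against the generated schema.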

{"title":"The SPHN Schema Forge - transform healthcare semantics from human-readable to machine-readable by leveraging semantic web technologies.","authors":"Vasundra Touré, Deepak Unni, Philip Krauss, Abdelhamid Abdelwahed, Jascha Buchhorn, Leon Hinderling, Thomas R Geiger, Sabine Österle","doi":"10.1186/s13326-025-00330-9","DOIUrl":"10.1186/s13326-025-00330-9","url":null,"abstract":"<p><strong>Background: </strong>The Swiss Personalized Health Network (SPHN) adopted the Resource Description Framework (RDF), a core component of the Semantic Web technology stack, for the formal encoding and exchange of healthcare data in a medical knowledge graph. The SPHN RDF Schema defines the semantics on how data elements should be represented. While RDF is proven to be machine readable and interpretable, it can be challenging for individuals without specialized background to read and understand the knowledge represented in RDF. For this reason, the semantics described in the SPHN RDF Schema are primarily defined in a user-accessible tabular format, the SPHN Dataset, before being translated into its RDF representation. However, this translation process was previously manual, time-consuming and labor-intensive.</p><p><strong>Result: </strong>To automate and streamline the translation from tabular to RDF representation, the SPHN Schema Forge web service was developed. With a few clicks, this tool automatically converts an SPHN-compliant Dataset spreadsheet into an RDF schema. 
Additionally, it generates SHACL rules for data validation, an HTML visualization of the schema and SPARQL queries for basic data analysis.</p><p><strong>Conclusion: </strong>The SPHN Schema Forge significantly reduces the manual effort and time required for schema generation, enabling researchers to focus on more meaningful tasks such as data interpretation and analysis within the SPHN framework.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"9"},"PeriodicalIF":2.0,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12063216/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144005244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Sentences, entities, and keyphrases extraction from consumer health forums using multi-task learning.
IF 2, CAS Zone 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-05-06, DOI: 10.1186/s13326-025-00329-2
Tsaqif Naufal, Rahmad Mahendra, Alfan Farizki Wicaksono

Purpose: Online consumer health forums offer an alternative source of health-related information for internet users seeking specific details that may not be readily available through articles or other one-way communication channels. However, the effectiveness of these forums can be constrained by the limited number of healthcare professionals actively participating, which can impact response times to user inquiries. One potential solution to this issue is the integration of a semi-automatic system. A critical component of such a system is question processing, which often involves sentence recognition (SR), medical entity recognition (MER), and keyphrase extraction (KE) modules. We posit that the development of these three modules would enable the system to identify critical components of the question, thereby facilitating a deeper understanding of the question, and allowing for the re-formulation of more effective questions with extracted key information.

Methods: This work contributes to two key aspects related to these three tasks. First, we expand and publicly release an Indonesian dataset for each task. Second, we establish a baseline for all three tasks within the Indonesian language domain by employing transformer-based models with nine distinct encoder variations. Our feature studies revealed an interdependence among these three tasks. Consequently, we propose several multi-task learning (MTL) models, both in pairwise and three-way configurations, incorporating parallel and hierarchical architectures.

Results: Using F1-score at the chunk level, the inter-annotator agreements for the SR, MER, and KE tasks were 88.61%, 64.83%, and 35.01%, respectively. In single-task learning (STL) settings, the best performance on each task was achieved by a different model, with IndoNLU-LARGE obtaining the highest average score. These results suggest that a larger model does not always perform better. We also found no clear indication of whether Indonesian or multilingual language models generally performed better on our tasks. In pairwise MTL settings, we found that pairing tasks could outperform the STL baseline on all three tasks. Despite varying loss weights across our three-way MTL models, we did not identify a consistent pattern: while some configurations improved MER and KE performance, none surpassed the best pairwise MTL model on the SR task.
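Chunk-level F1, used above to measure inter-annotator agreement, can be computed by treating each annotator's labeled chunks as a set of (start, end, label) spans; a chunk counts as a true positive only if both annotators produced it exactly. A small sketch with invented spans:

```python
def chunk_f1(gold, pred):
    # gold, pred: sets of (start, end, label) chunk spans.
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)          # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented annotations from two annotators over one forum question.
ann1 = {(0, 3, "SYMPTOM"), (5, 8, "DRUG"), (10, 12, "KEYPHRASE")}
ann2 = {(0, 3, "SYMPTOM"), (5, 9, "DRUG"), (10, 12, "KEYPHRASE")}
print(f"chunk-level F1: {chunk_f1(ann1, ann2):.4f}")  # → chunk-level F1: 0.6667
```

The low KE agreement reported above (35.01%) reflects how sensitive this exact-match metric is to boundary disagreements like the DRUG span here.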

Conclusion: We extended an Indonesian dataset for the SR, MER, and KE tasks, resulting in 1,173 labeled data points split into 773 training instances, 200 validation instances, and 200 testing instances. We then used transformer-based models to set a baseline for all three tasks. Our MTL experiments suggested that additional information about the other two tasks benefits the learning process for the MER and KE tasks, while having minimal impact on the SR task.

{"title":"Sentences, entities, and keyphrases extraction from consumer health forums using multi-task learning.","authors":"Tsaqif Naufal, Rahmad Mahendra, Alfan Farizki Wicaksono","doi":"10.1186/s13326-025-00329-2","DOIUrl":"10.1186/s13326-025-00329-2","url":null,"abstract":"<p><strong>Purpose: </strong>Online consumer health forums offer an alternative source of health-related information for internet users seeking specific details that may not be readily available through articles or other one-way communication channels. However, the effectiveness of these forums can be constrained by the limited number of healthcare professionals actively participating, which can impact response times to user inquiries. One potential solution to this issue is the integration of a semi-automatic system. A critical component of such a system is question processing, which often involves sentence recognition (SR), medical entity recognition (MER), and keyphrase extraction (KE) modules. We posit that the development of these three modules would enable the system to identify critical components of the question, thereby facilitating a deeper understanding of the question, and allowing for the re-formulation of more effective questions with extracted key information.</p><p><strong>Methods: </strong>This work contributes to two key aspects related to these three tasks. First, we expand and publicly release an Indonesian dataset for each task. Second, we establish a baseline for all three tasks within the Indonesian language domain by employing transformer-based models with nine distinct encoder variations. Our feature studies revealed an interdependence among these three tasks. 
Consequently, we propose several multi-task learning (MTL) models, both in pairwise and three-way configurations, incorporating parallel and hierarchical architectures.</p><p><strong>Results: </strong>Using F1-score at the chunk level, the inter-annotator agreements for SR, MER, and KE tasks were <math><mrow><mn>88.61</mn> <mo>%</mo> <mo>,</mo> <mn>64.83</mn> <mo>%</mo></mrow> </math> , and <math><mrow><mn>35.01</mn> <mo>%</mo></mrow> </math> respectively. In single-task learning (STL) settings, the best performance for each task was achieved by different model, with <math><msub><mtext>IndoNLU</mtext> <mtext>LARGE</mtext></msub> </math> obtained the highest average score. These results suggested that a larger model did not always perform better. We also found no indication of which ones between Indonesian and multilingual language models that generally performed better for our tasks. In pairwise MTL settings, we found that pairing tasks could outperform the STL baseline for all three tasks. Despite varying loss weights across our three-way MTL models, we did not identify a consistent pattern. While some configurations improved MER and KE performance, none surpassed the best pairwise MTL model for the SR task.</p><p><strong>Conclusion: </strong>We extended an Indonesian dataset for SR, MER, and KE tasks, resulted in 1, 173 labeled data points which splitted into 773 training instances, 200 validation instances, and 200 testing instances. We then used transformer-based models to set a baseline for all three tasks. 
Our MTL experiments suggested that additional informat","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"8"},"PeriodicalIF":2.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12057135/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144025207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantics in action: a guide for representing clinical data elements with SNOMED CT.
IF 2, CAS Zone 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-03-27, DOI: 10.1186/s13326-025-00326-5
Julien Ehrsam, Christophe Gaudet-Blavignac, Mirjam Mattei, Monika Baumann, Christian Lovis

Background: Clinical data is abundant, but meaningful reuse remains lacking. Semantic representation using SNOMED CT can improve research, public health, and quality of care. However, the lack of applied guidelines for industrialising the process hinders sustainability and reproducibility. This work describes a guide for the semantic representation of data elements with SNOMED CT, addressing challenges encountered during its application. The representation of the institutional data warehouse started with the guidelines proposed by SNOMED International and other groups; however, applying manual, expert-driven representation at large scale led to the development of additional rules.

Results: An eight-rule, step-by-step guide was developed iteratively through focus groups. The rules are continuously refined through usage and growing coverage, and tested in practice to ensure they achieve the desired outcome. All rules prioritize maintaining semantic accuracy, the main goal of our strategy. They are divided into four groups, which apply to understanding the data correctly (Context) and to using SNOMED CT properly (Single concepts first, Approved post-coordination, Extending post-coordination).
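The post-coordination rule groups concern combining concepts into a single SNOMED CT compositional-grammar expression of the form `focus : attribute = value`. A minimal sketch of building such an expression; the concept identifiers and terms below are placeholders, not verified SNOMED CT codes.

```python
def postcoordinate(focus_concept, refinements):
    # Build a SNOMED CT compositional-grammar expression string.
    # focus_concept: an "id |term|" string; refinements: a list of
    # (attribute, value) pairs, each an "id |term|" string.
    refined = ", ".join(f"{attr} = {val}" for attr, val in refinements)
    return f"{focus_concept} : {refined}"

# Placeholder identifiers for illustration only.
expr = postcoordinate(
    "100000001 |example disorder|",
    [("100000002 |finding site (illustrative)|",
      "100000003 |example body structure|")],
)
print(expr)
```

Under the guide's ordering, such an expression would be built only after rule group two fails, i.e. no single pre-coordinated concept captures the data element.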

Conclusions: This work provides a practical framework for semantic representation using SNOMED CT, enabling greater accuracy and consistency by promoting a common method. While addressing challenges of large-scale implementation, the guide supports the drive from data centric models to a semantic centric approach, leveraging interoperability and more effective reuse of clinical data.

{"title":"Semantics in action: a guide for representing clinical data elements with SNOMED CT.","authors":"Julien Ehrsam, Christophe Gaudet-Blavignac, Mirjam Mattei, Monika Baumann, Christian Lovis","doi":"10.1186/s13326-025-00326-5","DOIUrl":"10.1186/s13326-025-00326-5","url":null,"abstract":"<p><strong>Background: </strong>Clinical data is abundant, but meaningful reuse remains lacking. Semantic representation using SNOMED CT can improve research, public health, and quality of care. However, the lack of applied guidelines to industrialise the process hinders sustainability and reproducibility. This work describes a guide for semantic representation of data elements with SNOMED CT, addressing challenges encountered during its application. The representation of the institutional data warehouse started with the guidelines proposed by SNOMED International and other groups. However, the application at large scale of manual expert-driven representation led to the development of additional rules.</p><p><strong>Results: </strong>An eight-rule step-by-step guide was developed iteratively through focus groups. Continuously refined by usage and growing coverage, they are tested in practice to ensure they achieve the desired outcome. All rules prioritize maintaining semantic accuracy, which is the main goal of our strategy. They are divided into four groups which apply to understanding the data correctly (Context), and to using SNOMED CT properly (Single concepts first, Approved post-coordination, Extending post-coordination).</p><p><strong>Conclusions: </strong>This work provides a practical framework for semantic representation using SNOMED CT, enabling greater accuracy and consistency by promoting a common method. 
While addressing challenges of large-scale implementation, the guide supports the drive from data centric models to a semantic centric approach, leveraging interoperability and more effective reuse of clinical data.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"7"},"PeriodicalIF":2.0,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11948947/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143730232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Standardizing free-text data exemplified by two fields from the Immune Epitope Database.
IF 2, CAS Zone 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-03-22, DOI: 10.1186/s13326-025-00324-7
Sebastian Duesing, Jason Bennett, James A Overton, Randi Vita, Bjoern Peters

Background: While unstructured data, such as free text, constitutes a large share of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allows for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of its application to two different fields curated from the literature in the Immune Epitope Database (IEDB): "age" and "data-location" (the part of a paper in which data was found).

Results: Free text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.
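The three normalization stages can be illustrated as ordered regular-expression rule sets applied in sequence. The handful of rules below are invented examples for age-like strings, not the actual IEDB rule sets (which comprise 21, 94, and 16 rules for the age field).

```python
import re

# Illustrative rules per stage; the real IEDB rule sets are far larger.
CHARACTER_RULES = [(r"[\u2013\u2014]", "-"),      # normalize dashes
                   (r"\s+", " ")]                 # collapse whitespace
WORD_RULES = [(r"\byrs?\b", "years"),
              (r"\bwks?\b", "weeks")]
PHRASE_RULES = [(r"^(\d+)-(\d+) years$", r"\1 to \2 years")]

def normalize(text):
    text = text.strip().lower()
    for stage in (CHARACTER_RULES, WORD_RULES, PHRASE_RULES):
        for pattern, repl in stage:
            text = re.sub(pattern, repl, text)
    return text

print(normalize("18–65  yrs"))  # → 18 to 65 years
```

Staging matters: the phrase rule only fires because the character and word stages have already produced the canonical "N-M years" form it expects.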

Conclusions: We developed a generalizable approach for the normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules was a one-time effort for a given field; the rules can now be applied to data as it is being curated. The standardization achieved in the two datasets tested significantly reduces variance in the content, enhancing the findability and usability of the data, chiefly by improving search functionality and enabling linkages with formal ontologies.

{"title":"Standardizing free-text data exemplified by two fields from the Immune Epitope Database.","authors":"Sebastian Duesing, Jason Bennett, James A Overton, Randi Vita, Bjoern Peters","doi":"10.1186/s13326-025-00324-7","DOIUrl":"10.1186/s13326-025-00324-7","url":null,"abstract":"<p><strong>Background: </strong>While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different fields curated from the literature in the Immune Epitope Database (IEDB): \"age\" and \"data-location\" (the part of a paper in which data was found).</p><p><strong>Results: </strong>Free text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. 
For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.</p><p><strong>Conclusions: </strong>We developed a generalizable approach for normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules took a one-time effort for a given field that can now be applied to data as it is being curated. The standardization achieved in two datasets tested produces significantly reduced variance in the content which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"5"},"PeriodicalIF":2.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929277/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143692223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Digital evolution: Novo Nordisk's shift to ontology-based data management.
IF 2, CAS Zone 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2025-03-22, DOI: 10.1186/s13326-025-00327-4
Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose

The amount of biomedical data is growing, and managing it is increasingly challenging. While Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. Here, we include both our technical blueprint and our approach for organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organization's digital aspirations for data federation and discovery fuelled by artificial intelligence. Our aim for this paper is to share the lessons learned in order to foster dialogue with parties navigating similar waters while collectively advancing the efforts in the fields of data management, semantics and data driven drug discovery.

{"title":"Digital evolution: Novo Nordisk's shift to ontology-based data management.","authors":"Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose","doi":"10.1186/s13326-025-00327-4","DOIUrl":"10.1186/s13326-025-00327-4","url":null,"abstract":"<p><p>The amount of biomedical data is growing, and managing it is increasingly challenging. While Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. Here, we include both our technical blueprint and our approach for organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organization's digital aspirations for data federation and discovery fuelled by artificial intelligence. 
Our aim for this paper is to share the lessons learned in order to foster dialogue with parties navigating similar waters while collectively advancing the efforts in the fields of data management, semantics and data driven drug discovery.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"16 1","pages":"6"},"PeriodicalIF":2.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929979/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143692220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0