
Latest publications in Journal of Biomedical Semantics

Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-09-01. DOI: 10.1186/s13326-023-00292-w
Mariana Neves, Antonina Klippert, Fanny Knöspel, Juliane Rudeck, Ailine Stolz, Zsofia Ban, Markus Becker, Kai Diederich, Barbara Grune, Pia Kahnau, Nils Ohnesorge, Johannes Pucher, Gilbert Schönfelder, Bettina Bert, Daniel Butzke

Current animal protection laws require replacement of animal experiments with alternative methods, whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of a very large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine-tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the experimental model used according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed to improve the quality of the annotations that had disagreements in the first round. Furthermore, we conducted various machine learning experiments based on supervised learning to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, ranging from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model fine-tuned on our corpus, achieving an overall F-score of 0.83.
We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that it is suitable for training reliable predictive models for the automatic classification of biomedical literature according to the experimental models used. Our SMAFIRA ("Smart feature-based interactive") search tool ( https://smafira.bf3r.de ) will employ this classifier to support the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
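The inter-annotator agreement figures above are kappa coefficients for pairs of annotators; assuming the standard Cohen's kappa for two raters (an assumption — the abstract does not name the variant), a per-label score over binary label decisions can be sketched as follows. The example data are invented for illustration, not from the GoldHamster corpus.

```python
def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators' binary labels on the same documents."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    # Observed agreement: fraction of documents where both annotators agree.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected chance agreement, from each annotator's marginal "yes" rate.
    p1_yes = sum(ann1) / n
    p2_yes = sum(ann2) / n
    expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical example: two annotators labelling 8 abstracts for one tag.
a1 = [1, 1, 0, 1, 0, 0, 1, 0]
a2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(a1, a2), 2))  # → 0.5
```

A kappa of 1.0 means perfect agreement, 0 means chance-level agreement; the corpus's per-label range of 0.42-0.82 falls between "moderate" and "almost perfect" on the usual interpretation scales.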

Journal of Biomedical Semantics, vol. 14, no. 1, p. 13, 2023. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10472567/pdf/
Citations: 0
Automatic transparency evaluation for open knowledge extraction systems.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-08-31. DOI: 10.1186/s13326-023-00293-9
Maryam Basereh, Annalina Caputo, Rob Brennan

Background: This paper proposes Cyrus, a new transparency evaluation framework for Open Knowledge Extraction (OKE) systems. Cyrus is based on state-of-the-art transparency models and linked data quality assessment dimensions, and brings together a comprehensive view of transparency dimensions for OKE systems. The Cyrus framework is used to evaluate the transparency of three linked datasets, built from the same corpus by three state-of-the-art OKE systems. The evaluation is performed automatically using a combination of three state-of-the-art FAIRness (Findability, Accessibility, Interoperability, Reusability) assessment tools and a linked data quality evaluation framework called Luzzu. It covers the six Cyrus data transparency dimensions for which existing assessment tools could be identified. OKE systems extract structured knowledge from unstructured or semi-structured text in the form of linked data. These systems are fundamental components of advanced knowledge services. However, due to the lack of a transparency framework for OKE, most OKE systems are not transparent: their processes and outcomes are not understandable or interpretable. A comprehensive framework sheds light on different aspects of transparency, allows comparison between systems by supporting the development of transparency scores, and gives insight into a system's transparency weaknesses and ways to improve them. Automatic evaluation makes transparency assessment scalable. The transparency problem has been identified as critical by the European Union Trustworthy Artificial Intelligence (AI) guidelines. In this paper, Cyrus provides the first comprehensive view of transparency dimensions for OKE systems by merging the perspectives of the FAccT (Fairness, Accountability, and Transparency), FAIR, and linked data quality research communities.

Results: In Cyrus, data transparency comprises ten dimensions grouped into two categories. In this paper, six of these dimensions, i.e., provenance, interpretability, understandability, licensing, availability, and interlinking, have been evaluated automatically for three state-of-the-art OKE systems, using state-of-the-art metrics and tools. Covid-on-the-Web is identified as having the highest mean transparency.

Conclusions: This is the first study of the transparency of OKE systems that provides a comprehensive set of transparency dimensions spanning ethics, trustworthy AI, and data quality approaches to transparency. It also demonstrates for the first time how to perform automated transparency evaluation combining existing FAIRness and linked data quality assessment tools. We show that state-of-the-art OKE systems vary in the transparency of the linked data they generate, and that these differences can be automatically quantified, leading to potential applications in trustworthy AI, compliance, data protection, data governance, and the design and testing of future OKE systems.
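The "mean transparency" comparison above can be sketched as a simple aggregation of per-dimension scores. The dimension names follow the six evaluated in Cyrus; the numeric scores and the second system name are invented for illustration, not taken from the paper.

```python
# Hedged sketch: aggregate per-dimension transparency scores (0..1) into a
# mean transparency score per system, then rank the systems.
scores = {
    "Covid-on-the-Web": {"provenance": 0.9, "interpretability": 0.8,
                         "understandability": 0.7, "licensing": 1.0,
                         "availability": 0.9, "interlinking": 0.6},
    "HypotheticalSystemB": {"provenance": 0.5, "interpretability": 0.6,
                            "understandability": 0.7, "licensing": 0.8,
                            "availability": 0.6, "interlinking": 0.4},
}

def mean_transparency(dims):
    """Unweighted mean over the evaluated transparency dimensions."""
    return sum(dims.values()) / len(dims)

ranked = sorted(scores, key=lambda s: mean_transparency(scores[s]), reverse=True)
print(ranked[0])  # → Covid-on-the-Web
```

An unweighted mean is the simplest defensible aggregation; a real deployment might weight dimensions by importance or report them separately, as a single score can hide weaknesses in individual dimensions.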

Journal of Biomedical Semantics, vol. 14, no. 1, p. 12, 2023. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10468861/pdf/
Citations: 0
Multi-domain knowledge graph embeddings for gene-disease association prediction.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-08-14. DOI: 10.1186/s13326-023-00291-x
Susana Nunes, Rita T Sousa, Catia Pesquita

Background: Predicting gene-disease associations typically requires exploring diverse sources of information as well as sophisticated computational approaches. Knowledge graph embeddings can help tackle these challenges by creating representations of genes and diseases based on the scientific knowledge described in ontologies, which can then be explored by machine learning algorithms. However, state-of-the-art knowledge graph embeddings are produced over a single ontology or multiple but disconnected ones, ignoring the impact that considering multiple interconnected domains can have on complex tasks such as gene-disease association prediction.

Results: We propose a novel approach to predict gene-disease associations using rich semantic representations based on knowledge graph embeddings over multiple ontologies linked by logical definitions and compound ontology mappings. The experiments showed that considering richer knowledge graphs significantly improves gene-disease prediction and that different knowledge graph embedding methods benefit more from distinct types of semantic richness.

Conclusions: This work demonstrated the potential for knowledge graph embeddings across multiple and interconnected biomedical ontologies to support gene-disease prediction. It also paved the way for considering other ontologies or tackling other tasks where multiple perspectives over the data can be beneficial. All software and data are freely available.
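Knowledge graph embedding methods score how plausible a candidate triple (gene, associated-with, disease) is from learned vectors. The paper evaluates several embedding methods without committing to one; as a generic illustration (an assumption, not the authors' method), the classic TransE scoring function treats the relation as a translation from head to tail entity:

```python
def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance of (h + r) from t.
    Scores closer to 0 mean the triple (h, relation, t) is more plausible."""
    return -(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5)

# Toy 3-d embeddings, invented for illustration (real models learn these
# vectors from the knowledge graph during training).
gene = [0.2, 0.1, 0.5]
associated_with = [0.3, 0.4, -0.1]
candidates = {
    "disease_a": [0.5, 0.5, 0.4],   # gene + relation lands almost exactly here
    "disease_b": [-0.6, 0.9, 0.0],
}

# Rank candidate diseases for the gene by triple plausibility.
best = max(candidates, key=lambda d: transe_score(gene, associated_with, candidates[d]))
print(best)  # → disease_a
```

In the paper's setting the embeddings would additionally be informed by multiple interconnected ontologies, so genes and diseases that share ontological context end up closer in the embedding space.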

Journal of Biomedical Semantics, vol. 14, no. 1, p. 11, 2023. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10426189/pdf/
Citations: 0
An extension of the BioAssay Ontology to include pharmacokinetic/pharmacodynamic terminology for the enrichment of scientific workflows.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-08-11. DOI: 10.1186/s13326-023-00288-6
Steve Penn, Jane Lomax, Anneli Karlsson, Vincent Antonucci, Carl-Dieter Zachmann, Samantha Kanza, Stephan Schurer, John Turner

With the capacity to produce and record data electronically, scientific research and its associated data have grown at an unprecedented rate. However, although a substantial amount of data now exists in electronic form, it is still common for scientific research to be recorded as unstructured text with inconsistent vocabularies, which vastly reduces the potential for direct intelligent analysis. Research has demonstrated that using semantic technologies such as ontologies to structure and enrich scientific data can greatly improve this potential. However, whilst many ontologies can be used for this purpose, a vast quantity of scientific terminology still lacks adequate semantic representation. A key area for expansion identified by the authors was the pharmacokinetic/pharmacodynamic (PK/PD) domain, due to its heavy use across many areas of the pharmaceutical industry. We have therefore produced a set of these terms, together with other bioassay-related terms, for incorporation into the BioAssay Ontology (BAO), which was identified as the most relevant ontology for this work. Use cases developed by experts in the field demonstrate how these new ontology terms can be used and set the scene for continuing this work into further relevant domains. The work described in this paper was part of Phase 1 of the SEED project (Semantically Enriching electronic laboratory notebook (eLN) Data).

Journal of Biomedical Semantics, vol. 14, no. 1, p. 10, 2023. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10416407/pdf/
Citations: 0
Improving the classification of cardinality phenotypes using collections.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-08-07. DOI: 10.1186/s13326-023-00290-y
Sarah M Alghamdi, Robert Hoehndorf

Motivation: Phenotypes are observable characteristics of an organism, and they can be highly variable. Information about phenotypes is collected in a clinical context to characterize disease, and is also collected in model organisms and stored in model organism databases, where it is used to understand gene functions. Phenotype data are also used in computational data analysis and machine learning methods to provide novel insights into disease mechanisms and to support personalized diagnosis of disease. For mammalian organisms and in a clinical context, ontologies such as the Human Phenotype Ontology and the Mammalian Phenotype Ontology are widely used to formally and precisely describe phenotypes. We specifically analyze axioms pertaining to phenotypes of collections of entities within a body, and we find that some of the axioms in phenotype ontologies lead to inferences that may not accurately reflect the underlying biological phenomena.

Results: We reformulate the phenotypes of collections of entities using an ontological theory of collections. By reformulating phenotypes of collections in phenotypes ontologies, we avoid potentially incorrect inferences pertaining to the cardinality of these collections. We apply our method to two phenotype ontologies and show that the reformulation not only removes some problematic inferences but also quantitatively improves biological data analysis.

Journal of Biomedical Semantics, vol. 14, no. 1, p. 9, 2023. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10405428/pdf/
Citations: 0
Semantically enabling clinical decision support recommendations.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 Mathematical & Computational Biology. Pub Date: 2023-07-18. DOI: 10.1186/s13326-023-00285-9
Oshani Seneviratne, Amar K Das, Shruthi Chari, Nkechinyere N Agu, Sabbir M Rashid, Jamie McCusker, Jade S Franklin, Miao Qi, Kristin P Bennett, Ching-Hua Chen, James A Hendler, Deborah L McGuinness

Background: Clinical decision support systems have been widely deployed to guide healthcare decisions on patient diagnosis, treatment choices, and patient management through evidence-based recommendations. These recommendations are typically derived from clinical practice guidelines created by clinical specialties or healthcare organizations. Although there have been many different technical approaches to encoding guideline recommendations into decision support systems, much of the previous work has not focused on enabling system-generated recommendations through the formalization of changes in a guideline, the provenance of a recommendation, and the applicability of the evidence. Prior work indicates that healthcare providers may not find that guideline-derived recommendations always meet their needs, for reasons such as lack of relevance, transparency, time pressure, and applicability to their clinical practice.

Results: We introduce several semantic techniques that model diseases based on clinical practice guidelines, the provenance of those guidelines, and the study cohorts they are based on, to enhance the capabilities of clinical decision support systems. We have explored ways to equip clinical decision support systems with semantic technologies that can represent and link to details in related items from the scientific literature, adapt quickly to changing guideline information, identify gaps, and support personalized explanations. Previous semantics-driven clinical decision systems have limited support in all these aspects, and we present ontologies and semantic web based software tools in three distinct areas, unified using a standard set of ontologies and a custom-built knowledge graph framework: (i) guideline modeling to characterize diseases, (ii) guideline provenance to attach evidence to treatment decisions from authoritative sources, and (iii) study cohort modeling to identify relevant research publications for complicated patients.

Conclusions: We have enhanced existing, evidence-based knowledge by developing ontologies and software that enable clinicians to conveniently access updates to and the provenance of guidelines, as well as gather additional information from research studies applicable to their patients' unique circumstances. Our software solutions leverage many widely used existing biomedical ontologies and build upon decades of knowledge representation and reasoning work, leading to explainable results.

Semantically enabling clinical decision support recommendations.
Oshani Seneviratne, Amar K Das, Shruthi Chari, Nkechinyere N Agu, Sabbir M Rashid, Jamie McCusker, Jade S Franklin, Miao Qi, Kristin P Bennett, Ching-Hua Chen, James A Hendler, Deborah L McGuinness
IF 1.9, Journal of Biomedical Semantics 14(1):8. Pub Date: 2023-07-18. DOI: 10.1186/s13326-023-00285-9
Citations: 0
FAIR-Checker: supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards.
IF 2.0, CAS Tier 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY. Pub Date: 2023-07-01. DOI: 10.1186/s13326-023-00289-5
Alban Gaignard, Thomas Rosnet, Frédéric De Lamotte, Vincent Lefort, Marie-Dominique Devignes

The current rise of Open Science and Reproducibility in the Life Sciences requires the creation of rich, machine-actionable metadata in order to better share and reuse biological digital resources such as datasets, bioinformatics tools, training materials, etc. For this purpose, FAIR principles have been defined for both data and metadata and adopted by large communities, leading to the definition of specific metrics. However, automatic FAIRness assessment is still difficult because computational evaluations frequently require technical expertise and can be time-consuming. As a first step to address these issues, we propose FAIR-Checker, a web-based tool to assess the FAIRness of metadata presented by digital resources. FAIR-Checker offers two main facets: a "Check" module providing a thorough metadata evaluation and recommendations, and an "Inspect" module which assists users in improving metadata quality and therefore the FAIRness of their resource. FAIR-Checker leverages Semantic Web standards and technologies such as SPARQL queries and SHACL constraints to automatically assess FAIR metrics. Users are notified of missing, necessary, or recommended metadata for various resource categories. We evaluate FAIR-Checker in the context of improving the FAIRification of individual resources, through better metadata, as well as analyzing the FAIRness of more than 25 thousand bioinformatics software descriptions.
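The Check-module workflow described above (flag missing required metadata, then suggest recommended fields) can be sketched in plain Python. The field names and rule sets below are illustrative placeholders, not FAIR-Checker's actual metrics:

```python
# Assumed minimal rule sets; FAIR-Checker's real metrics are richer
# (SPARQL/SHACL checks against Semantic Web metadata).
REQUIRED = ("identifier", "title", "license")
RECOMMENDED = ("description", "keywords", "creator")

def check_metadata(metadata: dict) -> dict:
    """Report missing required/recommended metadata fields for one resource."""
    missing_required = [f for f in REQUIRED if not metadata.get(f)]
    missing_recommended = [f for f in RECOMMENDED if not metadata.get(f)]
    return {
        "fair_ok": not missing_required,
        "missing_required": missing_required,
        "missing_recommended": missing_recommended,
    }

report = check_metadata({
    "identifier": "https://doi.org/10.1186/s13326-023-00289-5",
    "title": "FAIR-Checker",
    "keywords": ["FAIR", "metadata"],
})
print(report["missing_required"])     # ['license']
print(report["missing_recommended"])  # ['description', 'creator']
```

A real assessment would express such rules as SPARQL queries or SHACL shapes evaluated over the resource's RDF metadata rather than over a Python dict, which is what lets the checks run automatically across thousands of resources.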

Citations: 0
Features of a FAIR vocabulary.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY. Pub Date: 2023-06-01. DOI: 10.1186/s13326-023-00286-8
Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson, Mélanie Courtot

Background: The Findable, Accessible, Interoperable and Reusable (FAIR) Principles explicitly require the use of FAIR vocabularies, but what precisely constitutes a FAIR vocabulary remains unclear. Being able to define FAIR vocabularies, identify their features, and provide assessment approaches against those features can guide the development of vocabularies.

Results: We differentiate data, data resources and vocabularies used for FAIR, examine the application of the FAIR Principles to vocabularies, align their requirements with the Open Biomedical Ontologies principles, and propose FAIR Vocabulary Features (FVFs). We also design assessment approaches for FAIR vocabularies by mapping the FVFs to existing FAIR assessment indicators. Finally, we demonstrate how they can be used for evaluating and improving vocabularies using exemplary biomedical vocabularies.

Conclusions: Our work proposes features of FAIR vocabularies and corresponding indicators for assessing the FAIR levels of different types of vocabularies, identifies use cases for vocabulary engineers, and guides the evolution of vocabularies.

Citations: 2
Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY. Pub Date: 2023-05-30. DOI: 10.1186/s13326-023-00287-7
Weixin Xie, Kunjie Fan, Shijun Zhang, Lang Li

Background: Drug-drug interaction (DDI) information retrieval (IR) is an important natural language processing (NLP) task over the PubMed literature. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among overwhelmingly many negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper.

Results: PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keyword query in PubMed, while the unscreened pool includes all other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened-pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened-pool IR analysis, integrating random negative sampling, positive sampling, and similarity sampling improves precision over uncertainty sampling alone, from 0.72 to 0.81. When the SVM is replaced with a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened and unscreened pools. Deep learning also significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool.

Conclusions: By integrating various sampling schemes and deep learning algorithms into AL, the DDI IR analysis from literature is significantly improved. The random negative sampling and positive sampling are highly effective methods in improving AL analysis where the positive and negative samples are extremely imbalanced.
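Two of the sampling schemes evaluated above, uncertainty sampling and random negative sampling, can be sketched for a single active-learning round. The probability scores stand in for a real classifier's output, and all names and batch sizes are illustrative, not the authors' implementation:

```python
import random

def uncertainty_sample(pool: dict, k: int) -> list:
    """Pick the k abstracts whose predicted DDI probability is closest to 0.5."""
    return sorted(pool, key=lambda d: abs(pool[d] - 0.5))[:k]

def random_negative_sample(pool: dict, k: int, seed: int = 0) -> list:
    """Pick k abstracts at random from those the model scores as negative."""
    negatives = [d for d, p in pool.items() if p < 0.5]
    return random.Random(seed).sample(negatives, min(k, len(negatives)))

# Toy classifier scores: probability that each abstract reports a DDI.
scores = {"pmid1": 0.97, "pmid2": 0.52, "pmid3": 0.08, "pmid4": 0.46, "pmid5": 0.30}

# One AL round: query the 2 most uncertain abstracts plus 1 random negative.
batch = uncertainty_sample(scores, 2) + random_negative_sample(scores, 1)
print(batch[:2])  # ['pmid2', 'pmid4'] -- the scores nearest 0.5
```

In the full method, positive and similarity sampling would contribute to the same batch, the queried abstracts would be labeled by annotators, and the classifier (SVM or a deep model) would be retrained before the next round.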

Citations: 0
Constructing a knowledge graph for open government data: the case of Nova Scotia disease datasets.
IF 1.9, CAS Tier 3 (Engineering & Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY. Pub Date: 2023-04-18. DOI: 10.1186/s13326-023-00284-w
Enayat Rajabi, Rishi Midha, Jairo Francisco de Souza

The majority of available datasets in open government data are statistical. They are widely published by various governments to be used by the public and data consumers. However, most open government data portals do not provide the five-star Linked Data standard datasets. The published datasets are isolated from one another while conceptually connected. This paper constructs a knowledge graph for the disease-related datasets of a Canadian government data portal, Nova Scotia Open Data. We leveraged the Semantic Web technologies to transform the disease-related datasets into Resource Description Framework (RDF) and enriched them with semantic rules. An RDF data model using the RDF Cube vocabulary was designed in this work to develop a graph that adheres to best practices and standards, allowing for expansion, modification and flexible re-use. The study also discusses the lessons learned during the cross-dimensional knowledge graph construction and integration of open statistical datasets from multiple sources.
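The transformation described above (statistical rows into RDF using the RDF Cube vocabulary) can be sketched as follows. The URIs and property names are hypothetical placeholders, not the paper's actual data model:

```python
# Namespace for the W3C RDF Data Cube vocabulary used in the paper.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

def observation_to_triples(obs_uri: str, row: dict) -> list:
    """Serialize one statistical data row as N-Triples lines (a qb:Observation)."""
    triples = [f"<{obs_uri}> <{RDF_TYPE}> <{QB}Observation> ."]
    for prop, value in row.items():
        if isinstance(value, int):
            # Measure values become typed literals.
            triples.append(f'<{obs_uri}> <{prop}> "{value}"^^<{XSD_INT}> .')
        else:
            # Dimension values point at resource URIs.
            triples.append(f"<{obs_uri}> <{prop}> <{value}> .")
    return triples

row = {
    "http://example.org/dim/disease": "http://example.org/disease/influenza",
    "http://example.org/measure/cases": 42,
}
for t in observation_to_triples("http://example.org/obs/1", row):
    print(t)
```

In a full model, each observation would additionally be attached to a qb:DataSet whose data structure definition declares the dimensions and measures, which is what makes the isolated statistical datasets queryable as one connected graph.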

Citations: 1