
Journal of Biomedical Semantics: Latest Publications

BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-12-08 DOI: 10.1186/s13326-023-00301-y
Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth
Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences or molecular graphs. Other works incorporate such data, but assume that all entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We aim to understand how to incorporate multimodal data into biomedical KG embeddings, and analyze the resulting performance in comparison with traditional methods. We propose a modular framework for learning embeddings in KGs with entity attributes that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method yields competitive, yet lower, performance than baselines that do not use attribute data. When evaluated on the task of drug-protein interaction prediction, the method compares favorably with the baselines. Further analyses show that incorporating attribute data does outperform baselines over entities below a certain node degree, comprising approximately 75% of the diseases in the graph. We also observe that optimizing attribute encoders is a challenging task that increases optimization costs. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. BioBLP makes it possible to investigate different ways of incorporating multimodal biomedical data for learning representations in KGs. With a particular implementation, we find that incorporating attribute data does not consistently outperform baselines, but improvements are obtained on a comparatively large subset of entities below a specific node degree. Our results indicate a potential for improved performance in scientific discovery tasks where understudied areas of the KG would benefit from link prediction methods.
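To make the modular idea concrete, below is a minimal sketch (not the authors' implementation) of a KG embedding model in which entities with attribute data are embedded by modality-specific encoders, while entities lacking attributes fall back to a learned lookup table; the TransE-style scoring and all names are illustrative assumptions.

```python
# Illustrative sketch of a modular multimodal KG embedding model.
import torch
import torch.nn as nn

class ModularKGEmbedding(nn.Module):
    def __init__(self, num_entities, num_relations, dim, encoders):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)    # fallback lookup
        self.relation_emb = nn.Embedding(num_relations, dim)
        # e.g. {"protein": sequence_encoder, "molecule": graph_encoder}
        self.encoders = nn.ModuleDict(encoders)

    def embed_entity(self, idx, modality=None, attribute=None):
        # Route to a modality-specific encoder when attribute data exists;
        # otherwise fall back to the lookup table (handles missing attributes).
        if modality is not None and attribute is not None:
            return self.encoders[modality](attribute)
        return self.entity_emb(idx)

    def score(self, head, rel_idx, tail):
        # TransE-style plausibility: smaller translation distance is better.
        rel = self.relation_emb(rel_idx)
        return -torch.norm(head + rel - tail, p=1, dim=-1)
```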
Citations: 0
Assessing resolvability, parsability, and consistency of RDF resources: a use case in rare diseases.
IF 1.6 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-12-05 DOI: 10.1186/s13326-023-00299-3
Shuxin Zhang, Nirupama Benis, Ronald Cornet

Introduction: Healthcare data and the knowledge gleaned from it play a key role in improving the health of current and future patients. These knowledge sources are regularly represented as 'linked' resources based on the Resource Description Framework (RDF). Making resources 'linkable' to facilitate their interoperability is especially important in the rare-disease domain, where health resources are scattered and scarce. However, to benefit from using RDF, resources need to be of good quality. Based on existing metrics, we aim to assess the quality of RDF resources related to rare diseases and provide recommendations for their improvement.

Methods: Sixteen resources of relevance for the rare-disease domain were selected: two schemas, three metadatasets, and eleven ontologies. These resources were tested on six objective metrics regarding resolvability, parsability, and consistency. Any URI that failed the test based on any of the six metrics was recorded as an error. The error count and percentage of each tested resource were recorded. The assessment results were represented in RDF, using the Data Quality Vocabulary schema.

Results: For three out of the six metrics, the assessment revealed quality issues. Eleven resources have non-resolvable URIs, with proportions of all URIs ranging from 0.1% (6/6,712) in the Anatomical Therapeutic Chemical Classification to 13.7% (17/124) in the WikiPathways Ontology; seven resources have undefined URIs; and two resources incorrectly use properties of the 'owl:ObjectProperty' type. Individual errors were examined to generate suggestions for the development of high-quality RDF resources, including the tested resources.

Conclusion: We assessed the resolvability, parsability, and consistency of RDF resources in the rare-disease domain, and determined the extent of these types of errors that potentially affect interoperability. The qualitative investigation on these errors reveals how they can be avoided. All findings serve as valuable input for the development of a guideline for creating high-quality RDF resources, thereby enhancing the interoperability of biomedical resources.
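As a rough illustration of what resolvability and parsability testing involves, the sketch below checks whether URIs resolve over HTTP and whether an RDF file parses with rdflib; the paper's six metrics and its Data Quality Vocabulary reporting are considerably richer, and the function names here are invented.

```python
# Illustrative resolvability and parsability checks for RDF resources.
import requests
from rdflib import Graph

def non_resolvable_uris(uris, timeout=10):
    """Return the URIs that fail to resolve (HTTP errors or timeouts)."""
    errors = []
    for uri in uris:
        try:
            r = requests.head(uri, allow_redirects=True, timeout=timeout)
            if r.status_code >= 400:
                errors.append(uri)
        except requests.RequestException:
            errors.append(uri)
    return errors

def is_parsable(path, fmt="xml"):
    """Return True if the RDF file parses cleanly in the given format."""
    try:
        Graph().parse(path, format=fmt)
        return True
    except Exception:
        return False
```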

Citations: 0
Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-11-28 DOI: 10.1186/s13326-023-00298-4
Gollam Rabby, Jennifer D'Souza, Allard Oelen, Lucie Dvorackova, Vojtěch Svátek, Sören Auer

Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly document prediction task over categorized scholarly documents. We also introduce a new approach that enhances the document representation method with a domain-independent knowledge graph to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus, with scholarly documents on the theme of COVID-19. This study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT). The TF-IDF document representation method works better than the others. Of the various machine learning methods tested, logistic regression outperformed the others for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction when a domain-independent knowledge graph, specifically DBpedia, was used to enhance the document representation. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation method. We also enhance the BOW document representation with direct types (RDF types) and unqualified relations from DBpedia. In this experiment, the enhanced document representation had no measurable impact on scholarly document category classification, but it did have an effect on influential scholarly document prediction with categorical data.
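For orientation, here is a minimal sketch of the best-performing combination reported above, TF-IDF features with a logistic regression classifier, using scikit-learn; the documents and category labels are invented stand-ins, and the DBpedia-based enrichment is not shown.

```python
# Illustrative TF-IDF + logistic regression pipeline for document categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["COVID-19 vaccine efficacy in older adults ...",
        "Genome sequencing of SARS-CoV-2 variants ..."]
labels = ["clinical", "genomics"]  # hypothetical category labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
print(clf.predict(["Efficacy of booster doses against new variants"]))
```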

Citations: 0
Data management plans as linked open data: exploiting ARGOS FAIR and machine actionable outputs in the OpenAIRE research graph.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-11-02 DOI: 10.1186/s13326-023-00297-5
Elli Papadopoulou, Alessia Bardi, George Kakaletris, Diamadis Tziotzios, Paolo Manghi, Natalia Manola

Background: Open Science Graphs (OSGs) are scientific knowledge graphs representing different entities of the research lifecycle (e.g. projects, people, research outcomes, institutions) and the relationships among them. They present a contextualized view of current research that supports discovery, re-use, reproducibility, monitoring, transparency and omni-comprehensive assessment. A Data Management Plan (DMP) contains information concerning both the research processes and the data collected, generated and/or re-used during a project's lifetime. Automated solutions and workflows that connect DMPs with the actual data and other contextual information (e.g., publications, funding) are missing from the landscape. The practice of submitting DMPs as deliverables also limits their findability. In an open and FAIR-enabling research ecosystem, information linking between research processes and research outputs is essential. The ARGOS tool for FAIR data management contributes to the OpenAIRE Research Graph (RG) and utilises its underlying services and trusted sources to progressively automate the validation of Research Data Management (RDM) practices.

Results: A comparative analysis was conducted of the data models of ARGOS and the OpenAIRE Research Graph against the DMP Common Standard. Following this, we extended ARGOS with export-format converters and semantic tagging, and the OpenAIRE RG with a DMP entity and semantics between existing entities and relationships. This enabled the integration of ARGOS machine-actionable DMPs (ma-DMPs) into the OpenAIRE OSG, enriching and exposing DMPs as FAIR outputs.

Conclusions: This paper, to our knowledge, is the first to introduce exposing ma-DMPs in OSGs and to make the link between OSGs and DMPs, introducing the latter as entities in the research lifecycle. Further, it provides insight into ARGOS DMP service interoperability practices and the integrations that populate the OpenAIRE Research Graph with DMP entities and relationships, strengthening both the FAIRness of outputs and standardized information exchange.
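To illustrate the export-converter idea, the sketch below maps a hypothetical tool-internal DMP record onto the machine-actionable JSON layout of the RDA DMP Common Standard; the internal record shape is invented, and the actual ARGOS converters cover far more of the standard.

```python
# Hypothetical converter from an internal DMP record to maDMP-style JSON.
import json
from datetime import datetime, timezone

def to_madmp(record):
    return {
        "dmp": {
            "title": record["title"],
            "modified": datetime.now(timezone.utc).isoformat(),
            "dataset": [
                {"title": d["name"],
                 "distribution": [{"access_url": d["url"]}]}
                for d in record.get("datasets", [])
            ],
        }
    }

print(json.dumps(to_madmp({"title": "Example DMP",
                           "datasets": [{"name": "survey data",
                                         "url": "https://example.org/data"}]}),
                 indent=2))
```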

Citations: 0
Context-based refinement of mappings in evolving life science ontologies.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-10-19 DOI: 10.1186/s13326-023-00294-8
Victor Eiti Yamamoto, Juliana Medeiros Destro, Julio Cesar Dos Reis

Background: Biomedical computational systems benefit from ontologies and their associated mappings. Indeed, aligned ontologies in the life sciences play a central role in several semantics-enabled tasks, especially in data exchange. It is crucial to keep alignments up to date with the new knowledge introduced in each ontology release. Refining ontology mappings in place, based on newly added concepts, demands further research.

Results: This article studies the mapping refinement phenomenon by proposing techniques to refine a set of established mappings based on the evolution of biomedical ontologies. In our first analysis, we investigate ways of suggesting correspondences with the new ontology version without applying a matching operation to the whole set of ontology entities. In the second analysis, the refinement technique enables deriving new mappings and updating the semantic type of the mapping beyond equivalence. Our study explores the neighborhood of concepts in the alignment process to refine mapping sets.

Conclusion: Experimental evaluations with several versions of aligned biomedical ontologies were conducted. These experiments demonstrated the usefulness of ontology evolution changes in supporting the mapping refinement process. Furthermore, using the context of ontological concepts proved effective in our techniques.
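As a minimal sketch of the neighborhood intuition described above: when a concept is added in a new ontology release, candidate correspondences are sought only among the target-side neighbors of its parents' existing mappings, rather than by re-matching the whole ontology. The string-ratio similarity and data structures below are illustrative assumptions, not the paper's technique.

```python
# Illustrative neighborhood-restricted candidate search for a new concept.
from difflib import SequenceMatcher

def refine_candidates(new_label, parents, mappings, target_neighbors,
                      threshold=0.8):
    candidates = []
    for parent in parents:                       # parents mapped previously
        for target in mappings.get(parent, []):  # their target-side matches
            for neighbor, label in target_neighbors.get(target, []):
                score = SequenceMatcher(None, new_label.lower(),
                                        label.lower()).ratio()
                if score >= threshold:
                    candidates.append((neighbor, score))
    return sorted(candidates, key=lambda c: -c[1])
```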

背景:生物医学计算系统受益于本体论及其相关映射。事实上,生命科学中的对齐本体在一些语义支持的任务中发挥着核心作用,尤其是在数据交换中。根据新的本体发布中插入的新知识来保持最新的对齐是至关重要的。在添加概念的基础上,对本体映射进行适当的细化需要进一步的研究。结果:本文研究了映射精化现象,提出了基于生物医学本体论进化来精化一组已建立的映射的技术。在我们的第一次分析中,我们研究了在不将匹配操作应用于整个本体实体集的情况下,建议与新本体版本对应的方法。在第二种分析中,精化技术能够导出新的映射,并更新映射的语义类型,使其超越等价性。我们的研究探索了对齐过程中概念的邻域,以完善映射集。结论:对几种版本的生物医学本体进行了实验评估。这些实验证明了本体进化变化对支持映射精化过程的有用性。此外,在本体论概念中使用上下文在我们的技术中是有效的。
Citations: 0
Analysis and implementation of the DynDiff tool when comparing versions of ontology.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-09-28 DOI: 10.1186/s13326-023-00295-7
Sara Diaz Benavides, Silvio D Cardoso, Marcos Da Silveira, Cédric Pruski

Background: Ontologies play a key role in the management of medical knowledge because they have the properties to support a wide range of knowledge-intensive tasks. The dynamic nature of knowledge requires frequent changes to the ontologies to keep them up-to-date. The challenge is to understand and manage these changes and their impact on dependent systems well, in order to handle the growing volume of data annotated with ontologies and the limited documentation describing the changes.

Methods: We present a method to detect and characterize the changes occurring between different versions of an ontology, together with an ontology of changes entitled DynDiffOnto, designed according to Semantic Web best practices and FAIR principles. We further describe the implementation of the method and the evaluation of the tool with different ontologies from the biomedical domain (i.e. ICD9-CM, MeSH, NCIt, SNOMEDCT, GO, IOBC and CIDO), showing its performance in terms of execution time and capacity to classify ontological changes, compared with other state-of-the-art approaches.

Results: The experiments show top-level performance of DynDiff for large ontologies and good performance for smaller ones, with respect to execution time and the capability to identify complex changes. In this paper, we further highlight the impact of ontology matchers on the diff computation and the possibility of parameterizing the matcher in DynDiff, enabling benefits from state-of-the-art matchers.

Conclusion: DynDiff is an efficient tool to compute differences between ontology versions and classify these differences according to DynDiffOnto concepts. This work also contributes to a better understanding of ontological changes through DynDiffOnto, which was designed to express the semantics of the changes between versions of an ontology and can be used to document the evolution of an ontology.
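As a baseline for what a version diff computes, the sketch below derives added and removed triples between two serializations of an ontology using rdflib set differences; DynDiff goes well beyond this, classifying complex changes against DynDiffOnto. File names and the RDF/XML format are assumptions.

```python
# Illustrative low-level diff between two ontology versions.
from rdflib import Graph

old, new = Graph(), Graph()
old.parse("ontology_v1.owl", format="xml")  # hypothetical file names
new.parse("ontology_v2.owl", format="xml")

added = set(new) - set(old)      # triples only in the new version
removed = set(old) - set(new)    # triples only in the old version
print(f"{len(added)} added triples, {len(removed)} removed triples")
```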

Citations: 0
Development and validation of the early warning system scores ontology.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-09-20 DOI: 10.1186/s13326-023-00296-6
Cilia E Zayas, Justin M Whorton, Kevin W Sexton, Charles D Mabry, S Clint Dowland, Mathias Brochhausen

Background: Clinical early warning scoring systems have improved patient outcomes in a range of specializations and global contexts. These systems are used to predict patient deterioration. A multitude of patient-level physiological decompensation data has been made available through the widespread integration of early warning scoring systems within EHRs across national and international health care organizations. These data can be used to promote secondary research. The diversity of early warning scoring systems and EHR systems is one barrier to secondary analysis of early warning score data. Because early warning score parameters vary, it is difficult to query across providers and EHR systems. Moreover, mapping and merging the parameters is challenging. We develop and validate the Early Warning System Scores Ontology (EWSSO), representing three commonly used early warning scores: the National Early Warning Score (NEWS), the six-item modified Early Warning Score (MEWS), and the quick Sequential Organ Failure Assessment (qSOFA), to overcome these problems.

Methods: We apply the Software Development Lifecycle Framework (conceived by Winston Royce in 1970) to model the activities involved in organizing, producing, and evaluating the EWSSO. We also follow OBO Foundry principles and the principles of best practice for domain ontology design, terms, definitions, and classifications to meet BFO requirements for ontology building.

Results: We developed twenty-nine new classes and reused four classes and four object properties to create the EWSSO. When we queried the data, our ontology-based process differentiated between necessary and unnecessary features for score calculation 100% of the time. Further, our process applied the proper temperature conversions for the early warning score calculator 100% of the time.

Conclusions: Using synthetic datasets, we demonstrate that the EWSSO can be used to generate and query health system data on vital signs and provide input to calculate the NEWS, six-item MEWS, and qSOFA. Future work includes extending the EWSSO by introducing additional early warning scores for adult and pediatric patient populations and creating patient profiles that contain clinical, demographic, and outcomes data regarding the patient.
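For context, the qSOFA score that the ontology-driven pipeline helps calculate is itself simple: one point each for low systolic blood pressure, elevated respiratory rate, and altered mentation. The thresholds below follow the published qSOFA definition; the function signature is an invented illustration.

```python
# Illustrative qSOFA calculator (published thresholds).
def qsofa(systolic_bp, respiratory_rate, gcs):
    score = 0
    if systolic_bp <= 100:        # mmHg
        score += 1
    if respiratory_rate >= 22:    # breaths per minute
        score += 1
    if gcs < 15:                  # Glasgow Coma Scale: altered mentation
        score += 1
    return score

print(qsofa(systolic_bp=95, respiratory_rate=24, gcs=15))  # -> 2
```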

Citations: 0
Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-09-01 DOI: 10.1186/s13326-023-00292-w
Mariana Neves, Antonina Klippert, Fanny Knöspel, Juliane Rudeck, Ailine Stolz, Zsofia Ban, Markus Becker, Kai Diederich, Barbara Grune, Pia Kahnau, Nils Ohnesorge, Johannes Pucher, Gilbert Schönfelder, Bettina Bert, Daniel Butzke

Current animal protection laws require the replacement of animal experiments with alternative methods whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of an enormously large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine-tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the experimental model used, according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed at resolving the disagreements from the first round. Furthermore, we conducted various supervised machine learning experiments to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, ranging from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model fine-tuned on our corpus, which achieved an overall f-score of 0.83. We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that it is suitable for training reliable predictive models for the automatic classification of biomedical literature according to the experimental models used. Our SMAFIRA ("Smart feature-based interactive") search tool ( https://smafira.bf3r.de ) will employ this classifier to support the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
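A minimal sketch of fine-tuning a PubMedBERT-style model for the eight-label task with the Hugging Face transformers API follows; the multi-label framing, checkpoint name, and the elided training loop are assumptions rather than the authors' exact setup.

```python
# Illustrative multi-label setup for experimental-model classification.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["in vivo", "organs", "primary cells", "immortal cell lines",
          "invertebrates", "humans", "in silico", "other"]
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=len(LABELS),
    problem_type="multi_label_classification")

batch = tokenizer(["Mouse hepatocytes were cultured ..."],
                  truncation=True, padding=True, return_tensors="pt")
logits = model(**batch).logits   # one score per label; train with BCE loss
```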

Citations: 0
Automatic transparency evaluation for open knowledge extraction systems.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-08-31 DOI: 10.1186/s13326-023-00293-9
Maryam Basereh, Annalina Caputo, Rob Brennan
Background: This paper proposes Cyrus, a new transparency evaluation framework for Open Knowledge Extraction (OKE) systems. Cyrus is based on state-of-the-art transparency models and linked data quality assessment dimensions. It brings together a comprehensive view of transparency dimensions for OKE systems. The Cyrus framework is used to evaluate the transparency of three linked datasets, which are built from the same corpus by three state-of-the-art OKE systems. The evaluation is performed automatically using a combination of three state-of-the-art FAIRness (Findability, Accessibility, Interoperability, Reusability) assessment tools and a linked data quality evaluation framework called Luzzu. This evaluation includes six Cyrus data transparency dimensions for which existing assessment tools could be identified. OKE systems extract structured knowledge from unstructured or semi-structured text in the form of linked data. These systems are fundamental components of advanced knowledge services. However, due to the lack of a transparency framework for OKE, most OKE systems are not transparent, meaning that their processes and outcomes are not understandable and interpretable. A comprehensive framework sheds light on different aspects of transparency, allows comparison between the transparency of different systems by supporting the development of transparency scores, gives insight into the transparency weaknesses of a system, and suggests ways to improve them. Automatic transparency evaluation helps with scalability and facilitates transparency assessment. The transparency problem has been identified as critical by the European Union Trustworthy Artificial Intelligence (AI) guidelines. In this paper, Cyrus provides the first comprehensive view of transparency dimensions for OKE systems by merging the perspectives of the FAccT (Fairness, Accountability, and Transparency), FAIR, and linked data quality research communities.

Results: In Cyrus, data transparency includes ten dimensions grouped in two categories. In this paper, six of these dimensions, i.e., provenance, interpretability, understandability, licensing, availability, and interlinking, have been evaluated automatically for three state-of-the-art OKE systems, using state-of-the-art metrics and tools. Covid-on-the-Web is identified as having the highest mean transparency.

Conclusions: This is the first research to study the transparency of OKE systems that provides a comprehensive set of transparency dimensions spanning ethics, trustworthy AI, and data quality approaches to transparency. It also demonstrates how to perform automated transparency evaluation that combines existing FAIRness and linked data quality assessment tools for the first time. We show that state-of-the-art OKE systems vary in the transparency of the linked data generated and that these differences can be automatically quantified, leading to potential applications in trustworthy AI, compliance, data protection, data governance, and future OKE system design and testing.
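To show how per-dimension results might roll up into a comparable score, here is a sketch of a weighted mean over the six evaluated dimensions; the weights and 0-1 scores are invented for illustration and are not Cyrus's actual scoring model.

```python
# Illustrative aggregation of per-dimension transparency scores.
DIMENSIONS = ["provenance", "interpretability", "understandability",
              "licensing", "availability", "interlinking"]

def transparency_score(scores, weights=None):
    weights = weights or {d: 1.0 for d in scores}
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total

example = dict(zip(DIMENSIONS, [0.9, 0.6, 0.7, 1.0, 0.8, 0.5]))
print(transparency_score(example))  # mean transparency for one dataset
```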
Citations: 0
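To make the scoring step described above concrete, here is a minimal sketch (an illustration, not the authors' implementation) of how per-dimension scores, such as those produced by the FAIRness tools and Luzzu metrics named in the abstract, could be aggregated into a mean-transparency figure per OKE system. The six dimension names are those evaluated in the paper; the system names, scores, and the equal-weight aggregation are placeholder assumptions.

```python
from statistics import mean

# The six Cyrus data-transparency dimensions evaluated automatically in the paper.
DIMENSIONS = ("provenance", "interpretability", "understandability",
              "licensing", "availability", "interlinking")

# Hypothetical per-dimension scores normalised to [0, 1]; in practice these
# would come from FAIRness assessment tools and Luzzu quality metrics.
systems = {
    "oke-system-A": dict(zip(DIMENSIONS, (0.9, 0.6, 0.7, 1.0, 0.8, 0.5))),
    "oke-system-B": dict(zip(DIMENSIONS, (0.7, 0.8, 0.6, 0.5, 0.9, 0.4))),
}

def mean_transparency(scores: dict) -> float:
    """Unweighted mean over the six evaluated dimensions."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return mean(scores[d] for d in DIMENSIONS)

# Rank systems by mean transparency, highest first.
for name, scores in sorted(systems.items(),
                           key=lambda kv: mean_transparency(kv[1]), reverse=True):
    print(f"{name}: mean transparency = {mean_transparency(scores):.2f}")
```

An unweighted mean is the simplest aggregation; whether Cyrus weights its dimensions equally is an assumption here, and swapping `mean` for a weighted sum would be a one-line change.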
Multi-domain knowledge graph embeddings for gene-disease association prediction.
IF 1.9 CAS Tier 3 (Engineering & Technology) Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-08-14 DOI: 10.1186/s13326-023-00291-x
Susana Nunes, Rita T Sousa, Catia Pesquita

Background: Predicting gene-disease associations typically requires exploring diverse sources of information as well as sophisticated computational approaches. Knowledge graph embeddings can help tackle these challenges by creating representations of genes and diseases based on the scientific knowledge described in ontologies, which can then be explored by machine learning algorithms. However, state-of-the-art knowledge graph embeddings are produced over a single ontology or multiple but disconnected ones, ignoring the impact that considering multiple interconnected domains can have on complex tasks such as gene-disease association prediction.

Results: We propose a novel approach to predict gene-disease associations using rich semantic representations based on knowledge graph embeddings over multiple ontologies linked by logical definitions and compound ontology mappings. The experiments showed that considering richer knowledge graphs significantly improves gene-disease prediction and that different knowledge graph embedding methods benefit more from distinct types of semantic richness.

Conclusions: This work demonstrated the potential for knowledge graph embeddings across multiple and interconnected biomedical ontologies to support gene-disease prediction. It also paved the way for considering other ontologies or tackling other tasks where multiple perspectives over the data can be beneficial. All software and data are freely available.

{"title":"Multi-domain knowledge graph embeddings for gene-disease association prediction.","authors":"Susana Nunes, Rita T Sousa, Catia Pesquita","doi":"10.1186/s13326-023-00291-x","DOIUrl":"10.1186/s13326-023-00291-x","url":null,"abstract":"<p><strong>Background: </strong>Predicting gene-disease associations typically requires exploring diverse sources of information as well as sophisticated computational approaches. Knowledge graph embeddings can help tackle these challenges by creating representations of genes and diseases based on the scientific knowledge described in ontologies, which can then be explored by machine learning algorithms. However, state-of-the-art knowledge graph embeddings are produced over a single ontology or multiple but disconnected ones, ignoring the impact that considering multiple interconnected domains can have on complex tasks such as gene-disease association prediction.</p><p><strong>Results: </strong>We propose a novel approach to predict gene-disease associations using rich semantic representations based on knowledge graph embeddings over multiple ontologies linked by logical definitions and compound ontology mappings. The experiments showed that considering richer knowledge graphs significantly improves gene-disease prediction and that different knowledge graph embeddings methods benefit more from distinct types of semantic richness.</p><p><strong>Conclusions: </strong>This work demonstrated the potential for knowledge graph embeddings across multiple and interconnected biomedical ontologies to support gene-disease prediction. It also paved the way for considering other ontologies or tackling other tasks where multiple perspectives over the data can be beneficial. All software and data are freely available.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"14 1","pages":"11"},"PeriodicalIF":1.9,"publicationDate":"2023-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10426189/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10003461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
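To illustrate the kind of downstream pipeline this abstract describes, the sketch below (an assumption, not the authors' code) scores candidate gene-disease pairs with a standard classifier over precomputed knowledge-graph embeddings. The embedding tables, entity identifiers, and labelled pairs are random placeholders standing in for vectors actually learned over the interconnected ontologies, so the reported AUC will hover around chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
DIM = 64  # embedding dimensionality (placeholder)

# Hypothetical embedding tables keyed by entity identifier; in the paper these
# vectors would be learned by knowledge graph embedding methods over ontologies.
gene_emb = {f"GENE:{i}": rng.normal(size=DIM) for i in range(100)}
disease_emb = {f"DISEASE:{i}": rng.normal(size=DIM) for i in range(20)}

# Hypothetical labelled pairs: 1 = known association, 0 = negative sample.
pairs = [(f"GENE:{rng.integers(100)}", f"DISEASE:{rng.integers(20)}",
          int(rng.integers(2))) for _ in range(500)]

# Represent each pair by concatenating its two embeddings; the Hadamard
# product is another common pair operator.
X = np.array([np.concatenate([gene_emb[g], disease_emb[d]]) for g, d, _ in pairs])
y = np.array([label for _, _, label in pairs])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

With real embeddings, the pair-representation step is where the effect of richer, interconnected knowledge graphs reported in the Results would surface as a higher AUC.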