BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs
Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth
Journal of Biomedical Semantics, published 2023-12-08. DOI: 10.1186/s13326-023-00301-y (https://doi.org/10.1186/s13326-023-00301-y)
Abstract
Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences or molecular graphs. Other works incorporate such data, but assume that all entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We aim to understand how to incorporate multimodal data into biomedical KG embeddings, and analyze the resulting performance in comparison with traditional methods. We propose a modular framework for learning embeddings in KGs with entity attributes that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method is competitive with, yet performs below, baselines that do not use attribute data. When evaluated on the task of drug-protein interaction prediction, the method compares favorably with the baselines. Further analyses show that incorporating attribute data does outperform baselines over entities below a certain node degree, comprising approximately 75% of the diseases in the graph. We also observe that optimizing attribute encoders is a challenging task that increases optimization costs. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. BioBLP makes it possible to investigate different ways of incorporating multimodal biomedical data for learning representations in KGs. With a particular implementation, we find that incorporating attribute data does not consistently outperform baselines, but improvements are obtained on a comparatively large subset of entities below a specific node degree. Our results indicate a potential for improved performance in scientific discovery tasks where understudied areas of the KG would benefit from link prediction methods.
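
The modular idea described in the abstract (one encoder per attribute modality, plus support for entities with missing attributes) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical and not taken from the BioBLP code base, the hash-seeded vectors merely stand in for trained neural encoders, and the TransE-style scoring function is one common choice that the abstract does not name.

```python
import hashlib
import random
from typing import Callable, Dict, List, Optional

DIM = 8  # embedding size, chosen arbitrarily for this sketch


def _seeded_vector(key: str, dim: int = DIM) -> List[float]:
    """Deterministic pseudo-random vector standing in for learned parameters."""
    seed = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_protein(sequence: str) -> List[float]:
    """Hypothetical stand-in for a protein sequence encoder."""
    return _seeded_vector("protein:" + sequence)


def encode_molecule(smiles: str) -> List[float]:
    """Hypothetical stand-in for a molecular graph encoder."""
    return _seeded_vector("molecule:" + smiles)


class ModularEntityEncoder:
    """Routes each entity to the encoder for its modality.

    Entities with no attribute, or with a modality that has no registered
    encoder, fall back to a per-entity lookup embedding.
    """

    def __init__(self, encoders: Dict[str, Callable[[str], List[float]]]):
        self.encoders = encoders
        self.lookup: Dict[str, List[float]] = {}

    def embed(
        self,
        entity_id: str,
        modality: Optional[str] = None,
        attribute: Optional[str] = None,
    ) -> List[float]:
        if modality in self.encoders and attribute is not None:
            return self.encoders[modality](attribute)
        # Missing attribute: use (and cache) a lookup-table embedding.
        if entity_id not in self.lookup:
            self.lookup[entity_id] = _seeded_vector("lookup:" + entity_id)
        return self.lookup[entity_id]


def transe_score(head: List[float], relation: List[float], tail: List[float]) -> float:
    """TransE-style score: negative L1 distance between head + relation and tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


encoder = ModularEntityEncoder({"protein": encode_protein, "molecule": encode_molecule})
drug = encoder.embed("drug:1", modality="molecule", attribute="CCO")
protein = encoder.embed("protein:1", modality="protein", attribute="MKTAYIAK")
disease = encoder.embed("disease:1")  # no attribute available, lookup fallback
targets = _seeded_vector("relation:targets")
score = transe_score(drug, targets, protein)
```

Because every entity ends up as a vector of the same size regardless of how it was produced, the scoring function does not need to know which modality an entity came from. That separation is what makes it possible to swap encoders in and out.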
About the journal:
Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:
Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.
Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.