Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang
{"title":"ProLLM:用于蛋白质-蛋白质相互作用预测的蛋白质思维链增强型 LLM","authors":"Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang","doi":"arxiv-2405.06649","DOIUrl":null,"url":null,"abstract":"The prediction of protein-protein interactions (PPIs) is crucial for\nunderstanding biological functions and diseases. Previous machine learning\napproaches to PPI prediction mainly focus on direct physical interactions,\nignoring the broader context of nonphysical connections through intermediate\nproteins, thus limiting their effectiveness. The emergence of Large Language\nModels (LLMs) provides a new opportunity for addressing this complex biological\nchallenge. By transforming structured data into natural language prompts, we\ncan map the relationships between proteins into texts. This approach allows\nLLMs to identify indirect connections between proteins, tracing the path from\nupstream to downstream. Therefore, we propose a novel framework ProLLM that\nemploys an LLM tailored for PPI for the first time. Specifically, we propose\nProtein Chain of Thought (ProCoT), which replicates the biological mechanism of\nsignaling pathways as natural language prompts. ProCoT considers a signaling\npathway as a protein reasoning process, which starts from upstream proteins and\npasses through several intermediate proteins to transmit biological signals to\ndownstream proteins. Thus, we can use ProCoT to predict the interaction between\nupstream proteins and downstream proteins. The training of ProLLM employs the\nProCoT format, which enhances the model's understanding of complex biological\nproblems. In addition to ProCoT, this paper also contributes to the exploration\nof embedding replacement of protein sites in natural language prompts, and\ninstruction fine-tuning in protein knowledge datasets. 
We demonstrate the\nefficacy of ProLLM through rigorous validation against benchmark datasets,\nshowing significant improvement over existing methods in terms of prediction\naccuracy and generalizability. The code is available at:\nhttps://github.com/MingyuJ666/ProLLM.","PeriodicalId":501325,"journal":{"name":"arXiv - QuanBio - Molecular Networks","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction\",\"authors\":\"Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang\",\"doi\":\"arxiv-2405.06649\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The prediction of protein-protein interactions (PPIs) is crucial for\\nunderstanding biological functions and diseases. Previous machine learning\\napproaches to PPI prediction mainly focus on direct physical interactions,\\nignoring the broader context of nonphysical connections through intermediate\\nproteins, thus limiting their effectiveness. The emergence of Large Language\\nModels (LLMs) provides a new opportunity for addressing this complex biological\\nchallenge. By transforming structured data into natural language prompts, we\\ncan map the relationships between proteins into texts. This approach allows\\nLLMs to identify indirect connections between proteins, tracing the path from\\nupstream to downstream. Therefore, we propose a novel framework ProLLM that\\nemploys an LLM tailored for PPI for the first time. Specifically, we propose\\nProtein Chain of Thought (ProCoT), which replicates the biological mechanism of\\nsignaling pathways as natural language prompts. 
ProCoT considers a signaling\\npathway as a protein reasoning process, which starts from upstream proteins and\\npasses through several intermediate proteins to transmit biological signals to\\ndownstream proteins. Thus, we can use ProCoT to predict the interaction between\\nupstream proteins and downstream proteins. The training of ProLLM employs the\\nProCoT format, which enhances the model's understanding of complex biological\\nproblems. In addition to ProCoT, this paper also contributes to the exploration\\nof embedding replacement of protein sites in natural language prompts, and\\ninstruction fine-tuning in protein knowledge datasets. We demonstrate the\\nefficacy of ProLLM through rigorous validation against benchmark datasets,\\nshowing significant improvement over existing methods in terms of prediction\\naccuracy and generalizability. The code is available at:\\nhttps://github.com/MingyuJ666/ProLLM.\",\"PeriodicalId\":501325,\"journal\":{\"name\":\"arXiv - QuanBio - Molecular Networks\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Molecular Networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.06649\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Molecular 
Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.06649","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction
The prediction of protein-protein interactions (PPIs) is crucial for
understanding biological functions and diseases. Previous machine learning
approaches to PPI prediction mainly focus on direct physical interactions,
ignoring the broader context of nonphysical connections through intermediate
proteins, thus limiting their effectiveness. The emergence of Large Language
Models (LLMs) provides a new opportunity for addressing this complex biological
challenge. By transforming structured data into natural language prompts, we
can map the relationships between proteins into text. This approach allows
LLMs to identify indirect connections between proteins, tracing the path from
upstream to downstream. Therefore, we propose ProLLM, a novel framework that,
for the first time, employs an LLM tailored for PPI prediction. Specifically, we propose
Protein Chain of Thought (ProCoT), which replicates the biological mechanism of
signaling pathways as natural language prompts. ProCoT considers a signaling
pathway as a protein reasoning process, which starts from upstream proteins and
passes through several intermediate proteins to transmit biological signals to
downstream proteins. Thus, we can use ProCoT to predict the interaction between
upstream proteins and downstream proteins. The training of ProLLM employs the
ProCoT format, which enhances the model's understanding of complex biological
problems. In addition to ProCoT, this paper also explores replacing protein
sites in natural language prompts with protein embeddings, and instruction
fine-tuning on protein knowledge datasets. We demonstrate the
efficacy of ProLLM through rigorous validation against benchmark datasets,
showing significant improvement over existing methods in terms of prediction
accuracy and generalizability. The code is available at:
https://github.com/MingyuJ666/ProLLM.
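The ProCoT idea described above — serializing a signaling pathway as a step-by-step reasoning chain from upstream to downstream proteins — can be illustrated with a minimal sketch. This is a hypothetical illustration only: the function name, prompt wording, and protein names (EGFR, GRB2, SOS1, RAS) are illustrative assumptions, not the paper's actual prompt format.

```python
# Hypothetical sketch of ProCoT-style prompt construction.
# The wording and helper name are illustrative assumptions; the
# actual ProLLM prompt format may differ.
def build_procot_prompt(pathway):
    """Serialize a signaling pathway into a chain-of-thought prompt.

    `pathway` is an ordered list of protein names, upstream first,
    downstream last, with zero or more intermediates in between.
    """
    upstream, *intermediates, downstream = pathway
    steps = []
    prev = upstream
    # Walk the pathway, emitting one reasoning step per hop.
    for protein in intermediates:
        steps.append(f"{prev} transmits a signal to {protein}.")
        prev = protein
    steps.append(f"{prev} transmits a signal to {downstream}.")
    question = (f"Given this signaling path, does {upstream} "
                f"indirectly interact with {downstream}?")
    return " ".join(steps) + " " + question

# Example pathway (illustrative protein names from EGFR/RAS signaling):
prompt = build_procot_prompt(["EGFR", "GRB2", "SOS1", "RAS"])
```

The key design point the abstract describes is that the model predicts the upstream-downstream interaction by reasoning through the intermediate hops, rather than scoring only the direct protein pair.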