Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang
{"title":"ProLLM:用于蛋白质-蛋白质相互作用预测的蛋白质思维链增强型 LLM","authors":"Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang","doi":"arxiv-2405.06649","DOIUrl":null,"url":null,"abstract":"The prediction of protein-protein interactions (PPIs) is crucial for\nunderstanding biological functions and diseases. Previous machine learning\napproaches to PPI prediction mainly focus on direct physical interactions,\nignoring the broader context of nonphysical connections through intermediate\nproteins, thus limiting their effectiveness. The emergence of Large Language\nModels (LLMs) provides a new opportunity for addressing this complex biological\nchallenge. By transforming structured data into natural language prompts, we\ncan map the relationships between proteins into texts. This approach allows\nLLMs to identify indirect connections between proteins, tracing the path from\nupstream to downstream. Therefore, we propose a novel framework ProLLM that\nemploys an LLM tailored for PPI for the first time. Specifically, we propose\nProtein Chain of Thought (ProCoT), which replicates the biological mechanism of\nsignaling pathways as natural language prompts. ProCoT considers a signaling\npathway as a protein reasoning process, which starts from upstream proteins and\npasses through several intermediate proteins to transmit biological signals to\ndownstream proteins. Thus, we can use ProCoT to predict the interaction between\nupstream proteins and downstream proteins. The training of ProLLM employs the\nProCoT format, which enhances the model's understanding of complex biological\nproblems. In addition to ProCoT, this paper also contributes to the exploration\nof embedding replacement of protein sites in natural language prompts, and\ninstruction fine-tuning in protein knowledge datasets. 
We demonstrate the\nefficacy of ProLLM through rigorous validation against benchmark datasets,\nshowing significant improvement over existing methods in terms of prediction\naccuracy and generalizability. The code is available at:\nhttps://github.com/MingyuJ666/ProLLM.","PeriodicalId":501325,"journal":{"name":"arXiv - QuanBio - Molecular Networks","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction\",\"authors\":\"Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang\",\"doi\":\"arxiv-2405.06649\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The prediction of protein-protein interactions (PPIs) is crucial for\\nunderstanding biological functions and diseases. Previous machine learning\\napproaches to PPI prediction mainly focus on direct physical interactions,\\nignoring the broader context of nonphysical connections through intermediate\\nproteins, thus limiting their effectiveness. The emergence of Large Language\\nModels (LLMs) provides a new opportunity for addressing this complex biological\\nchallenge. By transforming structured data into natural language prompts, we\\ncan map the relationships between proteins into texts. This approach allows\\nLLMs to identify indirect connections between proteins, tracing the path from\\nupstream to downstream. Therefore, we propose a novel framework ProLLM that\\nemploys an LLM tailored for PPI for the first time. Specifically, we propose\\nProtein Chain of Thought (ProCoT), which replicates the biological mechanism of\\nsignaling pathways as natural language prompts. 
ProCoT considers a signaling\\npathway as a protein reasoning process, which starts from upstream proteins and\\npasses through several intermediate proteins to transmit biological signals to\\ndownstream proteins. Thus, we can use ProCoT to predict the interaction between\\nupstream proteins and downstream proteins. The training of ProLLM employs the\\nProCoT format, which enhances the model's understanding of complex biological\\nproblems. In addition to ProCoT, this paper also contributes to the exploration\\nof embedding replacement of protein sites in natural language prompts, and\\ninstruction fine-tuning in protein knowledge datasets. We demonstrate the\\nefficacy of ProLLM through rigorous validation against benchmark datasets,\\nshowing significant improvement over existing methods in terms of prediction\\naccuracy and generalizability. The code is available at:\\nhttps://github.com/MingyuJ666/ProLLM.\",\"PeriodicalId\":501325,\"journal\":{\"name\":\"arXiv - QuanBio - Molecular Networks\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Molecular Networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.06649\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Molecular 
Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.06649","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction
The prediction of protein-protein interactions (PPIs) is crucial for
understanding biological functions and diseases. Previous machine learning
approaches to PPI prediction mainly focus on direct physical interactions,
ignoring the broader context of nonphysical connections through intermediate
proteins, thus limiting their effectiveness. The emergence of Large Language
Models (LLMs) provides a new opportunity for addressing this complex biological
challenge. By transforming structured data into natural language prompts, we
can map the relationships between proteins into text. This approach allows
LLMs to identify indirect connections between proteins, tracing the path from
upstream to downstream. Therefore, we propose ProLLM, a novel framework that,
for the first time, employs an LLM tailored for PPI prediction. Specifically, we propose
Protein Chain of Thought (ProCoT), which replicates the biological mechanism of
signaling pathways as natural language prompts. ProCoT considers a signaling
pathway as a protein reasoning process, which starts from upstream proteins and
passes through several intermediate proteins to transmit biological signals to
downstream proteins. Thus, we can use ProCoT to predict the interaction between
upstream proteins and downstream proteins. The training of ProLLM employs the
ProCoT format, which enhances the model's understanding of complex biological
problems. In addition to ProCoT, this paper also explores replacing protein
sites in natural language prompts with protein embeddings, and instruction
fine-tuning on protein knowledge datasets. We demonstrate the
efficacy of ProLLM through rigorous validation against benchmark datasets,
showing significant improvement over existing methods in terms of prediction
accuracy and generalizability. The code is available at:
https://github.com/MingyuJ666/ProLLM.
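The ProCoT idea described above — serializing a signaling pathway as a step-by-step reasoning chain from upstream to downstream proteins — can be illustrated with a minimal sketch. This is a hypothetical illustration only: the function name, prompt wording, and protein names (EGFR, GRB2, SOS1, RAS) are illustrative assumptions, not the paper's actual prompt format.

```python
# Hypothetical sketch of ProCoT-style prompt construction.
# The wording and helper name are illustrative assumptions; the
# actual ProLLM prompt format may differ.
def build_procot_prompt(pathway):
    """Serialize a signaling pathway into a chain-of-thought prompt.

    `pathway` is an ordered list of protein names, upstream first,
    downstream last, with zero or more intermediates in between.
    """
    upstream, *intermediates, downstream = pathway
    steps = []
    prev = upstream
    # Walk the pathway, emitting one reasoning step per hop.
    for protein in intermediates:
        steps.append(f"{prev} transmits a signal to {protein}.")
        prev = protein
    steps.append(f"{prev} transmits a signal to {downstream}.")
    question = (f"Given this signaling path, does {upstream} "
                f"indirectly interact with {downstream}?")
    return " ".join(steps) + " " + question

# Example pathway (illustrative protein names from EGFR/RAS signaling):
prompt = build_procot_prompt(["EGFR", "GRB2", "SOS1", "RAS"])
```

The key design point the abstract describes is that the model predicts the upstream-downstream interaction by reasoning through the intermediate hops, rather than scoring only the direct protein pair.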