Accurate and efficient protein embedding using multi-teacher distillation learning

Jiayu Shang, Cheng Peng, Yongxin Ji, Jiaojiao Guan, Dehan Cai, Xubo Tang, Yanni Sun

arXiv:2405.11735 (arXiv - QuanBio - Genomics), 2024-05-20
Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in many learning-based protein annotation and classification problems, including Gene Ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive because of their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods.

Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves performance comparable to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by approximately 70% and maintains almost the same accuracy as the original large models. This makes our method well suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
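To illustrate the general idea of multi-teacher distillation for embeddings, the sketch below trains a small student encoder to reproduce the vectors produced by several large, frozen teacher models. This is a minimal, hypothetical example and not the authors' released implementation: the names StudentEncoder and MultiTeacherDistiller, the mean-pooling architecture, the teacher dimensions (1280 and 1024), and the equal weighting of per-teacher losses are all assumptions made for illustration.

# Hypothetical sketch of multi-teacher embedding distillation (not the paper's code).
# A compact "student" maps a protein to one vector; per-teacher linear heads project
# that vector into each teacher's embedding space and are trained with an MSE loss.

import torch
import torch.nn as nn


class StudentEncoder(nn.Module):
    """Small encoder: token embedding + masked mean pooling + MLP head (assumed design)."""

    def __init__(self, vocab_size: int = 26, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length) integer-encoded amino acids; 0 marks padding.
        mask = (tokens != 0).unsqueeze(-1).float()
        x = self.embed(tokens) * mask
        pooled = x.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # mean over real residues
        return self.mlp(pooled)


class MultiTeacherDistiller(nn.Module):
    """Matches the student embedding to each teacher embedding via a per-teacher head."""

    def __init__(self, student_dim: int, teacher_dims: list[int]):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(student_dim, d) for d in teacher_dims)
        self.mse = nn.MSELoss()

    def forward(self, student_emb: torch.Tensor, teacher_embs: list[torch.Tensor]) -> torch.Tensor:
        # Equal weighting of the per-teacher regression losses is an assumption.
        losses = [self.mse(head(student_emb), t) for head, t in zip(self.heads, teacher_embs)]
        return torch.stack(losses).mean()


if __name__ == "__main__":
    student = StudentEncoder(dim=256)
    distiller = MultiTeacherDistiller(student_dim=256, teacher_dims=[1280, 1024])
    optimizer = torch.optim.Adam(
        list(student.parameters()) + list(distiller.parameters()), lr=1e-4
    )

    # Dummy batch: in practice, the teacher vectors would be precomputed once with
    # frozen pre-trained protein language models (e.g., per-protein mean-pooled embeddings).
    tokens = torch.randint(1, 26, (8, 300))
    teacher_embs = [torch.randn(8, 1280), torch.randn(8, 1024)]

    loss = distiller(student(tokens), teacher_embs)
    loss.backward()
    optimizer.step()
    print(f"distillation loss: {loss.item():.4f}")

In such a setup, the expensive teachers are only run once to precompute target embeddings, and only the compact student is needed at inference time, which is where the reported reduction in computational cost would come from.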