Multimodal pretraining for unsupervised protein representation learning.

Biology Methods and Protocols (IF 2.5; JCR Q3, Biochemical Research Methods). Pub Date: 2024-06-18. eCollection Date: 2024-01-01. DOI: 10.1093/biomethods/bpae043
Viet Thanh Duy Nguyen, Truong Son Hy

Abstract

Proteins are complex biomolecules essential for numerous biological processes, making them crucial targets for advancements in molecular biology, medical research, and drug design. Understanding their intricate, hierarchical structures and functions is vital for progress in these fields. To capture this complexity, we introduce Multimodal Protein Representation Learning (MPRL), a novel framework for symmetry-preserving multimodal pretraining that learns unified, unsupervised protein representations by integrating primary and tertiary structures. MPRL employs Evolutionary Scale Modeling (ESM-2) for sequence analysis, Variational Graph Auto-Encoders (VGAE) for residue-level graphs, and a PointNet Autoencoder (PAE) for 3D point clouds of atoms, each designed to capture the spatial and evolutionary intricacies of proteins while preserving critical symmetries. By leveraging Auto-Fusion to synthesize joint representations from these pretrained models, MPRL ensures robust and comprehensive protein representations. Our extensive evaluation demonstrates that MPRL significantly enhances performance in various tasks such as protein-ligand binding affinity prediction, protein fold classification, enzyme activity identification, and mutation stability prediction. This framework advances the understanding of protein dynamics and facilitates future research in the field. Our source code is publicly available at https://github.com/HySonLab/Protein_Pretrain.
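
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the general Auto-Fusion idea: concatenate the per-modality embeddings, compress them into one fused vector, and score the fused vector by how well the concatenation can be reconstructed from it. Everything here is hypothetical and chosen for illustration only: the function names (`init_layer`, `linear`, `auto_fuse`), the embedding sizes, the tanh activation, and the random untrained weights. It shows a forward pass, not training, and it is not the authors' implementation, which lives in the repository linked above.

```python
import math
import random


def init_layer(n_out, n_in, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Compute W x + b, with W stored as a list of rows."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def auto_fuse(embeddings, encoder, decoder):
    """Fuse per-modality embeddings and return (fused vector, reconstruction loss)."""
    # Concatenate the modality embeddings into one long vector.
    concat = [v for emb in embeddings for v in emb]
    # Compress the concatenation into the fused (joint) representation.
    fused = [math.tanh(h) for h in linear(encoder, concat)]
    # Reconstruct the concatenation from the fused vector.
    recon = linear(decoder, fused)
    # Mean squared error between reconstruction and original concatenation.
    loss = sum((r - c) ** 2 for r, c in zip(recon, concat)) / len(concat)
    return fused, loss


# Stand-in embeddings (assumed sizes) for the sequence, graph, and point-cloud encoders.
rng = random.Random(0)
seq_emb = [rng.gauss(0, 1) for _ in range(8)]
graph_emb = [rng.gauss(0, 1) for _ in range(4)]
point_emb = [rng.gauss(0, 1) for _ in range(4)]

encoder = init_layer(6, 16, rng)   # 16 concatenated dims -> 6 fused dims
decoder = init_layer(16, 6, rng)   # 6 fused dims -> 16 reconstructed dims

fused, loss = auto_fuse([seq_emb, graph_emb, point_emb], encoder, decoder)
print(len(fused), loss)
```

In a trained version of this sketch, the encoder and decoder weights would be optimized to minimize the reconstruction loss, and the fused vector would then serve as the joint protein representation passed to downstream predictors.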
