Structure of the space of folding protein sequences defined by large language models.

IF 2, Zone 4 (Biology), Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY, Physical biology, Pub Date: 2024-01-31, DOI: 10.1088/1478-3975/ad205c
A Zambon, R Zecchina, G Tiana
Citations: 0

Abstract

Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy than those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in key sites and local entropy. Additionally, evaluations through molecular dynamics suggest that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.
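
To make the contrast between narrow and wide, flat minima concrete, the following is a minimal Python sketch and not the authors' method. The abstract's effective energy comes from a transformer-based structure predictor and its search uses specialized local-entropy algorithms; here both are replaced by stand-ins. `toy_energy`, `KEY_SITES`, `metropolis`, and `local_entropy_proxy` are hypothetical names introduced only for illustration: the energy counts mismatches at a few assumed key sites, the sampler is plain Metropolis Monte Carlo with single-site mutations, and the "local entropy" is approximated by the fraction of single-point mutants whose energy stays within a tolerance.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
KEY_SITES = (0, 2, 4, 6, 8)        # hypothetical constrained positions


def toy_energy(seq, target, key_sites=KEY_SITES):
    """Toy stand-in for an effective energy (NOT a structure predictor):
    number of key sites where `seq` differs from `target`."""
    return sum(seq[i] != target[i] for i in key_sites)


def metropolis(seq, energy, steps, temperature, rng):
    """Plain Metropolis sampling with single-site mutations."""
    seq = list(seq)
    e = energy(seq)
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        e_new = energy(seq)
        # Always accept downhill or flat moves; accept uphill moves
        # with the Boltzmann probability exp(-(e_new - e) / T).
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            e = e_new
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(seq), e


def local_entropy_proxy(seq, energy, tolerance):
    """Crude flatness measure: fraction of single-point mutants whose
    energy is at most `tolerance` above the energy of `seq`."""
    seq = list(seq)
    base = energy(seq)
    ok = total = 0
    for pos in range(len(seq)):
        old = seq[pos]
        for aa in ALPHABET:
            if aa == old:
                continue
            seq[pos] = aa
            ok += energy(seq) <= base + tolerance
            total += 1
        seq[pos] = old  # restore before moving to the next position
    return ok / total


rng = random.Random(0)
target = "MKTAYIAKQR"  # arbitrary illustrative sequence
start = "".join(rng.choice(ALPHABET) for _ in target)


def energy(s):
    return toy_energy(s, target)


best, e = metropolis(start, energy, steps=5000, temperature=0.2, rng=rng)
flatness = local_entropy_proxy(best, energy, tolerance=0)
print(best, e, flatness)
```

In this toy landscape, mutations at unconstrained positions leave the energy unchanged, so a minimum is "wide" in proportion to the number of free sites. A real study would replace `toy_energy` with a predictor-derived score and the proxy with a proper local-entropy estimate.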

Source journal: Physical biology (Biology, Biophysics)
CiteScore: 4.20
Self-citation rate: 0.00%
Articles published: 50
Review time: 3 months
Journal description: Physical Biology publishes articles in the broad interdisciplinary field bridging biology with the physical sciences and engineering. This journal focuses on research in which quantitative approaches – experimental, theoretical and modeling – lead to new insights into biological systems at all scales of space and time, and all levels of organizational complexity. Physical Biology accepts contributions from a wide range of biological sub-fields, including topics such as: molecular biophysics, including single molecule studies, protein-protein and protein-DNA interactions; subcellular structures, organelle dynamics, membranes, protein assemblies, chromosome structure; intracellular processes, e.g. cytoskeleton dynamics, cellular transport, cell division; systems biology, e.g. signaling, gene regulation and metabolic networks; cells and their microenvironment, e.g. cell mechanics and motility, chemotaxis, extracellular matrix, biofilms; cell-material interactions, e.g. biointerfaces, electrical stimulation and sensing, endocytosis; cell-cell interactions, cell aggregates, organoids, tissues and organs; developmental dynamics, including pattern formation and morphogenesis; physical and evolutionary aspects of disease, e.g. cancer progression, amyloid formation; neuronal systems, including information processing by networks, memory and learning; population dynamics, ecology, and evolution; collective action and emergence of collective phenomena.