Data-driven Protein Engineering

Protein engineering Pub Date : 2021-01-01 DOI:10.1002/9783527815128.ch6

J. Greenhalgh, Apoorv Saraogee, Philip A. Romero


Citations: 1

Abstract


Data‐driven Protein Engineering

Introduction

A protein’s sequence of amino acids encodes its function. This “function” could refer to a protein’s natural biological function, or to any other property, including binding affinity toward a particular ligand, thermodynamic stability, or catalytic activity. A detailed understanding of how these functions are encoded would allow us to more accurately reconstruct the tree of life and possibly predict future evolutionary events, diagnose genetic diseases before they manifest symptoms, and design new proteins with useful properties.

We know that a protein sequence folds into a three-dimensional structure, and that this structure positions specific chemical groups to perform a function; however, we are missing the quantitative details of this sequence-structure-function mapping. The mapping is extraordinarily complex because it involves thousands of molecular interactions that are dynamically coupled across multiple length and time scales. Computational methods can be used to model the mapping from sequence to structure to function. Tools such as molecular dynamics simulations or Rosetta use atomic representations of protein structures and physics-based energy functions to model structures and functions (1–3). While these models are based on well-founded physical principles, they often fail to capture a protein’s global behavior and properties. Physics-based models face numerous challenges, including the need to account for conformational dynamics, the energy function approximations required for computational efficiency, and the fact that, for many complex properties such as enzyme catalysis, the molecular basis is simply unknown (4). In systems composed of thousands of atoms, the propagation of small errors quickly overwhelms any predictive accuracy.

Despite tremendous breakthroughs and research progress over the last century, we still lack the key details needed to reliably predict, simulate, and design protein function. In this chapter, we present the emerging field of data-driven protein engineering. Instead of physically modeling the relationships between protein sequence, structure, and function, data-driven methods use ideas from statistics and machine learning to infer these complex relationships from data. This top-down modeling approach implicitly captures the numerous and possibly unknown factors that shape the mapping from sequence to function. Statistical models have been used to understand the molecular basis of protein function and provide exceptional predictive accuracy for protein design.
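
To make the idea of inferring a sequence-to-function mapping from data concrete, the following is a minimal sketch and not the chapter's own method. It assumes a one-hot sequence encoding and a ridge-regularized linear model fitted by gradient descent. The sequences, the "stability" scores, and all function names (`one_hot`, `fit_linear`, `predict`) are hypothetical and chosen only for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a sequence into a 0/1 vector with 20 entries per position."""
    vec = []
    for aa in seq:
        vec.extend(1.0 if aa == a else 0.0 for a in AMINO_ACIDS)
    return vec


def predict(weights, bias, x):
    """Linear model: bias plus the weights of the residues present."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def fit_linear(X, y, lr=0.05, epochs=2000, l2=1e-3):
    """Fit a ridge-regularized linear model by batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = predict(weights, bias, x) - target
            grad_b += err
            for j, xj in enumerate(x):
                if xj:  # only active one-hot features contribute
                    grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical toy data: 4-residue variants with made-up "stability" scores.
train = [("ACDE", 1.0), ("ACDF", 0.2), ("GCDE", 1.5), ("GCDF", 0.7), ("ACNE", 0.6)]
X = [one_hot(seq) for seq, _ in train]
y = [score for _, score in train]
weights, bias = fit_linear(X, y)

# Score an unseen variant that combines substitutions seen separately in training.
prediction = predict(weights, bias, one_hot("GCNF"))
print(f"predicted score for GCNF: {prediction:.2f}")
```

Because the toy scores were constructed to be additive across positions, the model's estimate for the unseen variant `GCNF` should come out close to 0.3. Real sequence-function data are rarely this well behaved, and a linear model is only one of many statistical choices; it is used here because it is the simplest model that learns per-residue contributions directly from measurements, without any physical description of the protein.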

