{"title":"Speaking to nature: a deep learning representational model of proteins ushers in protein linguistics.","authors":"Daniel Bojar","doi":"10.1093/synbio/ysz013","DOIUrl":null,"url":null,"abstract":"Understanding, modifying and designing proteins require an intimate knowledge of their 3D structure. Even structure-agnostic protein engineering approaches, such as directed evolution, are limited in scope because of the vast potential sequence space and the epistatic effects that multiple mutations have on protein function. To overcome these difficulties, a holistic understanding of sequence–structure–function relationships has to be established. In their recent preprint, members of the Church Group at the Wyss Institute and collaborators describe a novel approach to predicting protein stability and functionality from raw sequence (1). Their representational model UniRep (unified representation), for the first time, demonstrates an advanced understanding of protein features by means of language modeling. Using deep learning techniques, which were recently recognized with the prestigious Turing Award, Alley et al. built a language model for proteins with amino acids as characters based on natural language processing (NLP) techniques. NLP has not only revolutionized our computational understanding of language—think for instance voice-to-text software—but has been coopted for exciting applications in synthetic biology. The recurrent neural network (RNN; a type of neural network which can process sequential inputs such as text) used by Alley et al. was trained by iteratively predicting the next amino acid given the preceding amino acids for the 24 million protein sequences contained in the UniRef50 database. The RNN thus gathered implicit knowledge about the context of a given amino acid and higher-level features such as secondary structure. The authors then averaged the protein representation of their RNN at every sequence position to yield a protein language representation they call UniRep. They then extended UniRep by adding representations of the final sequence position of their RNN to generate the more complete representation called ‘UniRep Fusion’, which serves as an overview of the entire protein sequence. UniRep Fusion was then used as an input for a machine learning model to predict protein stability. Notably, this architecture was more accurate than Rosetta, the de facto state-ofthe-art for predicting protein stability. Their protein language representation allowed the authors to predict the relative brightness of 64 800 GFP mutants differing in as few as one amino acids. Remarkably, their predicted relative brightness values correlated strongly with experimental observation (r1⁄4 0.98). UniRep, as the representation of 24 million proteins, captures many phenomena of general importance for protein structure and function. These general features can be complemented by dataset-specific attributes when training on a subset of protein mutants or de novo designed proteins. This approach could for instance be adopted for screening novel proteins generated by deep learning models. Analogous to de novo designed proteins by Rosetta, generating proteins through protein language models might be most advantageous for proteins with radically new functionalities, which are unlikely to be generated by incremental directed evolution. To arrive in this virtual world of protein engineering though, more advances have to be made. 
It required the authors of UniRep 1 week of GPU usage to train their large model for one epoch (seeing every protein sequence in UniRef50 once). Switching from the redundancy-ridden UniRef50 database ( 24 million sequences) to preUEP (2), a redundancy-reduced protein sequence database ( 8 million sequences), might enable faster training. This reductionist approach might allow for the ‘vocabulary’ of the model to be extended from single amino acids to larger protein fragments, capturing more structural properties. In general, there are a plethora of NLP techniques developed for written languages which might be useful in protein linguistics. One particularly promising concept would be attention (3), the selective focus on sequence stretches far away from each other, which dramatically improves language models. Given that protein language may be considered one of the most natural languages by definition, modern NLP techniques could transform protein linguistics into a potent tool for the study as well as engineering of proteins for the purposes of synthetic biology.","PeriodicalId":74902,"journal":{"name":"Synthetic biology (Oxford, England)","volume":"4 1","pages":"ysz013"},"PeriodicalIF":2.6000,"publicationDate":"2019-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1093/synbio/ysz013","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Synthetic biology (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/synbio/ysz013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2019/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Abstract
Understanding, modifying and designing proteins require an intimate knowledge of their 3D structure. Even structure-agnostic protein engineering approaches, such as directed evolution, are limited in scope because of the vast potential sequence space and the epistatic effects that multiple mutations have on protein function. To overcome these difficulties, a holistic understanding of sequence–structure–function relationships has to be established.

In their recent preprint, members of the Church Group at the Wyss Institute and collaborators describe a novel approach to predicting protein stability and functionality from raw sequence (1). Their representational model UniRep (unified representation) demonstrates, for the first time, an advanced understanding of protein features by means of language modeling. Using deep learning techniques, which were recently recognized with the prestigious Turing Award, Alley et al. built a language model for proteins, with amino acids as characters, based on natural language processing (NLP) techniques. NLP has not only revolutionized our computational understanding of language (think, for instance, of voice-to-text software) but has also been co-opted for exciting applications in synthetic biology.

The recurrent neural network (RNN; a type of neural network that can process sequential inputs such as text) used by Alley et al. was trained by iteratively predicting the next amino acid, given the preceding amino acids, for the 24 million protein sequences contained in the UniRef50 database. The RNN thus gathered implicit knowledge about the context of a given amino acid and about higher-level features such as secondary structure. The authors then averaged the representation of their RNN at every sequence position to yield a protein language representation they call UniRep. They extended UniRep by adding the representation of the final sequence position of their RNN to generate the more complete representation called 'UniRep Fusion', which serves as an overview of the entire protein sequence.

UniRep Fusion was then used as an input for a machine learning model to predict protein stability. Notably, this architecture was more accurate than Rosetta, the de facto state-of-the-art for predicting protein stability. Their protein language representation also allowed the authors to predict the relative brightness of 64,800 GFP mutants differing by as little as a single amino acid. Remarkably, the predicted relative brightness values correlated strongly with experimental observations (r = 0.98).

UniRep, as the representation of 24 million proteins, captures many phenomena of general importance for protein structure and function. These general features can be complemented by dataset-specific attributes when training on a subset of protein mutants or on de novo designed proteins. This approach could, for instance, be adopted for screening novel proteins generated by deep learning models. Analogous to proteins designed de novo with Rosetta, generating proteins through protein language models might be most advantageous for proteins with radically new functionalities, which are unlikely to arise through incremental directed evolution. To arrive at this virtual world of protein engineering, though, more advances have to be made. Training the large UniRep model for a single epoch (seeing every protein sequence in UniRef50 once) required one week of GPU usage.
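To make the language-modeling setup described above concrete, here is a minimal sketch in Python (PyTorch) of next-amino-acid prediction, followed by the averaging of per-position hidden states into a fixed-length representation. A plain LSTM stands in for the multiplicative LSTM (mLSTM) actually used by Alley et al., and all dimensions, names and the toy sequence are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
VOCAB, EMBED, HIDDEN = len(AMINO_ACIDS), 10, 64  # toy sizes, not the published ones

class ProteinLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.LSTM(EMBED, HIDDEN, batch_first=True)  # stand-in for mLSTM
        self.head = nn.Linear(HIDDEN, VOCAB)  # scores for the next amino acid

    def forward(self, x):
        states, _ = self.rnn(self.embed(x))  # (batch, length, HIDDEN)
        return self.head(states), states

def fused_representation(states):
    """Average the per-position hidden states (the UniRep idea) and
    concatenate the final-position state (in the spirit of UniRep Fusion)."""
    return torch.cat([states.mean(dim=1), states[:, -1, :]], dim=-1)

model = ProteinLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

seq = torch.tensor([[AA_TO_IDX[a] for a in "MKTAYIAKQR"]])  # toy sequence
logits, states = model(seq[:, :-1])  # predict residue t+1 from residues <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()

protein_vector = fused_representation(states.detach())  # fixed-length vector per protein
```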
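The downstream use of such a representation can likewise be sketched: a regularized linear 'top model' trained on fixed feature vectors to predict a stability score. The feature matrix and labels below are random placeholders, and ridge regression is only one plausible choice of top model, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))  # one fixed-length representation per variant (placeholder)
y = rng.normal(size=500)         # measured stability scores (placeholder)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
top_model = Ridge(alpha=1.0).fit(X_train, y_train)
print("held-out R^2:", top_model.score(X_test, y_test))
```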
Given this training cost, switching from the redundancy-ridden UniRef50 database (~24 million sequences) to preUEP (2), a redundancy-reduced protein sequence database (~8 million sequences), might enable faster training. This reductionist approach might also allow the 'vocabulary' of the model to be extended from single amino acids to larger protein fragments, capturing more structural properties.

In general, there is a plethora of NLP techniques developed for written languages that might be useful in protein linguistics. One particularly promising concept is attention (3), the selective focus on sequence stretches that are far apart from each other, which has dramatically improved language models. Given that protein language may be considered one of the most natural languages by definition, modern NLP techniques could transform protein linguistics into a potent tool for the study, as well as the engineering, of proteins for the purposes of synthetic biology.
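One simple way to realize the vocabulary extension mentioned above is overlapping k-mer tokenization, sketched below; the choice of k = 3 and the toy sequence are assumptions for illustration only.

```python
def kmer_tokens(sequence, k=3):
    """Overlapping k-mers of a protein sequence, e.g. 'MKTA' -> ['MKT', 'KTA']."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokens("MKTAYIAKQR")
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # fragment vocabulary
token_ids = [vocab[t] for t in tokens]  # integer ids a language model could consume
```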
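The attention concept referenced above can be illustrated with a minimal scaled dot-product self-attention sketch: every sequence position attends to every other, letting distant stretches interact directly. The matrices and sizes here are toy placeholders, not a model of any published architecture.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention. x: (length, d); w_*: (d, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise position affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ v                             # values mixed across the sequence

rng = np.random.default_rng(0)
length, d = 8, 16                                  # toy sizes
x = rng.normal(size=(length, d))                   # embedded amino-acid stretch (placeholder)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
context = self_attention(x, w_q, w_k, w_v)         # shape (length, d)
```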