CodonTransformer: a multispecies codon optimizer using context-aware neural networks
Adibvafa Fallahpour, Vincent Gureghian, Guillaume J. Filion, Ariel B. Lindner, Amir Pandi
bioRxiv - Synthetic Biology, published 2024-09-13. DOI: 10.1101/2024.09.13.612903 (https://doi.org/10.1101/2024.09.13.612903)
Abstract
The genetic code is degenerate, allowing many possible DNA sequences to encode the same protein. Because each organism has its own codon usage preferences, this degeneracy affects the efficiency of heterologous protein production. Codon optimization, the process of tailoring synonymous codons to a specific organism, must respect local sequence patterns that go beyond global codon preferences. As a result, the search space undergoes a combinatorial explosion that makes exhaustive exploration impossible. Nevertheless, across the diversity of life on Earth, natural selection has already optimized sequences, providing a rich source of data from which machine learning algorithms can learn the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life. The model is context-aware thanks to the attention mechanism and bidirectionality of the Transformer architecture we used, and to a novel sequence representation that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimal negative cis-regulatory elements. This work introduces a novel strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a state-of-the-art codon optimization framework with a customizable open-access model and a user-friendly interface.
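To make the combinatorial explosion mentioned in the abstract concrete, the following minimal sketch counts how many distinct DNA sequences encode a given protein. It is an illustration only and not part of CodonTransformer: it assumes the standard genetic code (NCBI translation table 1), ignores the stop codon, and the names `CODON_TABLE`, `SYNONYMS`, and `count_synonymous_sequences` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), written in the usual
# compact form: codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid ('*' marks a stop codon).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons available for each amino acid.
SYNONYMS = Counter(CODON_TABLE.values())


def count_synonymous_sequences(protein: str) -> int:
    """Return how many distinct coding sequences translate to `protein`.

    Assumption: every position is chosen independently, so the count is
    the product of the per-residue codon multiplicities.
    """
    return prod(SYNONYMS[aa] for aa in protein)


if __name__ == "__main__":
    # Met (1 codon) x Lys (2) x Thr (4) x Leu (6)
    print(count_synonymous_sequences("MKTL"))  # 48
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustively scoring every candidate sequence is not feasible for proteins of realistic size.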