解决抗体种系偏差及其对语言模型的影响，改进抗体设计。

Bioinformatics (Oxford, England) Pub Date : 2024-11-01 DOI:10.1093/bioinformatics/btae618

Tobias H Olsen, Iain H Moal, Charlotte M Deane

{"title":"解决抗体种系偏差及其对语言模型的影响，改进抗体设计。","authors":"Tobias H Olsen, Iain H Moal, Charlotte M Deane","doi":"10.1093/bioinformatics/btae618","DOIUrl":null,"url":null,"abstract":"Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing to not only have strong and specific binding but also be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences, has led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need be able to suggest key mutations away from germline.Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability.Availability and implementation: AbLang-2 is trained on both unpaired and paired data, and is freely available at https://github.com/oxpig/AbLang2.git.","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11543624/pdf/","citationCount":"0","resultStr":"{\"title\":\"Addressing the antibody germline bias and its effect on language models for improved antibody design.\",\"authors\":\"Tobias H Olsen, Iain H Moal, Charlotte M Deane\",\"doi\":\"10.1093/bioinformatics/btae618\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing to not only have strong and specific binding but also be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences, has led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need be able to suggest key mutations away from germline.Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability.Availability and implementation: AbLang-2 is trained on both unpaired and paired data, and is freely available at https://github.com/oxpig/AbLang2.git.\",\"PeriodicalId\":93899,\"journal\":{\"name\":\"Bioinformatics (Oxford, England)\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11543624/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics (Oxford, England)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioinformatics/btae618\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae618","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

动机抗体的多功能结合特性使其成为一类极其重要的生物治疗药物。然而，治疗性抗体的开发是一项复杂、昂贵和耗时的任务，最终的抗体不仅需要具有强大的特异性结合力，还需要将可开发性问题的影响降至最低。基于转换器的语言模型在蛋白质序列空间的成功应用，以及大量抗体序列的可用性，促进了许多抗体特异性语言模型的开发，以帮助指导抗体设计。抗体的多样性主要来自 V(D)J 重组、CDRs 内的突变和/或 CDRs 外的少数非基因突变。因此，所有天然抗体序列的可变结构域有很大一部分仍然是种系的。这就影响了抗体特异性语言模型的预训练，因为序列数据的这个方面会对种系残基产生普遍偏倚。这就提出了一个挑战，因为远离种系的突变往往对产生特异性和与靶标的强效结合至关重要，这意味着语言模型需要能够提示远离种系的关键突变：在这项研究中，我们探讨了种系偏倚的影响，研究了它对一般蛋白和抗体特异性语言模型的影响。我们开发并训练了一系列新的抗体特异性语言模型，这些模型针对预测非种系残基进行了优化。然后，我们将最终模型 AbLang-2 与当前模型进行了比较，并展示了它是如何以高累积概率提出一系列不同的有效突变的：AbLang-2 可在非配对数据和配对数据上进行训练，可在 https://github.com/oxpig/AbLang2.git.Supplementary 上免费获取：补充数据可从 Journal Name 在线获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Addressing the antibody germline bias and its effect on language models for improved antibody design.

Motivation: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing to not only have strong and specific binding but also be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences, has led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need be able to suggest key mutations away from germline.

Results: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability.

Availability and implementation: AbLang-2 is trained on both unpaired and paired data, and is freely available at https://github.com/oxpig/AbLang2.git.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Bioinformatics (Oxford, England)

自引率

0.00%

发文量