Hoi Yee Chu, John H C Fong, Dawn G L Thean, Peng Zhou, Frederic K C Fung, Yuanhua Huang, Alan S L Wong
Cell Systems, pages 193-203.e6. Journal Article. Published 2024-02-21 (Epub 2024-02-09). DOI: 10.1016/j.cels.2024.01.002 (https://doi.org/10.1016/j.cels.2024.01.002)
Accurate top protein variant discovery via low-N pick-and-validate machine learning.
A strategy that obtains the greatest number of best-performing variants with the least experimental effort across the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning by experimentally testing only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy, up to 92.6%, in selecting the true top 1% of variants in combinatorial mutant libraries; two rounds of 24 variants can also be used. We demonstrate our strategy by successfully discovering high-performance protein variants from diverse families, including CRISPR-based genome editors, supporting its generalizable application to protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
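The loop described in the abstract (rank by a zero-shot score, measure a small batch of top picks, refit, repeat) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the synthetic additive fitness landscape, the noisy stand-in for a zero-shot predictor, the toy per-residue averaging model, and all function names. The schedule reads the abstract as 12 variants per round, so four rounds measure 48 variants, the same total as two rounds of 24.

```python
import itertools
import random

ALPHABET = "ACDEFGHIKL"   # 10 hypothetical residues per site
N_SITES = 3               # 10**3 = 1000 variants in the toy library


def make_landscape(rng):
    """Synthetic additive fitness landscape (an assumption, not real data)."""
    effects = [{aa: rng.gauss(0, 1) for aa in ALPHABET} for _ in range(N_SITES)]
    library = ["".join(v) for v in itertools.product(ALPHABET, repeat=N_SITES)]
    fitness = {
        v: sum(effects[i][aa] for i, aa in enumerate(v)) + rng.gauss(0, 0.3)
        for v in library
    }
    return library, fitness


def fit_site_means(measured):
    """Toy model: mean measured fitness for each (site, residue) pair seen so far."""
    sums, counts = {}, {}
    for variant, y in measured.items():
        for key in enumerate(variant):          # key is (site index, residue)
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def predict(variant, site_means, default):
    """Average the per-site estimates; unseen (site, residue) pairs get the default."""
    return sum(site_means.get(key, default) for key in enumerate(variant)) / len(variant)


def pick_and_validate(library, assay, zero_shot, rounds=4, batch=12):
    """Round 1 picks by zero-shot score; later rounds pick by the refit toy model."""
    measured = {}
    ranking = sorted(library, key=zero_shot.get, reverse=True)
    for _ in range(rounds):
        picks = [v for v in ranking if v not in measured][:batch]
        for v in picks:
            measured[v] = assay(v)              # "validate": run the experiment
        site_means = fit_site_means(measured)
        default = sum(measured.values()) / len(measured)
        ranking = sorted(
            library, key=lambda v: predict(v, site_means, default), reverse=True
        )
    return measured, ranking


def top_fraction_hit_rate(ranking, fitness, fraction=0.01):
    """Share of the true top fraction that appears in the predicted top fraction."""
    k = max(1, round(len(ranking) * fraction))
    true_top = set(sorted(fitness, key=fitness.get, reverse=True)[:k])
    return len(true_top & set(ranking[:k])) / k


rng = random.Random(0)
library, fitness = make_landscape(rng)
# Noisy copy of the truth, standing in for a real zero-shot predictor.
zero_shot = {v: fitness[v] + rng.gauss(0, 2.0) for v in library}
measured, ranking = pick_and_validate(library, fitness.get, zero_shot)
print(len(measured), top_fraction_hit_rate(ranking, fitness))
```

Calling `pick_and_validate(..., rounds=2, batch=24)` runs the alternative schedule with the same experimental budget. The hit rate printed by this toy has no bearing on the 92.6% figure reported in the abstract, which comes from the authors' own models and libraries.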