CasGen: A Regularized Generative Model for CRISPR Cas Protein Design with Classification and Margin-Based Optimization.

Bharani Nammi, Vindi M Jayasinghe-Arachchige, Sita Sirisha Madugula, Maria Artiles, Charlene Norgan Radler, Tyler Pham, Jin Liu, Shouyi Wang
bioRxiv : the preprint server for biology. Published 2025-03-01. DOI: 10.1101/2025.02.28.640911 (https://doi.org/10.1101/2025.02.28.640911). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11888460/pdf/

Abstract

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-associated (Cas) protein systems have revolutionized genome editing by providing high precision and versatility. However, most genome editing applications rely on a limited number of well-characterized Cas9 and Cas12 variants, which constrains broader genome engineering applications. In this study, we extensively explored Cas9 and Cas12 proteins and developed CasGen, a novel transformer-based deep generative model with margin-based latent space regularization to enhance the quality of newly generated Cas9 and Cas12 proteins. Specifically, CasGen employs a strategy that combines classification to filter out non-Cas sequences, Bayesian optimization of the latent space to guide functionally relevant designs, and thorough structural validation using AlphaFold-based analyses to ensure robust protein generation. We collected a comprehensive dataset of 3,021 Cas9, 597 Cas12, and 597 non-Cas protein sequences from reputable biological databases such as InterPro and PDB. To validate the generated proteins, we performed sequence alignment using the BLAST tool to ensure novelty and to filter out sequences highly similar to existing Cas proteins. Structural prediction using AlphaFold2 and AlphaFold3 confirmed that the generated proteins exhibit high structural similarity to known Cas9 and Cas12 variants, with TM-scores between 0.70 and 0.85 and root-mean-square deviation (RMSD) values below 2.00 Å. Sequence identity analysis further demonstrated that the generated Cas9 orthologs exhibited 28% to 55% identity with known variants, while Cas12a variants showed up to 48% identity. Our results demonstrate that the proposed Cas generative model has significant potential to expand the genome editing toolkit by designing diverse Cas proteins that retain functional integrity. The developed deep generative approach offers a promising avenue for synthetic biology and therapeutic applications, enabling the development of more precise and versatile Cas-based genome editing tools.
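
The validation step described in the abstract (a BLAST identity check for novelty plus TM-score and RMSD checks for structural similarity) can be pictured as a simple screen over generated candidates. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Candidate` type, the function names, and the example values are hypothetical, and the thresholds simply echo the ranges reported in the abstract rather than any filtering criteria the authors state.

```python
from dataclasses import dataclass

# Illustrative thresholds only (assumption): they echo the ranges reported
# in the abstract and are not cut-offs stated by the authors.
MIN_TM_SCORE = 0.70
MAX_RMSD_ANGSTROM = 2.00
MAX_IDENTITY_PERCENT = 55.0


@dataclass(frozen=True)
class Candidate:
    """A generated sequence with precomputed validation metrics (hypothetical type)."""
    name: str
    tm_score: float          # structural similarity to a known Cas reference
    rmsd: float              # RMSD in angstroms against the same reference
    identity_percent: float  # best sequence identity to any known Cas protein


def passes_screen(c: Candidate) -> bool:
    """Return True if the candidate is structurally similar yet novel in sequence."""
    structurally_similar = c.tm_score >= MIN_TM_SCORE and c.rmsd < MAX_RMSD_ANGSTROM
    novel = c.identity_percent <= MAX_IDENTITY_PERCENT
    return structurally_similar and novel


def screen(candidates):
    """Keep only the candidates that pass both checks, preserving input order."""
    return [c for c in candidates if passes_screen(c)]


# Made-up values for illustration; these are not results from the paper.
example = [
    Candidate("gen_a", tm_score=0.82, rmsd=1.4, identity_percent=41.0),
    Candidate("gen_b", tm_score=0.55, rmsd=3.1, identity_percent=35.0),  # poor fold match
    Candidate("gen_c", tm_score=0.84, rmsd=1.2, identity_percent=93.0),  # near-duplicate
]
kept = screen(example)
print([c.name for c in kept])
```

In this sketch the metrics are assumed to be computed upstream (for example from BLAST alignments and predicted structures); the screen only combines them, which keeps the novelty and structural criteria easy to adjust independently.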
