Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum

Journal of Chemical Information and Modeling (IF 5.6; Zone 2, Chemistry; JCR Q1, Chemistry, Medicinal). Pub Date: 2024-06-05. DOI: 10.1021/acs.jcim.4c00625
Sita Sirisha Madugula, Pranav Pujar, Bharani Nammi, Shouyi Wang, Vindi M. Jayasinghe-Arachchige, Tyler Pham, Dominic Mashburn, Maria Artiles, and Jin Liu*

Abstract

The recent development of CRISPR-Cas technology holds promise for correcting gene-level defects in genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that edits the gene of interest with the assistance of a guide RNA. However, Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, which hinder their widespread application as gene-editing tools. There is therefore a need to identify novel Cas proteins with improved editing properties, which in turn requires understanding the underlying features that govern the Cas families. In this study, we aim to elucidate the unique protein features associated with the Cas9 and Cas12 families and to identify the features distinguishing each family from non-Cas proteins. We built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins, respectively, from non-Cas proteins, using the complete protein feature spectrum (13,494 features) encoding various physicochemical, topological, constitutional, and coevolutionary information on Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent data sets. The Cas12 and Cas9 binary models achieved high overall accuracies of 92% and 95% on their respective independent data sets, while the multiclass classifier achieved an F1 score close to 0.98. We observed that Quasi-Sequence-Order (QSO) descriptors like Schneider.lag and Composition descriptors like charge, volume, and polarizability are predominant in the Cas12 family. Conversely, Amino Acid Composition descriptors, especially Tripeptide Composition (TPC), predominate in the Cas9 family.
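For readers unfamiliar with the Tripeptide Composition (TPC) descriptor mentioned above, the following is a minimal sketch of how such a feature vector is commonly computed: the frequency of each of the 8,000 possible tripeptides over all overlapping three-residue windows of a sequence. The function name `tripeptide_composition`, the choice to skip windows containing non-standard residues, and the toy sequence are assumptions made for illustration; the abstract does not specify the software or exact normalization used in the study.

```python
from itertools import product

# The 20 standard amino acids; 20**3 = 8000 possible tripeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_composition(sequence):
    """Return a dict mapping each tripeptide to its relative frequency.

    Frequencies are taken over all overlapping 3-residue windows that
    contain only standard amino acids (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:  # skip windows with non-standard residues
            counts[window] += 1
            total += 1
    if total == 0:
        return {t: 0.0 for t in TRIPEPTIDES}
    return {t: c / total for t, c in counts.items()}


# Made-up toy sequence (NOT a real Cas protein), for illustration only.
toy_sequence = "MDHIPWNPYYHHA"
tpc = tripeptide_composition(toy_sequence)
nonzero = {t: f for t, f in tpc.items() if f > 0}
print(len(tpc), len(nonzero))
```

Each protein then contributes one 8,000-dimensional row to a feature matrix, which could be combined with other descriptor families before training a classifier.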
Four of the top 10 descriptors identified in Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all Cas9 proteins and are located within different catalytically important domains of the Streptococcus pyogenes Cas9 (SpCas9) structure. Among these, DHI and HHA are well known to be involved in the DNA cleavage activity of the SpCas9 protein. Mutation studies have highlighted the significance of the PWN tripeptide in PAM recognition and DNA cleavage activity of SpCas9, while Y450 from the PYY tripeptide plays a crucial role in reducing off-target effects and improving the specificity of SpCas9. Leveraging our machine learning (ML) pipeline, we identified numerous Cas9 and Cas12 family-specific features. These features offer valuable insights for future experimental and computational studies aimed at designing Cas systems with enhanced gene-editing properties, and they suggest plausible structural modifications that can guide the development of Cas proteins with improved editing capabilities.
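As a rough illustration of what checking tripeptide conservation across a protein family could look like, the sketch below locates each motif in a set of sequences and reports which motifs occur in every one. The helper names `motif_positions` and `conserved_motifs` are hypothetical, and the three sequences are made-up toy strings, not real Cas9 sequences; the abstract does not describe how conservation was actually assessed.

```python
def motif_positions(sequence, motif):
    """Return 1-based start positions of every occurrence (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)
    return positions


def conserved_motifs(sequences, motifs):
    """Return the motifs that occur at least once in every sequence."""
    return [m for m in motifs if all(m in s for s in sequences.values())]


# Made-up toy sequences (NOT real Cas9 proteins), for illustration only.
toy_sequences = {
    "toy_A": "MKDHIAAPWNGGPYYLLHHAK",
    "toy_B": "DHIPWNTTPYYSHHA",
    "toy_C": "AAPWNDHIHHAGG",  # deliberately lacks PYY
}
motifs = ["PWN", "PYY", "HHA", "DHI"]

for name, seq in toy_sequences.items():
    print(name, {m: motif_positions(seq, m) for m in motifs})

conserved = conserved_motifs(toy_sequences, motifs)
print(conserved)
```

On real data, the reported positions could be compared against annotated domain boundaries of a reference structure such as SpCas9 to see which domain each conserved tripeptide falls in.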