Balanced Training Sets Improve Deep Learning-Based Prediction of CRISPR sgRNA Activity

IF 3.7 · Zone 2 (Biology) · JCR Q1 (Biochemical Research Methods) · ACS Synthetic Biology · Pub Date: 2024-11-04 · DOI: 10.1021/acssynbio.4c00542

Varun Trivedi, Amirsadra Mohseni, Stefano Lonardi, and Ian Wheeldon*

ACS Synthetic Biology, volume 13, issue 11, pages 3774–3781. Journal Article. https://pubs.acs.org/doi/10.1021/acssynbio.4c00542

Citations: 0

Abstract


CRISPR-Cas systems have transformed the field of synthetic biology by providing a versatile method for genome editing. The efficiency of CRISPR systems is largely dependent on the sequence of the constituent sgRNA, necessitating the development of computational methods for designing active sgRNAs. While deep learning-based models have shown promise in predicting sgRNA activity, the accuracy of prediction is primarily governed by the data set used in model training. Here, we trained a convolutional neural network (CNN) model and a large language model (LLM) on balanced and imbalanced data sets generated from CRISPR-Cas12a screening data for the yeast Yarrowia lipolytica and evaluated their ability to predict high- and low-activity sgRNAs. We further tested whether prediction performance can be improved by training on imbalanced data sets augmented with synthetic sgRNAs. Lastly, we demonstrated that adding synthetic sgRNAs to inherently imbalanced CRISPR-Cas9 data sets from Y. lipolytica and Komagataella phaffii leads to improved performance in predicting sgRNA activity, thus underscoring the importance of employing balanced training sets for accurate sgRNA activity prediction.
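To make the notion of a "balanced" training set concrete, here is a minimal sketch, assuming binary high/low activity labels and simple random downsampling of the larger class. It is not the authors' pipeline: the abstract does not say how the balanced sets or the synthetic sgRNAs were constructed, so the function names (`one_hot`, `balance_by_downsampling`), the 20-nt guide length, and the randomly generated toy sequences are all hypothetical placeholders.

```python
import random
from collections import Counter, defaultdict

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as rows of 0/1 flags in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def balance_by_downsampling(records, seed=0):
    """Return a class-balanced copy of (sequence, label) records.

    Assumption: balancing is done by randomly downsampling every class
    to the size of the smallest one. Other strategies (oversampling,
    adding synthetic examples) are equally plausible.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for seq, label in records:
        groups[label].append((seq, label))
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data only: random placeholder guides, not real screening results.
toy_rng = random.Random(1)

def random_guide(length=20):
    return "".join(toy_rng.choice(BASES) for _ in range(length))

toy = [(random_guide(), "high") for _ in range(3)]
toy += [(random_guide(), "low") for _ in range(12)]

balanced = balance_by_downsampling(toy)
label_counts = Counter(label for _, label in balanced)
features = [one_hot(seq) for seq, _ in balanced]
print(label_counts, len(features))
```

Downsampling discards data from the larger class, which is one reason augmenting the smaller class with additional examples, as the abstract describes, can be an attractive alternative.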


Source journal: ACS Synthetic Biology (Biology-)

CiteScore: 8.00

Self-citation rate: 10.60%

Articles published: 380

Review time: 6-12 weeks

Journal description: The journal is particularly interested in studies on the design and synthesis of new genetic circuits and gene products; computational methods in the design of systems; and integrative applied approaches to understanding disease and metabolism.

Topics may include, but are not limited to: design and optimization of genetic systems; genetic circuit design and the principles for their organization into programs; computational methods to aid the design of genetic systems; experimental methods to quantify genetic parts, circuits, and metabolic fluxes; genetic parts libraries: their creation, analysis, and ontological representation; protein engineering, including computational design; metabolic engineering and cellular manufacturing, including biomass conversion; natural product access, engineering, and production; creative and innovative applications of cellular programming; medical applications, tissue engineering, and the programming of therapeutic cells; minimal cell design and construction; genomics and genome replacement strategies; viral engineering; automated and robotic assembly platforms for synthetic biology; DNA synthesis methodologies; metagenomics and synthetic metagenomic analysis; bioinformatics applied to gene discovery, chemoinformatics, and pathway construction; gene optimization; methods for genome-scale measurements of transcription and metabolomics; systems biology and methods to integrate multiple data sources; in vitro and cell-free synthetic biology and molecular programming; nucleic acid engineering.