Prediction of Protein Expression and Growth Rates by Supervised Machine Learning

Simiao Zhao
Natural Science. Published 2021-08-02. DOI: 10.4236/ns.2021.138025 (https://doi.org/10.4236/ns.2021.138025)
Citations: 1

Abstract

An organism's DNA sequence strongly influences its transcription and translation processes, and thereby its protein production and growth rate. Because of the complexity of DNA, predicting the macroscopic characteristics of organisms has been extremely difficult. With the rapid development of machine learning in recent years, however, it has become possible to use powerful machine learning algorithms to process and analyze biological data. Based on synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. After examining the properties of a data set constructed in previous work, I chose supervised learning regressors that take encoded DNA sequences as input features. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences and trained seven different regressors to predict the outputs. Hyperparameters were optimized for the three regressors with the best potential prediction performance. Finally, I predicted protein production and growth rate with best R² scores of 0.55 and 0.77, respectively, using encoders to capture the underlying features of the DNA sequences.
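
The abstract does not name the three encoders or the seven regressors, so the following is only a minimal sketch of the two ends of such a pipeline: turning a DNA string into numeric input features, and scoring predictions with R². One-hot encoding is an assumption chosen for illustration and may not be one of the encoders used in the paper; the names `NUCLEOTIDES`, `one_hot_encode`, and `r2_score` are hypothetical helpers, not the author's code.

```python
from typing import List, Sequence

# Assumed alphabet for illustration; ambiguous bases such as "N" are rejected.
NUCLEOTIDES = "ACGT"


def one_hot_encode(seq: str) -> List[int]:
    """Flatten a DNA string into a one-hot vector with four entries per base."""
    vector: List[int] = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        # Exactly one of the four positions is set to 1 for each base.
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all values in y_true are identical.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    print(one_hot_encode("ACG"))
    print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Under this definition, a score of 1.0 means the predictions match the measurements exactly, and a score of 0.0 means they do no better than always predicting the mean of the measured values. The reported scores of 0.55 and 0.77 can be read on that scale.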