Prediction of Solubility of Proteins in Escherichia coli Based on Functional and Structural Features Using Machine Learning Methods

The Protein Journal, vol. 43, no. 5, pp. 983-996 (IF 1.9; Zone 4, Biology; JCR Q4, Biochemistry & Molecular Biology) Pub Date: 2024-09-07 DOI: 10.1007/s10930-024-10230-z
Feiming Huang, Qian Gao, XianChao Zhou, Wei Guo, KaiYan Feng, Lin Zhu, Tao Huang, Yu-Dong Cai

Abstract

Protein solubility is a critical parameter that determines the stability, activity, and functionality of proteins, with broad implications for biotechnology and biochemistry. Accurate prediction and control of protein solubility are essential for successful protein expression and purification in research and industrial settings. This study gathered data on soluble and insoluble proteins. The proteins were mapped to STRING and characterized by functional and structural features, which were integrated into a 5768-dimensional binary vector encoding each protein. Seven feature-ranking algorithms were applied to these features, yielding seven ranked feature lists. Each list was then fed into incremental feature selection with each of four classification algorithms in turn, to build effective classification models and to identify the functional/structural features most important for classification. Several essential features for differentiating soluble from insoluble proteins were identified, including GO:0009987 (intercellular communication) and GO:0022613 (ribonucleoprotein complex biogenesis). The best model, a support vector machine using 295 optimized functional/structural features, achieved an F1 score of 0.825 and can serve as a tool for differentiating soluble from insoluble proteins.
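The pipeline described in the abstract (binary encoding, a ranked feature list, incremental feature selection scored by F1) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a nearest-centroid classifier as a stand-in for the support vector machine, leave-one-out evaluation (the abstract does not state the validation scheme), a hand-written feature ranking in place of the seven ranking algorithms, and hypothetical annotation terms (`term_a` to `term_d`) with made-up toy labels.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def encode(annotations, vocabulary):
    """Binary vector: 1 if the protein carries the term, else 0."""
    return [1 if term in annotations else 0 for term in vocabulary]


def centroid_predict(train_X, train_y, x, cols):
    """Stand-in classifier (assumption): nearest class centroid on columns `cols`."""
    best_label, best_dist = None, None
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(r[c] for r in rows) / len(rows) for c in cols]
        dist = sum((x[c] - m) ** 2 for c, m in zip(cols, centroid))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def incremental_feature_selection(X, y, ranked, step=1):
    """Score the top-k ranked features for growing k; return (best_k, best_f1)."""
    best_k, best_f1 = 0, -1.0
    for k in range(step, len(ranked) + 1, step):
        cols = ranked[:k]
        preds = []
        # Leave-one-out evaluation (assumption made for this toy example).
        for i in range(len(X)):
            train_X = X[:i] + X[i + 1:]
            train_y = y[:i] + y[i + 1:]
            preds.append(centroid_predict(train_X, train_y, X[i], cols))
        score = f1_score(y, preds)
        if score > best_f1:  # keep the smallest k that reaches the best score
            best_k, best_f1 = k, score
    return best_k, best_f1


# Hypothetical toy data: annotation sets with labels 1 = soluble, 0 = insoluble.
vocabulary = ["term_a", "term_b", "term_c", "term_d"]
proteins = [
    ({"term_a", "term_c"}, 1),
    ({"term_a"}, 1),
    ({"term_a", "term_d"}, 1),
    ({"term_b", "term_c"}, 0),
    ({"term_b"}, 0),
    ({"term_b", "term_d"}, 0),
]
X = [encode(annotations, vocabulary) for annotations, _ in proteins]
y = [label for _, label in proteins]

# Hypothetical ranking of feature indices, most important first.
ranked = [0, 1, 2, 3]
best_k, best_f1 = incremental_feature_selection(X, y, ranked)
print(f"best k = {best_k}, F1 = {best_f1:.3f}")
```

In the study this loop would run once per ranked list and per classifier, with the real 5768 features, and the reported optimum was 295 features. Because the comparison is strict, ties between feature counts are resolved in favor of the smaller subset.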
