Prediction of Protein Solubility using Primary Structure Compositional Features: A Machine Learning Perspective

N. Rasool, Waqar Hussain, S. Mahmood
Journal of Proteomics & Bioinformatics, vol. 10, pp. 324-328. Published 2017-12-29. DOI: 10.4172/JPB.1000458 (https://doi.org/10.4172/JPB.1000458)
Citations: 9

Abstract

Obtaining sufficient concentrations of soluble protein with in vitro methods is a recurring limiting factor. Solubility is an independent characteristic of a protein that can be determined from its amino acid composition under specific experimental conditions. The present study aims to predict protein solubility with machine learning approaches that use primary structure information. The features comprise amino acid compositional features together with physicochemical properties of the amino acids, i.e. canonical value, hydrophobicity, solubility index and solubility score. All four features were calculated for a dataset of 6372 protein sequences (4850 soluble and 1522 insoluble). Using the calculated values, four prediction models were developed, based on Multilayer Perceptron (MLP), Random Forest (RF), Decision Tree (DT) and Naive Bayes Classifier (NBC). Performance was evaluated using MCC, F-measure, accuracy, precision and recall. Of the four models, MLP was the most accurate for predicting protein solubility, with an accuracy of 95.92%, followed by RF and NBC. The proposed MLP-based model can be used to predict protein solubility as a preprocessing step before experimental determination. The method is resource- and time-efficient and can predict protein solubility in place of laborious experimental work.
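The two most concrete ingredients named in the abstract, amino acid composition and the evaluation metrics, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `aa_composition` and `binary_metrics` are hypothetical, "soluble" is assumed to be the positive class, and the other features (canonical value, hydrophobicity, solubility index, solubility score) are omitted because the abstract does not define them.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    Hypothetical helper: non-standard characters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, F-measure and MCC from
    confusion-matrix counts (assumed positive class: soluble)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is reported as 0 when it is undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Composition of a toy four-residue sequence.
composition = aa_composition("AAGC")

# Trivial baseline on the stated class sizes: label every sequence soluble.
baseline = binary_metrics(tp=4850, fp=1522, tn=0, fn=0)
```

With the class sizes given in the abstract, the all-soluble baseline already reaches an accuracy of 4850/6372 (about 76%) while its MCC is 0, which is one reason to read MCC and F-measure alongside the reported accuracy on an imbalanced dataset.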