Protein sequence classification using natural language processing techniques

Huma Perveen (School of Mathematical and Physical Sciences, University of Sussex, Brighton, UK) and Julie Weeds (School of Engineering and Informatics, University of Sussex, Brighton, UK)
arXiv:2409.04491 · arXiv - QuanBio - Quantitative Methods · Published 2024-09-06

Abstract

Proteins are essential to numerous biological functions, with their sequences determining their roles within organisms. Traditional methods for determining protein function are time-consuming and labor-intensive. This study addresses the increasing demand for precise, effective, and automated protein sequence classification methods by employing natural language processing (NLP) techniques on a dataset comprising 75 target protein classes. We explored various machine learning and deep learning models, including K-Nearest Neighbors (KNN), Multinomial Naïve Bayes, Logistic Regression, Multi-Layer Perceptron (MLP), Decision Tree, Random Forest, XGBoost, Voting and Stacking classifiers, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and transformer models (BertForSequenceClassification, DistilBERT, and ProtBert). Experiments were conducted using amino acid n-grams ranging from 1 to 4 for the machine learning models and different sequence lengths for the CNN and LSTM models. The KNN algorithm performed best on tri-gram data, with 70.0% accuracy and a macro F1 score of 63.0%. The Voting classifier achieved 74.0% accuracy and an F1 score of 65.0%, while the Stacking classifier reached 75.0% accuracy and an F1 score of 64.0%. ProtBert demonstrated the highest performance among the transformer models, with 76.0% accuracy and a 61.0% F1 score; all three transformer models achieved these same scores. Advanced NLP techniques, particularly ensemble methods and transformer models, show great potential in protein classification. Our results demonstrate that ensemble methods, particularly soft Voting classifiers, achieved superior results, highlighting the importance of sufficient training data and of addressing sequence similarity across different classes.
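The n-gram featurization and soft-voting ensemble described in the abstract can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn; the toy sequences, class labels, and hyperparameters are invented for demonstration and are not the authors' dataset or settings.

```python
# Sketch: treat protein sequences as text, count overlapping amino-acid
# tri-grams, and combine classical classifiers with soft voting.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data (the paper uses 75 protein classes).
train_seqs = ["MKVLAAGIT", "MKVLAAGLS", "GGTAPQRSV", "GGTAPQRSW"]
train_labels = ["kinase", "kinase", "transporter", "transporter"]

# Character tri-grams over the amino-acid alphabet — the 3-gram
# setting on which KNN performed best in the paper's experiments.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(train_seqs)

# Soft voting averages the predicted class probabilities of the base
# models, mirroring the paper's best-performing "Voting Soft" ensemble.
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, train_labels)

# Classify an unseen sequence that shares most of its tri-grams with
# the "kinase" examples.
query = vectorizer.transform(["MKVLAAGIV"])
print(ensemble.predict(query))
```

With larger n-gram ranges (the paper sweeps 1-grams through 4-grams), `ngram_range` would simply be widened, e.g. `(1, 4)`, at the cost of a much larger feature space.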