Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification

IF 2.6 | Zone 4, Biology | JCR Q2, BIOLOGY | Computational Biology and Chemistry | Pub Date: 2024-07-26 | DOI: 10.1016/j.compbiolchem.2024.108163
Citations: 0

Abstract

The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, functioning optimally in highly acidic environments, hold immense promise for various applications, including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap by employing in silico methods utilizing computational tools and machine learning. We propose a novel approach to predict acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with an accuracy of 0.91, an F1-score of 0.93, and a Matthews correlation coefficient of 0.76. To our knowledge, this is the first application of pre-trained PLM embeddings for acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
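The three metrics quoted in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the paper's test-set results.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1-score, Matthews correlation coefficient)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # F1 is the harmonic mean of precision and recall, written here
    # directly in terms of the confusion-matrix cells.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells; by convention it is set to 0 when any
    # row or column of the confusion matrix is empty.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return accuracy, f1, mcc


# Hypothetical counts for an imbalanced test set (NOT from the paper).
acc, f1, mcc = binary_metrics(tp=70, fp=5, fn=5, tn=20)
print(f"accuracy={acc:.2f}  F1={f1:.2f}  MCC={mcc:.2f}")
```

With imbalanced classes such as these hypothetical counts, F1 can exceed accuracy while MCC stays noticeably lower, because MCC also penalizes errors on the minority class. This is one reason to report all three metrics together.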

Source journal
Computational Biology and Chemistry (Biology / Computer Science: Interdisciplinary Applications)
CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days
About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High-quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high-quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, molecular dynamics simulations along with detailed free energy calculations, for example, should be used as complementary techniques to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.