Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification

IF 2.6 | Zone 4 (Biology) | JCR Q2 (BIOLOGY) | Computational Biology and Chemistry | Pub Date: 2024-07-26 | DOI: 10.1016/j.compbiolchem.2024.108163

Publisher page: https://www.sciencedirect.com/science/article/pii/S1476927124001518

Citations: 0

Abstract

The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, which function optimally in highly acidic environments, hold immense promise for applications including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap with in silico methods that combine computational tools and machine learning. We propose a novel approach to predicting acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with an accuracy of 0.91, an F1-score of 0.93, and a Matthews correlation coefficient of 0.76. To our knowledge, this is the first application of pre-trained PLM embeddings to acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
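The pipeline the abstract describes (a fixed-length embedding per protein, a logistic regression classifier, and evaluation by accuracy, F1-score, and Matthews correlation coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation: the 3-dimensional vectors below are made-up stand-ins for ProtT5 embeddings, the labels are invented, and every function name (`train_logreg`, `predict`, `binary_metrics`) is hypothetical. The classifier is fitted with plain batch gradient descent purely to keep the example dependency-free.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(x, w, b):
    """Linear score w.x + b for one embedding vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_logreg(X, y, lr=1.0, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Label 1 (acidophilic) when the predicted probability is at least 0.5."""
    return [1 if score(x, w, b) >= 0.0 else 0 for x in X]

def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1, Matthews correlation coefficient) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

# Toy stand-ins for per-protein embeddings (assumption: real PLM embeddings
# are far higher-dimensional). Label 1 = acidophilic, 0 = non-acidophilic.
X_train = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.7], [0.7, 0.3, 0.9],
           [0.1, 0.9, 0.2], [0.2, 0.8, 0.3], [0.3, 0.7, 0.1]]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[0.85, 0.15, 0.75], [0.15, 0.85, 0.25]]
y_test = [1, 0]

w, b = train_logreg(X_train, y_train)
test_pred = predict(X_test, w, b)
acc, f1, mcc = binary_metrics(y_test, test_pred)
print(f"accuracy={acc:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

Reporting MCC alongside accuracy and F1 is useful when the two classes are unevenly represented, because MCC uses all four confusion-matrix counts, whereas F1 ignores true negatives.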




Source journal

Computational Biology and Chemistry (Biology / Computer Science: Interdisciplinary Applications)

CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days

About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.