LMCrot: An enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.

IF 4.4 3区 生物学 Q1 BIOCHEMICAL RESEARCH METHODS Bioinformatics Pub Date : 2024-04-25 DOI:10.1093/bioinformatics/btae290
Pawel Pratyush, Soufia Bahmani, Suresh Pokharel, Hamid D Ismail, Dukka B Kc
{"title":"LMCrot: An enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.","authors":"Pawel Pratyush, Soufia Bahmani, Suresh Pokharel, Hamid D Ismail, Dukka B Kc","doi":"10.1093/bioinformatics/btae290","DOIUrl":null,"url":null,"abstract":"MOTIVATION\nRecent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from Protein Language Models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted.\n\n\nRESULTS\nHerein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from supervised embedding layer, through an intermediate-fusion stacked generalization approach, using an n-mer window sequence (or, peptide fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.\n\n\nAVAILABILITY AND IMPLEMENTATION\nLMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":null,"pages":null},"PeriodicalIF":4.4000,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae290","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

MOTIVATION Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from Protein Language Models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. RESULTS Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from supervised embedding layer, through an intermediate-fusion stacked generalization approach, using an n-mer window sequence (or, peptide fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets. AVAILABILITY AND IMPLEMENTATION LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
LMCrot:通过利用基于转换器的蛋白质语言模型的可解释窗口级嵌入,增强蛋白质巴豆酰化位点预测器。
动机自然语言处理领域的最新进展凸显了蛋白质语言模型(pLMs)的全局上下文表征在众多下游任务中的有效性。结果在本文中,我们采用了一系列利用 pLM 的方法,尝试了不同的输入序列类型(全长蛋白质序列和窗口序列),评估了利用兴趣点的每残基嵌入以及以兴趣点为中心的窗口残基嵌入的影响。基于这些见解,我们开发了一种新型残差 ConvBiLSTM 网络,旨在处理 ProtT5-XL-UniRef50 pLM 使用全长序列作为输入生成的兴趣点窗口级嵌入。这个被称为 T5ResConvBiLSTM 的模型在三个不同数据集上的性能超过了现有的最先进 Kcr 预测器。为了验证我们利用基于全序列的窗口级嵌入的方法,我们还通过两种方式深入研究了 ProtT5 衍生的嵌入张量的可解释性:首先,我们仔细研究了从变换器编码器块中获得的注意力权重;其次,我们计算了这些张量的 SHAP 值,为预测结果提供了与模型无关的解释。此外,我们还通过中间融合堆叠泛化方法,使用 n 聚体窗口序列(或肽片段),加入了两个额外的局部表征,一个来自氨基酸特性,另一个来自监督嵌入层,从而增强了 ProtT5 的潜在表征。由此产生的堆叠模型被命名为 LMCrot,在测试数据集的预测性能方面有了更明显的提高。可用性和实施LMCrot 可在 https://github.com/KCLabMTU/LMCrot 网站上公开获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Bioinformatics
Bioinformatics 生物-生化研究方法
CiteScore
11.20
自引率
5.20%
发文量
753
审稿时长
2.1 months
期刊介绍: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal - Discovery Notes and Application Notes- focus on shorter papers; the former reporting biologically interesting discoveries using computational methods, the latter exploring the applications used for experiments.
期刊最新文献
MEHunter: Transformer-based mobile element variant detection from long reads PQSDC: a parallel lossless compressor for quality scores data via sequences partition and Run-Length prediction mapping. MUSE-XAE: MUtational Signature Extraction with eXplainable AutoEncoder enhances tumour types classification. CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics CORDAX web server: An online platform for the prediction and 3D visualization of aggregation motifs in protein sequences.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1