In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Bioinformatics Advances · IF 2.4 · Q2 (Mathematical & Computational Biology) · Published: 2024-08-17 · eCollection: 2024-01-01 · DOI: 10.1093/bioadv/vbae119
Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu
{"title":"In the twilight zone of protein sequence homology: do protein language models learn protein structure?","authors":"Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu","doi":"10.1093/bioadv/vbae119","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.</p><p><strong>Results: </strong>We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the \"twilight zone\" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.</p><p><strong>Availability and implementation: </strong>We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344590/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbae119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Motivation: Protein language models based on the transformer architecture achieve steadily improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether, and to what extent, the sequence representations learned by protein language models encode structural information.

Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity (commonly taken as below roughly 25-35% pairwise identity). Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
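
To make the zero-shot protocol described above concrete, the sketch below ranks candidate proteins against a query by the cosine similarity of mean-pooled embeddings from a pretrained protein language model. This is a minimal illustration, not the authors' pipeline: it assumes the fair-esm package and the ESM-2 650M checkpoint, the toy sequences are hypothetical, and the paper's exact models, pooling, and scoring may differ.

```python
# Minimal sketch: zero-shot remote homology detection with a protein
# language model, scored by cosine similarity of mean-pooled embeddings.
# Assumes the `fair-esm` package (pip install fair-esm); the paper's
# exact models, pooling, and scoring may differ.
import torch
import esm

# Load a pretrained ESM-2 model (650M parameters) and its alphabet.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Hypothetical query and candidate sequences (for illustration only).
sequences = [
    ("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("candidate_1", "MKSAYITKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("candidate_2", "GSHMGDVEKGKKIFIMKCSQCHTVEKGGKHKTG"),
]
_, _, tokens = batch_converter(sequences)

# Extract per-residue representations from the final (33rd) layer.
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]  # (batch, seq_len, hidden)

# Mean-pool over real residues only (skip BOS/EOS/padding tokens).
embeddings = torch.stack(
    [reps[i, 1 : len(seq) + 1].mean(dim=0) for i, (_, seq) in enumerate(sequences)]
)

# Rank candidates against the query: in a zero-shot setting, higher
# embedding similarity is taken as evidence of homology.
query, candidates = embeddings[0], embeddings[1:]
sims = torch.nn.functional.cosine_similarity(query.unsqueeze(0), candidates)
for (name, _), s in zip(sequences[1:], sims):
    print(f"{name}: cosine similarity = {s.item():.3f}")
```

No fine-tuning occurs here: homology is inferred purely from similarity in the frozen model's representation space, which is precisely where the twilight-zone weakness reported above would surface.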

Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
