Protein domain embeddings for fast and accurate similarity search

Genome Research | IF 5.5 | Zone 2 (Biology) | Q1 (Biochemistry & Molecular Biology) | Pub Date: 2024-09-05 | DOI: 10.1101/gr.279127.124

Benjamin Giovanni Iovino, Haixu Tang, Yuzhen Ye

Abstract

Recently developed protein language models have enabled a variety of applications through the contextual protein embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches for similar proteins, but limitations have been found: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multi-domain proteins. Here we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the discrete cosine transformation to the vectorized embeddings of the residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses contact maps predicted by ESM-2 for domain segmentation; the segmentation problem is solved with a recursive cut algorithm (RecCut for short) in time quadratic in the protein length, whereas an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between proteins that share only local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
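The two ways of turning per-residue embeddings into a fixed-size vector mentioned in the abstract can be illustrated with a minimal sketch. This is not the DCTdomain implementation: the function names, the number of retained coefficients (3 x 8), the random stand-in embeddings, and the domain boundaries are all assumptions made for illustration, and the sketch keeps raw low-frequency DCT coefficients without any further post-processing the authors may apply.

```python
import numpy as np


def dct_basis(n: int, k: int) -> np.ndarray:
    """Return the first k rows of the orthonormal DCT-II basis for length-n signals."""
    if k > n:
        raise ValueError("cannot keep more coefficients than input samples")
    positions = np.arange(n)
    freqs = np.arange(k)[:, None]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (positions + 0.5) * freqs / n)
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so rows are orthonormal
    return basis  # shape (k, n)


def mean_pool(emb: np.ndarray) -> np.ndarray:
    """Average residue embeddings: (L, D) -> (D,)."""
    return emb.mean(axis=0)


def dct_fingerprint(emb: np.ndarray, rows: int = 3, cols: int = 8) -> np.ndarray:
    """Apply a 2D DCT to an (L, D) embedding matrix and keep only the
    low-frequency rows x cols block, flattened to a vector of fixed size.
    The block size is an illustrative assumption, not the authors' setting."""
    length, dim = emb.shape
    coeffs = dct_basis(length, rows) @ emb @ dct_basis(dim, cols).T
    return coeffs.ravel()  # shape (rows * cols,), independent of L


# Stand-in data: random numbers in place of real language-model embeddings.
rng = np.random.default_rng(0)
protein = rng.normal(size=(115, 16))   # 115 residues, 16-dimensional embeddings
domains = [(0, 40), (40, 115)]         # hypothetical domain boundaries

# One fixed-size vector per domain, regardless of domain length.
fingerprints = [dct_fingerprint(protein[start:end]) for start, end in domains]
```

Because each domain yields a vector of the same length, domains of different sizes can be compared directly, for example with a distance between their fingerprints; mean pooling also gives a fixed-size vector but discards the order of residues along the sequence.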

Source journal

Genome Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 12.40
Self-citation rate: 1.40%
Articles published: 140
Review time: 6 months

About the journal: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.