PROVAL: A framework for comparison of protein sequence embeddings

Journal of Computational Mathematics and Data Science; Pub Date: 2022-06-01; DOI: 10.1016/j.jcmds.2022.100044

Philipp Väth, Maximilian Münch, Christoph Raab, F.-M. Schleif

Volume 3, Article 100044. Full text: https://www.sciencedirect.com/science/article/pii/S2772415822000128

Citations: 4

Abstract

High-throughput sequencing technology has led to a significant increase in the number of generated protein sequences, and the anchor database UniProt doubles in size approximately every two years. This large set of annotated data is used by many bioinformatics algorithms. Searching within these databases, typically without using any annotations, is challenging because the entries vary in length and the comparison measures used are non-standard. A promising strategy to address these issues is to find fixed-length, information-preserving representations of the variable-length protein sequences. A systematic algorithmic evaluation of the proposed approaches is, however, surprisingly missing. In this work, we analyze how different algorithms perform in generating general protein sequence representations and provide a thorough evaluation framework, PROVAL. The strategies range from a proximity representation using the classical Smith–Waterman algorithm to state-of-the-art embedding techniques based on transformer networks. The methods are evaluated by, e.g., molecular function classification, embedding space visualization, computational complexity, and carbon footprint.
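To make the idea of a fixed-length proximity representation concrete, the following minimal Python sketch maps each variable-length sequence to a vector of Smith–Waterman local alignment scores against a fixed set of reference sequences. It is an illustration under stated assumptions, not the PROVAL implementation: the function names, the toy sequences, and the simple match/mismatch/linear-gap scoring (used here in place of a substitution matrix such as BLOSUM) are all hypothetical choices.

```python
from typing import List, Sequence


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    chosen for brevity rather than biological realism.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are clamped at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def proximity_representation(sequences: Sequence[str],
                             references: Sequence[str]) -> List[List[int]]:
    """Embed each sequence as its scores against every reference.

    Every output vector has length len(references), regardless of the
    length of the input sequence.
    """
    return [[smith_waterman_score(s, r) for r in references] for s in sequences]


# Hypothetical toy data, not taken from UniProt.
references = ["MKTAYIAK", "GAVLIPFW", "MKKLLPTA"]
sequences = ["MKTAYIAKQR", "GAV", "PFWMKT"]
vectors = proximity_representation(sequences, references)
for seq, vec in zip(sequences, vectors):
    print(seq, vec)
```

In this sketch the dimensionality of the representation equals the number of reference sequences, so the choice of references governs both the cost (one quadratic-time alignment per sequence and reference pair) and the information retained.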




Source journal

Journal of Computational Mathematics and Data Science

CiteScore

3.00

Self-citation rate

0.00%

Number of articles published