Improved selection of canonical proteins for reference proteomes.

NAR Genomics and Bioinformatics 6(2), lqae066 (IF 4, Q1, Genetics & Heredity). Pub Date: 2024-06-08. eCollection Date: 2024-06-01. DOI: 10.1093/nargab/lqae066
Giuseppe Insana, Maria J Martin, William R Pearson
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11165316/pdf/

Abstract

The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
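
The contrast between the longest-sequence rule and a gap-based choice can be illustrated with a small Python sketch. This is not the ortho2tree implementation: the alignment is a hand-made toy, the function names are hypothetical, and a brute-force search over isoform combinations stands in for the gap-distance tree and clade search described in the abstract. The gap distance used here (columns where exactly one of two aligned sequences has a gap) is likewise an assumed, simplified definition.

```python
from itertools import combinations, product


def gap_distance(a: str, b: str) -> int:
    """Count alignment columns where exactly one of the two sequences has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum((x == "-") != (y == "-") for x, y in zip(a, b))


def ungapped_length(seq: str) -> int:
    """Length of the sequence once alignment gaps are removed."""
    return len(seq) - seq.count("-")


def longest_per_organism(isoforms):
    """Baseline rule: pick the longest isoform for each organism."""
    return {
        org: max(forms, key=lambda name: ungapped_length(forms[name]))
        for org, forms in isoforms.items()
    }


def lowest_gap_cost_choice(isoforms):
    """Pick one isoform per organism, minimizing the summed pairwise gap distance.

    Brute force over all combinations; only practical for toy inputs.
    """
    orgs = sorted(isoforms)
    best_cost, best_choice = None, None
    for names in product(*(sorted(isoforms[o]) for o in orgs)):
        seqs = [isoforms[o][n] for o, n in zip(orgs, names)]
        cost = sum(gap_distance(s, t) for s, t in combinations(seqs, 2))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, dict(zip(orgs, names))
    return best_choice, best_cost


# Hypothetical pre-aligned isoforms for one orthologous group (made-up data).
toy = {
    "human": {
        "h_short": "MKTAYIAKQR------LLG",
        "h_long": "MKTAYIAKQRGSPEEALLG",
    },
    "mouse": {"m_short": "MKTAYIAKQR------LLG"},
    "rat": {"r_short": "MKSAYIAKQR------LLG"},
}

longest = longest_per_organism(toy)
choice, cost = lowest_gap_cost_choice(toy)
print("longest rule:", longest)
print("lowest gap cost:", choice, "cost =", cost)
```

In this toy group the longest rule selects the 19-residue human isoform, which introduces a six-column gap against each of the two rodent sequences (summed gap cost 12), whereas the gap-based choice selects the shorter human isoform and yields a gap-free set (cost 0).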
