How to cluster protein sequences: tools, tips and commands

Georgios A. Pavlopoulos
MOJ Proteomics & Bioinformatics, published 2017-06-02. DOI: 10.15406/MOJPB.2017.05.00174
Citations: 4

Abstract

The protein landscape changes continuously as new and hypothetical proteins appear every day. IMG [1] currently hosts 55,482 bacterial genomes, 1,580 archaeal genomes, 258 eukaryotic genomes, 1,222 plasmids, 7,521 viruses, 1,196 genome fragments and 14,265 private and public metagenomes and metatranscriptomes. As a very rough estimate, this corresponds to ~70 million non-redundant proteins at 100% similarity on the isolate side and ~3 billion non-redundant proteins on the metagenome/metatranscriptome side (coming from scaffolds of length ~500). The 15-Feb-2017 release of UniProtKB/TrEMBL [2] contains 77,483,538 sequence entries: 1,465,039 (2%) archaeal proteins, 49,717,238 (64%) bacterial proteins, 22,299,253 (29%) eukaryotic proteins, 2,918,867 (4%) viral proteins and 1,083,141 (~1%) others. Moreover, UniParc [3], the UniProt Archive, contains 148,791,725 protein entries; it is a comprehensive, non-redundant database that holds most of the publicly available protein sequences in the world. Protein families can be characterized as groups of molecules that share significant sequence similarity [4]. Notably, identifying such families is very difficult, and most available clustering techniques fail for eukaryotic proteins, which contain large numbers of protein domains [5]. Nevertheless, the search for better and more accurate protein clustering remains a very active research field. Pfam [6] version 31.0, for example, a large collection of protein families, organizes proteins into families by similar domains and includes 16,712 entries.
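The taxonomic breakdown quoted for UniProtKB/TrEMBL can be rechecked with a few lines of arithmetic. The sketch below is only illustrative: the counts are copied from the text, while the names `TOTAL`, `COUNTS` and `shares` are assumptions introduced here. The five groups sum exactly to the stated total, and the "others" group works out to roughly 1.4% of all entries.

```python
# Counts quoted in the abstract for UniProtKB/TrEMBL, release 15-Feb-2017.
TOTAL = 77_483_538
COUNTS = {
    "Archaea": 1_465_039,
    "Bacteria": 49_717_238,
    "Eukaryota": 22_299_253,
    "Viruses": 2_918_867,
    "Other": 1_083_141,
}


def shares(counts, total):
    """Return each group's percentage of the total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


SHARES = shares(COUNTS, TOTAL)

for name, pct in SHARES.items():
    print(f"{name:10s} {COUNTS[name]:>12,d}  {pct:5.1f}%")
```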
Several tools today follow various methodologies and strategies to perform protein clustering [7]. Prominent tools such as CD-HIT [8], UCLUST [9], kClust [10] and the newly developed MMseqs/Linclust [11] follow a k-mer and dynamic-programming-based sequence alignment approach, whereas tools such as the MCL [12] clustering algorithm and others use network-topology-based clustering [13–18]. In the second case, a pairwise similarity matrix is required prior to clustering. While such similarities can be calculated in various ways, BLAST+ [19] and LAST [20] are the most widely used. To help users become familiar with several of these tools and to avoid troubleshooting, this article provides simple command lines for performing such analyses.
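The difference between the two families of methods can be illustrated with a toy sketch. This is not how any of the named tools is implemented. As assumptions for illustration only, Jaccard similarity over 3-mers stands in for alignment-based identity, connected components stand in for a real graph algorithm such as MCL, and the function names, the 0.5 threshold and the example sequences are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Route 1: greedy incremental clustering, longest sequence first.

    A sequence joins the first representative it is similar enough to;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = {}   # representative name -> member names
    rep_kmers = {}  # representative name -> its k-mer set
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        profile = kmers(seqs[name], k)
        for rep, rep_profile in rep_kmers.items():
            if jaccard(profile, rep_profile) >= threshold:
                clusters[rep].append(name)
                break
        else:
            rep_kmers[name] = profile
            clusters[name] = [name]
    return clusters


def graph_cluster(seqs, threshold=0.5, k=3):
    """Route 2: compute all pairwise similarities, then cluster the graph.

    Pairs at or above the threshold become edges; clusters are the
    connected components, found with a small union-find structure.
    """
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    profiles = {n: kmers(s, k) for n, s in seqs.items()}
    for a, b in combinations(sorted(seqs), 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in sorted(seqs):
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical toy input: p1 and p2 differ only in the last residue.
EXAMPLE = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRA",
    "p3": "GSSGSSGLLPPWWCCDDEEHH",
}

print(greedy_cluster(EXAMPLE))
print(graph_cluster(EXAMPLE))
```

On this toy input both routes group p1 with p2 and leave p3 on its own. The practical difference is cost: the greedy route compares each sequence only against the current representatives, whereas the graph route needs the full all-against-all similarity computation before clustering can begin.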