Title: How to cluster protein sequences: tools, tips and commands
Author: Georgios A. Pavlopoulos
Journal: MOJ Proteomics & Bioinformatics
Publication type: Journal Article
Publication date: 2017-06-02
DOI: 10.15406/MOJPB.2017.05.00174 (https://doi.org/10.15406/MOJPB.2017.05.00174)
Citations: 4
Abstract
The protein landscape changes continuously as new and hypothetical proteins appear every day. IMG [1] today hosts 55,482 bacterial genomes, 1,580 archaeal genomes, 258 eukaryotic genomes, 1,222 plasmids, 7,521 viruses, 1,196 genome fragments and 14,265 private and public metagenomes and metatranscriptomes. As a very rough estimate, this corresponds to ~70 million non-redundant proteins at 100% similarity on the isolate side and ~3 billion non-redundant proteins on the metagenome/metatranscriptome side (coming from scaffolds of length ~500). The 15-Feb-2017 release of UniProtKB/TrEMBL [2] contains 77,483,538 sequence entries: 1,465,039 (2%) archaeal proteins, 49,717,238 (64%) bacterial proteins, 22,299,253 (29%) eukaryotic proteins, 2,918,867 (4%) viral proteins and 1,083,141 (~1%) others. Moreover, UniParc [3], the UniProt Archive, a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world, holds 148,791,725 protein entries. Protein families can be characterized as groups of molecules that share significant sequence similarity [4]. Notably, grouping proteins into such families is a very difficult biological problem, and most available clustering techniques fail in the case of eukaryotic proteins, which contain large numbers of protein domains [5]. Nevertheless, the search for better and more accurate protein clustering remains a very active research field. For example, Pfam [6] version 31.0, a large collection of protein families, organizes proteins into families by similar domains and includes 16,712 entries.
Several tools today follow various methodologies and strategies to perform protein clustering [7]. Prominent tools such as CD-HIT [8], UCLUST [9], kClust [10] and the newly developed MMseqs/Linclust [11] follow a k-mer and dynamic-programming-based sequence alignment approach, whereas tools such as the MCL [12] clustering algorithm and others [13–18] perform network-topology-based clustering. In the second case, a pairwise similarity matrix is required prior to clustering. While such similarities can be calculated in various ways, BLAST+ [19] and LAST [20] are the most widely used. To help users become familiar with several of these tools and avoid troubleshooting, this article provides simple command lines for performing such analyses.
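To make the two families of approaches described above more concrete, the following toy Python sketch contrasts them on four made-up sequences. It is an illustration under stated assumptions, not the algorithm used by CD-HIT, UCLUST, kClust, MMseqs/Linclust or MCL. The function names (`kmers`, `kmer_similarity`, `greedy_cluster`, `graph_cluster`), the k-mer size and the 0.5 threshold are hypothetical choices made here. The first function clusters greedily with a shared k-mer score only (real tools follow the k-mer step with an alignment); the second builds an all-vs-all similarity graph and takes its connected components, a far simpler stand-in for MCL, with the k-mer score standing in for BLAST+ or LAST scores.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Shared k-mers divided by the size of the smaller k-mer set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Greedy incremental clustering (toy version).

    Sequences are visited longest first. Each one joins the first existing
    representative it is similar enough to; otherwise it becomes a new
    representative. Returns {representative id: [member ids]}.
    """
    clusters = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if kmer_similarity(seqs[name], seqs[rep], k) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def graph_cluster(seqs, threshold=0.5, k=3):
    """Network-style clustering (toy version).

    Computes all pairwise similarities, keeps pairs at or above the
    threshold as edges, and returns the connected components of that graph.
    This is single-linkage grouping, NOT the MCL algorithm.
    """
    names = sorted(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if kmer_similarity(seqs[a], seqs[b], k) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return sorted(groups.values())


# Made-up sequences: p2 is a prefix of p1, p4 is a prefix of p3.
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHF",
    "p3": "GSSGSSGLVPRGSHMASMTGG",
    "p4": "GSSGSSGLVPRGSHM",
}
print(greedy_cluster(toy))
print(graph_cluster(toy))
```

On this toy input both functions group p1 with p2 and p3 with p4. The practical difference is cost: the greedy variant compares each sequence only against current representatives, while the graph variant needs every pairwise similarity up front, which is why the network-based route depends on a precomputed similarity matrix.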