用分层方法对蛋白质功能家族进行大规模聚类。

IF 4.5 3区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY Protein Science Pub Date : 2024-09-01 DOI:10.1002/pro.5140
Nicola Bordin, Harry Scholes, Clemens Rauer, Joel Roca-Martínez, Ian Sillitoe, Christine Orengo
{"title":"用分层方法对蛋白质功能家族进行大规模聚类。","authors":"Nicola Bordin, Harry Scholes, Clemens Rauer, Joel Roca-Martínez, Ian Sillitoe, Christine Orengo","doi":"10.1002/pro.5140","DOIUrl":null,"url":null,"abstract":"<p><p>Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.</p>","PeriodicalId":20761,"journal":{"name":"Protein Science","volume":null,"pages":null},"PeriodicalIF":4.5000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11325189/pdf/","citationCount":"0","resultStr":"{\"title\":\"Clustering protein functional families at large scale with hierarchical approaches.\",\"authors\":\"Nicola Bordin, Harry Scholes, Clemens Rauer, Joel Roca-Martínez, Ian Sillitoe, Christine Orengo\",\"doi\":\"10.1002/pro.5140\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.</p>\",\"PeriodicalId\":20761,\"journal\":{\"name\":\"Protein Science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.5000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11325189/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Protein Science\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1002/pro.5140\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Protein Science","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1002/pro.5140","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

蛋白质是细胞活动的基础,通过其结构和序列揭示其功能和进化。CATH 功能家族(FunFams)是蛋白质结构域序列的连贯群集,其成员之间的功能是一致的。随着 MGnify 或 AlphaFold 数据库等大型资源库带来的蛋白质数据量和复杂性的不断增加,需要更强大的方法来扩展这些新资源的规模。在这项工作中,我们介绍了 MARC 和 FRAN,这两种算法是在 GeMMA/FunFHMMER 的基础上开发的,并解决了 GeMMA/FunFHMMER 的局限性。我们还介绍了 CATH-eMMA,它使用嵌入或 Foldseek 距离从距离矩阵中形成关系树,从而降低了计算要求并有效处理各种数据类型。CATH-eMMA 为蛋白质功能的大规模聚类提供了一种高度稳健且速度更快的工具,为未来蛋白质功能和进化研究提供了一种新工具。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Clustering protein functional families at large scale with hierarchical approaches.

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Protein Science
Protein Science 生物-生化与分子生物学
CiteScore
12.40
自引率
1.20%
发文量
246
审稿时长
1 months
期刊介绍: Protein Science, the flagship journal of The Protein Society, is a publication that focuses on advancing fundamental knowledge in the field of protein molecules. The journal welcomes original reports and review articles that contribute to our understanding of protein function, structure, folding, design, and evolution. Additionally, Protein Science encourages papers that explore the applications of protein science in various areas such as therapeutics, protein-based biomaterials, bionanotechnology, synthetic biology, and bioelectronics. The journal accepts manuscript submissions in any suitable format for review, with the requirement of converting the manuscript to journal-style format only upon acceptance for publication. Protein Science is indexed and abstracted in numerous databases, including the Agricultural & Environmental Science Database (ProQuest), Biological Science Database (ProQuest), CAS: Chemical Abstracts Service (ACS), Embase (Elsevier), Health & Medical Collection (ProQuest), Health Research Premium Collection (ProQuest), Materials Science & Engineering Database (ProQuest), MEDLINE/PubMed (NLM), Natural Science Collection (ProQuest), and SciTech Premium Collection (ProQuest).
期刊最新文献
Biophysical characterization of RelA-p52 NF-κB dimer-A link between the canonical and the non-canonical NF-κB pathway. Correction to "Award Winners and Abstracts of the 30th Anniversary Symposium of the Protein Society, Baltimore, MD, July 16-19, 2016". On the humanization of VHHs: Prospective case studies, experimental and computational characterization of structural determinants for functionality. A coarse-grained model for disordered and multi-domain proteins. dUTP pyrophosphatases from hyperthermophilic eubacterium and archaeon: Structural and functional examinations on the suitability for PCR application.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1