FunPredCATH: An ensemble method for predicting protein function using CATH

IF 2.5; Zone 4 (Biology); Q3, Biochemistry & Molecular Biology; Biochimica et Biophysica Acta. Proteins and Proteomics; Pub Date: 2023-12-19; DOI: 10.1016/j.bbapap.2023.140985
Joseph Bonello, Christine Orengo
Citations: 0

Abstract

Motivation

The number of unannotated proteins in UniProt grows rapidly every year as sequencing methods become more efficient. Experimental annotation of proteins, however, is a lengthy and expensive process. Computational techniques can speed it up by narrowing the search to highly specific Gene Ontology (GO) terms.

Methodology

We propose an ensemble approach that combines three generic base predictors, each of which predicts Gene Ontology terms (BP, CC and MF) from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families; this score serves as an indicator of confidence when assigning a GO term to an uncharacterised protein.
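To make the idea of a prevalence-based confidence score concrete, here is a minimal sketch in Python. It is not the scoring function from the paper, which the abstract does not specify. It assumes the simplest possible reading: the score of a GO term in a family is the fraction of annotated family members that carry the term. The function name `go_term_prevalence` and the toy family are hypothetical.

```python
from collections import Counter


def go_term_prevalence(family_annotations):
    """Return {GO term: fraction of family members annotated with it}.

    family_annotations maps a protein identifier to the set of GO terms
    assigned to that protein. This is an illustrative assumption, not the
    scoring scheme used by FunPredCATH.
    """
    n_members = len(family_annotations)
    if n_members == 0:
        return {}
    # Count each term at most once per protein.
    counts = Counter(
        term for terms in family_annotations.values() for term in set(terms)
    )
    return {term: count / n_members for term, count in counts.items()}


# Toy functional family with four annotated members (made-up identifiers).
funfam = {
    "P1": {"GO:0003824", "GO:0005515"},
    "P2": {"GO:0003824"},
    "P3": {"GO:0003824", "GO:0016787"},
    "P4": {"GO:0005515"},
}
scores = go_term_prevalence(funfam)
```

Under this assumption, a term shared by most family members receives a score close to 1 and could be transferred to an uncharacterised member with higher confidence than a term seen in a single protein.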

Methods

The ensemble uses three base methods. The first is a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated with the same GO term. The second is a set-based method that we developed, which uses set intersection and set union to score the occurrence of GO terms within the same CATH FunFam. The third is FunFams-Plus, a predictor developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge.
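The abstract does not give the formulas behind the statistics-based and set-based methods, so the following Python sketch only illustrates the general kind of calculation each description suggests. For the statistical score it assumes a hypergeometric enrichment test of a term in a family against a background; for the set-based score it assumes a Jaccard-style ratio of intersection to union between the family's members and the proteins annotated with the term. Both choices, and the function names, are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: proteins in the background, K: background proteins with the term,
    n: proteins in the family, k: family proteins with the term.
    A small p-value suggests the term is enriched in the family.
    (Assumed test; the paper's statistic may differ.)
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return tail / total


def jaccard_term_score(family_members, term_annotated):
    """|F intersect T| / |F union T| for family F and term-annotated set T.

    The score is high when the term is both common in the family and
    rarely seen outside it. (Assumed formula, for illustration only.)
    """
    union = family_members | term_annotated
    if not union:
        return 0.0
    return len(family_members & term_annotated) / len(union)


# Toy numbers: background of 10 proteins, 4 carry the term;
# the family has 3 members, 2 of which carry the term.
p_value = hypergeom_enrichment_pvalue(k=2, n=3, K=4, N=10)

family = {"P1", "P2", "P3", "P4"}
with_term = {"P1", "P2", "P3", "Q9"}
set_score = jaccard_term_score(family, with_term)
```

The two scores answer slightly different questions: the enrichment test asks whether the term occurs in the family more often than chance would predict, while the set ratio also penalises terms that are widespread outside the family.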

Evaluation

We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics, together with the CAFA3 benchmark datasets, to evaluate our models and compare them with the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods across the different ontologies and benchmarks.
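For readers unfamiliar with the metric, Fmax is the maximum, over score thresholds, of the harmonic mean of precision and recall. The sketch below follows the protein-centric definition commonly used in CAFA: precision is averaged over proteins with at least one prediction at the threshold, and recall over all benchmark proteins. It is a simplified illustration that assumes GO terms have already been propagated to their ancestors and that every benchmark protein has at least one true term; it is not the official CAFA evaluation code, and the toy data are made up.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {GO term: score in [0, 1]}}
    truth:       {protein: non-empty set of true GO terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision only counts proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
    return best


# Toy benchmark with two proteins (made-up terms and scores).
truth = {"A": {"g1", "g2"}, "B": {"g3"}}
predictions = {
    "A": {"g1": 0.9, "g2": 0.4, "g4": 0.3},
    "B": {"g3": 0.6, "g1": 0.5},
}
result = fmax(predictions, truth)
```

Because precision ignores proteins without predictions while recall does not, a method that predicts for few proteins can reach high precision yet a low Fmax, which is one way coverage and Fmax can trade off against each other.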

Contributions

FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
