HMMeta

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2020-09-21. DOI: 10.1145/3388440.3414702

Sola Gbenro, Kyle Hippe, Renzhi Cao


Citations: 2

Abstract

As genomic product data accumulates much faster than it can be annotated, computational analysis of protein function has never been more important. In this research, we introduce HMMeta, a novel protein function prediction method based on Hidden Markov Models (HMMs), a prominent technique in natural language prediction. Using a new representation of protein sequences as a language, we trained one HMM for each Gene Ontology (GO) term taken from the UniProt database; with 27,451 unique GO IDs in total, this produced 27,451 Hidden Markov Models. We employed data augmentation to artificially increase the number of protein sequences associated with GO terms that have few sequences in the database, which helped balance the number of protein sequences per GO term. Predictions are made by scoring the query sequence against every model. The models that score within eighty percent of the top-scoring model, or the 75 highest-scoring models, whichever set is smaller, represent the functions most associated with the given sequence. We benchmarked our method in the latest Critical Assessment of protein Function Annotation (CAFA 4) experiment as CaoLab2, and we also compared HMMeta with several other protein function prediction methods on a subset of the UniProt database. In our evaluation, HMMeta achieved favorable results as a sequence-based method and outperformed a few notable methods in some categories, which shows great potential for automated protein function prediction. The tool is available at https://github.com/KPHippe/HMM-For-Protein-Prediction.
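
The selection rule described in the abstract can be sketched in a few lines of Python. This is only an illustrative reading, not the authors' implementation: the function name `select_go_terms`, its parameters, and the example scores are hypothetical, and the sketch assumes that model scores are positive and that "within eighty percent" means a score of at least 0.8 times the top score. If the tool works with log-likelihoods, which are typically negative, the threshold would need to be defined differently.

```python
from typing import Dict, List, Tuple


def select_go_terms(
    scores: Dict[str, float],
    fraction: float = 0.8,
    max_terms: int = 75,
) -> List[Tuple[str, float]]:
    """Pick the GO terms whose models score close to the best model.

    Assumption (not confirmed by the source): scores are positive and
    "within eighty percent" means score >= fraction * top_score.
    """
    if not scores:
        return []

    # Rank GO terms from highest to lowest model score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_score = ranked[0][1]

    # Keep only the models that reach the assumed threshold.
    within = [(go_id, s) for go_id, s in ranked if s >= fraction * top_score]

    # "Whichever is less": never return more than max_terms models.
    return within[:max_terms]


# Hypothetical scores for one query sequence (made-up values).
example_scores = {
    "GO:0003674": 10.0,
    "GO:0005575": 9.0,
    "GO:0008150": 8.5,
    "GO:0005515": 7.0,
}
selected = select_go_terms(example_scores)
print([go_id for go_id, _ in selected])
```

Slicing the thresholded list to `max_terms` gives the smaller of the two candidate sets, because the top 75 models by rank and the thresholded models are both prefixes of the same ranking.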
