Gene Ontology based protein functional annotation using pretrained embeddings
Thi Thuy Duong Vu, Jaehee Jung
2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022-12-06
DOI: 10.1109/BIBM55620.2022.9995108
The Gene Ontology (GO) database contains approximately 40,000 terms arranged in a hierarchical structure. These terms primarily describe protein functions and are used in bioinformatics to automatically predict a protein's functions from its sequence. Recently, several models, such as ProtBert and ProteinBERT, have been studied that predict protein functions by fine-tuning a model pretrained on protein sequences with a self-supervised deep learning method. We propose two models that annotate proteins with GO terms using features extracted by the pretrained ProtBert model. Additionally, we customize the ProteinBERT model and fine-tune it to predict GO terms. Experiments show that protein embeddings created with pretrained transformer models are an effective data source for sequence-based prediction tasks, in particular protein function prediction. The proposed models accept flexible sequence lengths and outperform the comparison methods.
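The "flexible sequence lengths" property typically comes from pooling a transformer's per-residue embeddings into a fixed-size feature vector before classification. The sketch below illustrates this idea only; the embedding function is a mock stand-in (the paper uses the actual ProtBert encoder), and the embedding dimension of 1024 matches ProtBert's hidden size.

```python
import numpy as np

EMBED_DIM = 1024  # ProtBert's per-residue hidden size

def mock_protbert_embed(sequence: str, rng) -> np.ndarray:
    # Stand-in for a real ProtBert forward pass:
    # returns one embedding vector per residue, shape (len(seq), EMBED_DIM).
    return rng.standard_normal((len(sequence), EMBED_DIM))

def pool_embedding(residue_embeddings: np.ndarray) -> np.ndarray:
    # Mean-pool over the residue axis: a fixed-length vector
    # regardless of how long the input sequence is.
    return residue_embeddings.mean(axis=0)

rng = np.random.default_rng(0)
seqs = ["MKTAYIAKQR", "MSVLTPLLLRGLTGSARRLPVPRAKIHSL"]  # different lengths
feats = np.stack([pool_embedding(mock_protbert_embed(s, rng)) for s in seqs])
print(feats.shape)  # one fixed-size feature vector per protein
```

The pooled matrix `feats` has the same second dimension for every protein, so it can feed a downstream multi-label classifier over GO terms without padding or truncating sequences.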