BERTDom: Protein Domain Boundary Prediction Using BERT

Impact Factor: 0.7 | Tier 4 (Computer Science) | JCR Q4 (Computer Science, Artificial Intelligence) | Computing and Informatics | Pub Date: 2023-01-01 | DOI: 10.31577/cai_2023_3_667
Ahmad Haseeb, Maryam Bashir, Aamir Wali
{"title":"BERT:基于BERT的蛋白质结构域边界预测","authors":"Ahmad Haseeb, Maryam Bashir, Aamir Wali","doi":"10.31577/cai_2023_3_667","DOIUrl":null,"url":null,"abstract":". The domains of a protein provide an insight on the functions that the protein can perform. Delineation of proteins using high-throughput experimental methods is difficult and a time-consuming task. Template-free and sequence-based computational methods that mainly rely on machine learning techniques can be used. However, some of the drawbacks of computational methods are low accuracy and their limitation in predicting different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDOM uses BERT for feature representation and stacked bi-directional long short term memory for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from UniProt knowledge base with reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with protein-to-vector (Pro2Vec) feature representation that uses word2vec to encode protein bio-words. For testing, three other bench-marked datasets were used. The experimental re-sults on benchmarks datasets show that BERTDom produces the best F-score as compared to other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code .","PeriodicalId":55215,"journal":{"name":"Computing and Informatics","volume":"42 1","pages":"667-689"},"PeriodicalIF":0.7000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BERTDom: Protein Domain Boundary Prediction Using BERT\",\"authors\":\"Ahmad Haseeb, Maryam Bashir, Aamir Wali\",\"doi\":\"10.31577/cai_2023_3_667\",\"DOIUrl\":null,\"url\":null,\"abstract\":\". The domains of a protein provide an insight on the functions that the protein can perform. Delineation of proteins using high-throughput experimental methods is difficult and a time-consuming task. Template-free and sequence-based computational methods that mainly rely on machine learning techniques can be used. However, some of the drawbacks of computational methods are low accuracy and their limitation in predicting different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDOM uses BERT for feature representation and stacked bi-directional long short term memory for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from UniProt knowledge base with reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with protein-to-vector (Pro2Vec) feature representation that uses word2vec to encode protein bio-words. For testing, three other bench-marked datasets were used. 
The experimental re-sults on benchmarks datasets show that BERTDom produces the best F-score as compared to other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code .\",\"PeriodicalId\":55215,\"journal\":{\"name\":\"Computing and Informatics\",\"volume\":\"42 1\",\"pages\":\"667-689\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computing and Informatics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.31577/cai_2023_3_667\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computing and Informatics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.31577/cai_2023_3_667","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The domains of a protein provide insight into the functions that the protein can perform. Delineation of protein domains using high-throughput experimental methods is a difficult and time-consuming task. Template-free, sequence-based computational methods that rely mainly on machine learning techniques can be used instead. However, computational methods suffer from low accuracy and are limited in predicting different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDom uses BERT for feature representation and a stacked bidirectional long short-term memory (BiLSTM) network for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from the UniProt knowledge base with reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with the protein-to-vector (Pro2Vec) feature representation, which uses word2vec to encode protein bio-words. For testing, three other benchmarked datasets were used. The experimental results on the benchmark datasets show that BERTDom produces the best F-score compared to other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT, used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code.
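Note (illustrative only): the abstract describes the pipeline at a high level, so the following is a minimal sketch of what "BERT features feeding a stacked bidirectional LSTM classifier" could look like in PyTorch with the HuggingFace transformers library. The class name BERTDomSketch, the per-residue boundary/non-boundary output formulation, the vocabulary size, and all hyper-parameters are assumptions for illustration; they are not taken from the paper or the authors' repository.

# Hypothetical sketch (not the authors' released code): a BERT encoder trained from
# scratch on protein sequences, followed by a stacked bidirectional LSTM that labels
# each position as domain boundary or non-boundary. All values below are guesses.
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class BERTDomSketch(nn.Module):
    def __init__(self, vocab_size=8000, max_len=512, hidden=256, lstm_layers=2):
        super().__init__()
        # Randomly initialized BERT, to be pre-trained from scratch on protein text.
        config = BertConfig(
            vocab_size=vocab_size,
            hidden_size=hidden,
            num_hidden_layers=6,
            num_attention_heads=8,
            intermediate_size=4 * hidden,
            max_position_embeddings=max_len,
        )
        self.bert = BertModel(config)
        # Stacked bidirectional LSTM over the per-token BERT features.
        self.bilstm = nn.LSTM(
            input_size=hidden,
            hidden_size=hidden,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Per-position binary decision: boundary vs. non-boundary.
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, input_ids, attention_mask):
        features = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(features)
        return self.classifier(lstm_out)  # (batch, seq_len, 2) logits


if __name__ == "__main__":
    # Toy forward pass with random token ids.
    model = BERTDomSketch()
    ids = torch.randint(0, 8000, (1, 128))
    mask = torch.ones_like(ids)
    print(model(ids, mask).shape)  # torch.Size([1, 128, 2])

Similarly, a hedged sketch of the Pro2Vec-style representation mentioned in the abstract: sequences are split into overlapping k-mer "bio-words" (k = 3 is an assumption borrowed from the ProtVec literature, not stated in the abstract) and embedded with gensim's word2vec implementation. Either representation would then feed a classifier like the one above.

# Hypothetical Pro2Vec-style encoding of protein "bio-words" with word2vec (gensim).
from gensim.models import Word2Vec

def to_biowords(seq, k=3):
    """Split a protein sequence into overlapping k-mer tokens (k=3 assumed)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

corpus = [to_biowords("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # toy one-sequence corpus
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
print(w2v.wv["MKT"].shape)  # (100,) embedding for one bio-word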
Source Journal
Computing and Informatics (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 1.60
Self-citation rate: 14.30%
Articles published: 19
Review time: 9 months
Journal Description: Main journal topics: computer architectures and networking; parallel and distributed computing; theoretical foundations; software engineering; knowledge and information engineering. Apart from the main topics given above, the Editorial Board welcomes papers from other areas of computing and informatics.
Latest Articles in This Journal
Attribute-Based Access Control Policy Generation Approach from Access Logs Based on the CatBoost
Classification of Sentiment Using Optimized Hybrid Deep Learning Model
BERTDom: Protein Domain Boundary Prediction Using BERT
Adaptive Evolutionary Multitasking to Solve Inter-Domain Path Computation Under Node-Defined Domain Uniqueness Constraint: New Solution Encoding Scheme
mTreeIllustrator: A Mixed-Initiative Framework for Visual Exploratory Analysis of Multidimensional Hierarchical Data