deep-Sep: a deep learning-based method for fast and accurate prediction of selenoprotein genes in bacteria.

mSystems (IF 4.6; Zone 2, Biology; JCR Q1, Microbiology) Pub Date: 2025-04-22 Epub Date: 2025-03-10 DOI: 10.1128/msystems.01258-24

Yao Xiao, Yan Zhang

mSystems, article e0125824. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12013277/pdf/

Citation count: 0

Abstract

Selenoproteins are a special group of proteins with major roles in cellular antioxidant defense. They contain the 21st amino acid, selenocysteine (Sec), in their active sites, which is encoded by an in-frame UGA codon. Compared to eukaryotes, identification of selenoprotein genes in bacteria remains challenging due to the absence of an effective strategy for distinguishing the Sec-encoding UGA codon from a normal stop signal. In this study, we have developed a deep learning-based algorithm, deep-Sep, for quickly and precisely identifying selenoprotein genes in bacterial genomic sequences. This algorithm uses a Transformer-based neural network architecture to construct an optimal model for detecting Sec-encoding UGA codons and a homology search-based strategy to remove additional false positives. During the training and testing stages, deep-Sep demonstrated commendable performance, including an F₁ score of 0.939 and an area under the receiver operating characteristic curve of 0.987. Furthermore, when applied to 20 bacterial genomes as independent test data sets, deep-Sep exhibited remarkable capability in identifying both known and new selenoprotein genes, significantly outperforming the existing state-of-the-art method. Our algorithm has proved to be a powerful tool for comprehensively characterizing selenoprotein genes in bacterial genomes, which should not only assist in accurate annotation of selenoprotein genes in genome sequencing projects but also provide new insights for a deeper understanding of the roles of selenium in bacteria.

Importance

Selenium is an essential micronutrient present in selenoproteins in the form of Sec, which is a rare amino acid encoded by the opal stop codon UGA. Identification of all selenoproteins is of vital importance for investigating the functions of selenium in nature. Previous strategies for predicting selenoprotein genes mainly relied on the identification of a special cis-acting Sec insertion sequence (SECIS) element within mRNAs. However, due to the complexity and variability of SECIS elements, recognition of all selenoprotein genes in bacteria is still a major challenge in the annotation of bacterial genomes. We have developed a deep learning-based algorithm to predict selenoprotein genes in bacterial genomic sequences, which demonstrates superior performance compared to currently available methods. This algorithm can be used in either web-based or local (standalone) mode, serving as a promising tool for identifying the complete set of selenoprotein genes in bacteria.
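To make the problem setup concrete, the following is a minimal sketch of the candidate-enumeration step that any UGA classifier would plausibly need: listing every TGA codon (the DNA form of UGA) on the forward strand together with its flanking sequence. It is not the deep-Sep implementation; the function name `find_uga_candidates`, the `UgaCandidate` record, and the `flank` window size are all hypothetical choices made for illustration, and the abstract does not state how deep-Sep builds its inputs.

```python
from typing import Iterator, NamedTuple


class UgaCandidate(NamedTuple):
    """Hypothetical record for one TGA codon found in a DNA sequence."""
    position: int  # 0-based index of the T in TGA
    frame: int     # forward-strand reading frame: 0, 1, or 2
    window: str    # codon plus up to `flank` nucleotides on each side


def find_uga_candidates(dna: str, flank: int = 30) -> Iterator[UgaCandidate]:
    """Yield every forward-strand TGA codon, frame by frame.

    Each candidate carries a flanking window that a downstream model
    (assumed here, not specified by the source) could score as either
    Sec-encoding or a true stop signal.
    """
    dna = dna.upper()
    for frame in range(3):
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] == "TGA":
                start = max(0, pos - flank)
                end = min(len(dna), pos + 3 + flank)
                yield UgaCandidate(pos, frame, dna[start:end])


if __name__ == "__main__":
    for cand in find_uga_candidates("ATGAAATGACCC", flank=3):
        print(cand)
```

A real pipeline would also scan the reverse complement and restrict candidates to plausible open reading frames; both are omitted here to keep the sketch short.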

Source journal

mSystems (Biochemistry, Genetics and Molecular Biology: Biochemistry)

CiteScore: 10.50

Self-citation rate: 3.10%

Articles published: 308

Review time: 13 weeks

Journal description: mSystems™ will publish preeminent work that stems from applying technologies for high-throughput analyses to achieve insights into the metabolic and regulatory systems at the scale of both the single cell and microbial communities. The scope of mSystems™ encompasses all important biological and biochemical findings drawn from analyses of large data sets, as well as new computational approaches for deriving these insights. mSystems™ will welcome submissions from researchers who focus on the microbiome, genomics, metagenomics, transcriptomics, metabolomics, proteomics, glycomics, bioinformatics, and computational microbiology. mSystems™ will provide streamlined decisions, while carrying on ASM's tradition of rigorous peer review.