deep-Sep: a deep learning-based method for fast and accurate prediction of selenoprotein genes in bacteria.

IF 4.6 · Zone 2 (Biology) · JCR Q1 (Microbiology) · mSystems · Pub Date: 2025-04-22 · Epub Date: 2025-03-10 · DOI: 10.1128/msystems.01258-24
Yao Xiao, Yan Zhang
Journal: mSystems, article e0125824. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12013277/pdf/
Citations: 0

Abstract

Selenoproteins are a special group of proteins with major roles in cellular antioxidant defense. They contain the 21st amino acid selenocysteine (Sec) in the active sites, which is encoded by an in-frame UGA codon. Compared to eukaryotes, identification of selenoprotein genes in bacteria remains challenging due to the absence of an effective strategy for distinguishing the Sec-encoding UGA codon from a normal stop signal. In this study, we have developed a deep learning-based algorithm, deep-Sep, for quickly and precisely identifying selenoprotein genes in bacterial genomic sequences. This algorithm uses a Transformer-based neural network architecture to construct an optimal model for detecting Sec-encoding UGA codons and a homology search-based strategy to remove additional false positives. During the training and testing stages, deep-Sep has demonstrated commendable performance, including an F1 score of 0.939 and an area under the receiver operating characteristic curve of 0.987. Furthermore, when applied to 20 bacterial genomes as independent test data sets, deep-Sep exhibited remarkable capability in identifying both known and new selenoprotein genes, which significantly outperforms the existing state-of-the-art method. Our algorithm has proved to be a powerful tool for comprehensively characterizing selenoprotein genes in bacterial genomes, which should not only assist in accurate annotation of selenoprotein genes in genome sequencing projects but also provide new insights for a deeper understanding of the roles of selenium in bacteria.

Importance

Selenium is an essential micronutrient present in selenoproteins in the form of Sec, which is a rare amino acid encoded by the opal stop codon UGA. Identification of all selenoproteins is of vital importance for investigating the functions of selenium in nature. Previous strategies for predicting selenoprotein genes mainly relied on the identification of a special cis-acting Sec insertion sequence (SECIS) element within mRNAs. However, due to the complexity and variability of SECIS elements, recognition of all selenoprotein genes in bacteria is still a major challenge in the annotation of bacterial genomes. We have developed a deep learning-based algorithm to predict selenoprotein genes in bacterial genomic sequences, which demonstrates superior performance compared to currently available methods. This algorithm can be utilized in either web-based or local (standalone) modes, serving as a promising tool for identifying the complete set of selenoprotein genes in bacteria.
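
The abstract frames the task as deciding, for each in-frame UGA codon, whether it encodes Sec or terminates translation. The sketch below is not the deep-Sep implementation; it only illustrates how candidate codons and their sequence context might be enumerated before any classifier sees them. The function name `inframe_tga_candidates`, the `Candidate` record, and the default flank of 30 nucleotides are all assumptions made for illustration and do not come from the paper.

```python
from typing import Iterator, NamedTuple


class Candidate(NamedTuple):
    position: int  # 0-based index of the first base of the TGA codon
    window: str    # the codon plus up to `flank` bases on each side


def inframe_tga_candidates(seq: str, start: int = 0, flank: int = 30) -> Iterator[Candidate]:
    """Yield every in-frame TGA (UGA) codon read from `start`.

    Reading stops at the first in-frame TAA or TAG, which are treated as
    unambiguous stops. TGA codons are never treated as terminators here,
    because deciding that is the job of a downstream classifier.
    The window size is a hypothetical choice, not a value from the paper.
    """
    dna = seq.upper().replace("U", "T")  # accept RNA or DNA input
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in ("TAA", "TAG"):
            break
        if codon == "TGA":
            yield Candidate(i, dna[max(0, i - flank): i + 3 + flank])


if __name__ == "__main__":
    example = "ATGGCTGCTTGAGCTGCTTAATGA"
    for cand in inframe_tga_candidates(example, flank=3):
        print(cand.position, cand.window)
```

Each yielded window would then be scored by a model. Per the abstract, deep-Sep uses a Transformer-based network for that step and a homology search to remove remaining false positives, and neither is reproduced here.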

Source journal
mSystems (Biochemistry, Genetics and Molecular Biology: Biochemistry)
CiteScore: 10.50
Self-citation rate: 3.10%
Articles published: 308
Review time: 13 weeks
Journal description: mSystems™ will publish preeminent work that stems from applying technologies for high-throughput analyses to achieve insights into the metabolic and regulatory systems at the scale of both the single cell and microbial communities. The scope of mSystems™ encompasses all important biological and biochemical findings drawn from analyses of large data sets, as well as new computational approaches for deriving these insights. mSystems™ will welcome submissions from researchers who focus on the microbiome, genomics, metagenomics, transcriptomics, metabolomics, proteomics, glycomics, bioinformatics, and computational microbiology. mSystems™ will provide streamlined decisions, while carrying on ASM's tradition of rigorous peer review.