msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths.

IF 4.4, Zone 1 (Biology), Q1 BIOLOGY, BMC Biology. Pub Date: 2024-05-30. DOI: 10.1186/s12915-024-01923-z
Yazi Li, Xiaoman Wei, Qinglin Yang, An Xiong, Xingfeng Li, Quan Zou, Feifei Cui, Zilong Zhang
Citations: 0

Abstract

Background: A promoter is a specific DNA sequence with transcriptional regulatory functions that plays a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means of identifying promoters, offering a more efficient alternative to labor-intensive biological approaches.

Results: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used with the DNABERT model for promoter identification and strength prediction. Our model achieves accuracies of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability.
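
As a rough illustration of the soft-voting step described in the Results, the Python sketch below averages class probabilities from several models, one per tokenization scale. It assumes the scales are overlapping k-mer tokenizations with k = 3 to 6 (the k-mer sizes DNABERT was released with). The abstract does not state the scales, the voting weights, or the model outputs, so the function names and the probability values are hypothetical and not the authors' implementation.

```python
from typing import List, Sequence


def kmer_tokens(seq: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def soft_vote(prob_rows: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted soft voting: average per-class probabilities across models."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]


def predict_label(prob_rows: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = soft_vote(prob_rows)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical [non-promoter, promoter] probabilities from four fine-tuned
# models, keyed by the assumed k-mer size. These numbers are made up.
scale_probs = {
    3: [0.40, 0.60],
    4: [0.55, 0.45],
    5: [0.30, 0.70],
    6: [0.35, 0.65],
}

tokens = kmer_tokens("ACGTAC", 3)            # one "scale" of the same sequence
fused = soft_vote(list(scale_probs.values()))
label = predict_label(list(scale_probs.values()))
print(tokens, fused, label)
```

In this toy input the k = 4 model alone would vote "non-promoter", but the averaged probabilities favor "promoter". The same averaging could be applied again in a second stage over strong/weak probabilities for sequences that pass the first stage.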

Conclusions: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.

Source journal: BMC Biology (Biology)
CiteScore: 7.80
Self-citation rate: 1.90%
Articles published: 260
Review time: 3 months
About the journal: BMC Biology is a broad-scope journal covering all areas of biology. Its content includes research articles, new methods, and tools. BMC Biology also publishes reviews, Q&A, and commentaries.