Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision.

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Bioinformatics advances Pub Date : 2024-08-12 eCollection Date: 2024-01-01 DOI:10.1093/bioadv/vbae117
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal
{"title":"Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision.","authors":"Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal","doi":"10.1093/bioadv/vbae117","DOIUrl":null,"url":null,"abstract":"<p><strong>Summary: </strong>This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promotors, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.</p><p><strong>Availability and implementation: </strong>The source code used to develop and fine-tune the foundation model has been released on Github (https://github.itap.purdue.edu/Clan-labs/ENBED).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11341122/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbae117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Summary: This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promotors, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.

Availability and implementation: The source code used to develop and fine-tune the foundation model has been released on Github (https://github.itap.purdue.edu/Clan-labs/ENBED).

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用具有字节级精度的编码器-解码器基础模型理解 DNA 的自然语言。
摘要:本文介绍了组合核苷酸字节级编码器-解码器(ENBED)基础模型,利用编码器-解码器变换器架构分析字节级精度的 DNA 序列。ENBED利用注意力的亚二次方实现,开发出一种能够进行序列到序列转换的高效模型,从而推广了以往仅使用编码器或仅使用解码器架构的基因组模型。我们使用掩码语言建模技术(Masked Language Modeling),利用参考基因组序列对基础模型进行预训练,并将其应用于以下下游任务:(i) 识别增强子、启动子和剪接位点;(ii) 识别包含碱基调用错配和插入/删除错误的序列,这比涉及多个碱基对的标记化方案更有优势,因为后者失去了以字节级精度进行分析的能力;(iii) 识别基因组序列的生物功能注释;(iv) 使用编码器-解码器架构生成流感病毒的突变,并根据真实世界的观察结果对其进行验证。与现有的最先进成果相比,我们在上述每项任务中都取得了显著进步:用于开发和微调基础模型的源代码已在 Github 上发布(https://github.itap.purdue.edu/Clan-labs/ENBED)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
1.60
自引率
0.00%
发文量
0
期刊最新文献
motifbreakR v2: expanded variant analysis including indels and integrated evidence from transcription factor binding databases. TransAnnot-a fast transcriptome annotation pipeline. PatchProt: hydrophobic patch prediction using protein foundation models. Accelerating protein-protein interaction screens with reduced AlphaFold-Multimer sampling. CAPTVRED: an automated pipeline for viral tracking and discovery from capture-based metagenomics samples.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1