Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

ArXiv Pub Date: 2024-08-22
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal
Journal: ArXiv
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10896356/pdf/
Citations: 0

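Abstract

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to obtain an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that use encoder-only or decoder-only architectures. We pre-train the foundation model with Masked Language Modeling on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters and splice sites; (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes that group multiple base pairs into one token and therefore lose the ability to analyze sequences with byte-level precision; (3) identification of biological function annotations of genomic sequences; and (4) generation of mutations of the influenza virus using the encoder-decoder architecture, validated against real-world observations. In each of these tasks, we demonstrate significant improvement over the existing state-of-the-art results.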
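The contrast drawn in task (2) between byte-level tokens and multi-base-pair tokens can be illustrated with a small sketch. This is not ENBED's tokenizer: the helper names (`byte_tokens`, `kmer_tokens`, `unmatched_tokens`), the toy sequences, and the choice of non-overlapping 3-mers are all assumptions made for illustration. The sketch aligns the tokens of a reference sequence against an edited copy and counts how many tokens of the edited copy have no match in the reference.

```python
from difflib import SequenceMatcher


def byte_tokens(seq: str) -> list[int]:
    """One token per nucleotide: the ASCII byte value of each base."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a shorter trailing chunk is kept as its own token."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def unmatched_tokens(ref: list, alt: list) -> int:
    """Number of tokens in `alt` that fall outside every block aligned with `ref`."""
    matcher = SequenceMatcher(None, ref, alt, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(alt) - matched


# Hypothetical toy sequences, chosen only to make the effect visible.
reference = "ACGTACGTACGT"
with_mismatch = "ACGTACCTACGT"    # one base substituted (G -> C at index 6)
with_insertion = "ACGTTACGTACGT"  # one base inserted (T at index 4)

results = {}
for label, edited in [("mismatch", with_mismatch), ("insertion", with_insertion)]:
    results[label] = {
        "byte": unmatched_tokens(byte_tokens(reference), byte_tokens(edited)),
        "3-mer": unmatched_tokens(kmer_tokens(reference), kmer_tokens(edited)),
    }
    print(label, results[label])
```

Under these assumptions, a single inserted base leaves exactly one unmatched byte-level token, whereas it shifts the reading frame of the non-overlapping 3-mers so that several downstream tokens no longer match the reference. Overlapping k-mers or learned subword vocabularies would behave differently in detail, so the counts here should be read as an illustration of the general effect rather than a measurement of any particular model.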