{"title":"利用具有字节级精度的编码器-解码器基础模型理解 DNA 的自然语言。","authors":"Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal","doi":"10.1093/bioadv/vbae117","DOIUrl":null,"url":null,"abstract":"<p><strong>Summary: </strong>This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promotors, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.</p><p><strong>Availability and implementation: </strong>The source code used to develop and fine-tune the foundation model has been released on Github (https://github.itap.purdue.edu/Clan-labs/ENBED).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11341122/pdf/","citationCount":"0","resultStr":"{\"title\":\"Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision.\",\"authors\":\"Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal\",\"doi\":\"10.1093/bioadv/vbae117\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Summary: </strong>This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promotors, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. 
In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.</p><p><strong>Availability and implementation: </strong>The source code used to develop and fine-tune the foundation model has been released on Github (https://github.itap.purdue.edu/Clan-labs/ENBED).</p>\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-08-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11341122/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbae117\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbae117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
摘要
摘要:本文介绍了组合核苷酸字节级编码器-解码器(ENBED)基础模型,利用编码器-解码器变换器架构分析字节级精度的 DNA 序列。ENBED利用注意力的亚二次方实现,开发出一种能够进行序列到序列转换的高效模型,从而推广了以往仅使用编码器或仅使用解码器架构的基因组模型。我们使用掩码语言建模技术(Masked Language Modeling),利用参考基因组序列对基础模型进行预训练,并将其应用于以下下游任务:(i) 识别增强子、启动子和剪接位点;(ii) 识别包含碱基调用错配和插入/删除错误的序列,这比涉及多个碱基对的标记化方案更有优势,因为后者失去了以字节级精度进行分析的能力;(iii) 识别基因组序列的生物功能注释;(iv) 使用编码器-解码器架构生成流感病毒的突变,并根据真实世界的观察结果对其进行验证。与现有的最先进成果相比,我们在上述每项任务中都取得了显著进步:用于开发和微调基础模型的源代码已在 Github 上发布(https://github.itap.purdue.edu/Clan-labs/ENBED)。
Summary: This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to build an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that used encoder-only or decoder-only architectures. We pretrain the foundation model with Masked Language Modeling on reference genome sequences and apply it to the following downstream tasks: (i) identification of enhancers, promoters, and splice sites; (ii) recognition of sequences containing base-call mismatches and insertion/deletion errors, an advantage over multi-base-pair tokenization schemes, which lose the ability to analyze sequences at byte-level precision; (iii) identification of biological function annotations of genomic sequences; and (iv) generation of influenza virus mutations with the encoder-decoder architecture, validated against real-world observations. On each of these tasks we demonstrate significant improvements over existing state-of-the-art results.
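Because ENBED operates directly on bytes, tokenization reduces to mapping each nucleotide character to a single integer ID. The sketch below illustrates the idea; the special-token layout and ID offset are illustrative assumptions (a ByT5-style layout), not the model's actual vocabulary.

```python
# Minimal byte-level tokenizer sketch for DNA strings.
# Assumption: IDs 0 and 1 are reserved for <pad> and <eos>, and raw
# byte values are shifted past the reserved range.

SPECIAL = {"<pad>": 0, "<eos>": 1}
OFFSET = len(SPECIAL)

def encode(seq: str) -> list[int]:
    """Map each character (A, C, G, T, N, ...) to one byte-level ID."""
    return [b + OFFSET for b in seq.encode("utf-8")] + [SPECIAL["<eos>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special tokens."""
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8")

assert decode(encode("ACGTN")) == "ACGTN"
print(encode("ACGT"))  # [67, 69, 73, 86, 1]
```

Since every base maps to exactly one token, a single-base mismatch or insertion/deletion perturbs exactly one position in the token stream, which is what makes the error-recognition task in (ii) tractable at byte-level precision.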
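The pretraining objective is Masked Language Modeling over reference genome sequences. Below is a minimal data-side sketch assuming BERT-style random masking with a reserved mask ID and the uncorrupted sequence as the decoder's reconstruction target; the paper's exact masking rate and corruption scheme may differ.

```python
import random

MASK_ID = 2  # hypothetical mask-token ID, reserved alongside <pad>/<eos>

def mask_for_mlm(token_ids: list[int], mask_prob: float = 0.15, seed=None):
    """Corrupt a byte-level token sequence for MLM pretraining.

    Returns (corrupted_input, target): the encoder sees the corrupted
    sequence and the decoder is trained to reconstruct the original one.
    """
    rng = random.Random(seed)
    corrupted = [MASK_ID if rng.random() < mask_prob else t for t in token_ids]
    return corrupted, list(token_ids)

# Raw byte values stand in for token IDs here, purely for illustration.
inp, tgt = mask_for_mlm(list(b"ACGTACGTACGTACGT"), seed=0)
print(inp, tgt)
```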
Availability and implementation: The source code used to develop and fine-tune the foundation model has been released on GitHub (https://github.itap.purdue.edu/Clan-labs/ENBED).
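For the mutation-generation task (iv), the encoder-decoder architecture frames the problem as conditional sequence-to-sequence generation: condition on a parent strain and sample a candidate child sequence. The sketch below assumes, purely for illustration, that a fine-tuned checkpoint is exposed through the Hugging Face transformers seq2seq interface; the released repository's actual API and checkpoint paths may differ.

```python
# Hypothetical usage sketch; "path/to/enbed-influenza" is a placeholder,
# not a real checkpoint. The repository's actual interface may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/enbed-influenza")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/enbed-influenza")
model.eval()

parent = "ATGAAGGCAATACTAGTAGTTCTGCTATATACA"  # illustrative parent-strain fragment
inputs = tokenizer(parent, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # candidate child sequence
```

Sampled candidates would then be compared against observed strain lineages, mirroring the real-world validation described in the abstract.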