Embed-Search-Align: DNA sequence alignment using Transformer models.

Pavan Holur, K C Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S Bouchard, Matteo Pellegrini, Vwani Roychowdhury
Bioinformatics (Oxford, England), published 2025-03-04. DOI: 10.1093/bioinformatics/btaf041 (https://doi.org/10.1093/bioinformatics/btaf041). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11919449/pdf/

Abstract

Motivation: DNA sequence alignment, an important genomic task, involves assigning short DNA reads to the most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results in tasks involving classification of short DNA sequences. Performance at sequence classification tasks does not, however, guarantee sequence alignment, where it is necessary to conduct a genome-wide search to align every read successfully, a significantly longer-range task by comparison.
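
The conventional two-step scheme described above (index the genome, then search the index for each read) can be illustrated with a toy k-mer hash index. This is only a minimal sketch under simplifying assumptions: production aligners such as Bowtie and BWA-Mem use far more compact index structures, and the function names `build_kmer_index` and `locate_read`, the tiny reference string, and the choice k=4 are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Step 1 (indexing): map every k-mer to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def locate_read(read: str, index: Dict[str, List[int]], k: int) -> Optional[int]:
    """Step 2 (search): each k-mer of the read votes for the start it implies."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    if not votes:
        return None  # no k-mer of the read occurs in the reference
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): a 27-base reference and two 10-base reads.
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
index = build_kmer_index(reference, k=4)

read = "AGCTTAGGCT"        # exact copy of reference[8:18]
noisy_read = "AGCTTAGGAT"  # same read with one substitution

best = locate_read(read, index, k=4)
best_noisy = locate_read(noisy_read, index, k=4)
print("exact read maps to", best)
print("noisy read maps to", best_noisy)
```

Voting over several k-mers is what lets the noisy read still find its position: the substitution removes only the k-mers that overlap it, while the remaining k-mers agree on the same start.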

Results: We bridge this gap by developing an "Embed-Search-Align" (ESA) framework, in which a novel Reference-Free DNA Embedding (RDE) Transformer model generates vector embeddings of reads and of reference fragments in a shared vector space; the read-fragment distance is then used as a surrogate for sequence similarity. ESA introduces: (i) a contrastive loss for self-supervised training of DNA sequence representations, facilitating rich, reference-free, sequence-level embeddings, and (ii) a DNA vector store that enables search across fragments on a global scale. RDE is 99% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid), rivaling conventional algorithmic sequence alignment methods such as Bowtie and BWA-Mem. RDE far exceeds the performance of six recent DNA-Transformer model baselines, such as Nucleotide Transformer and Hyena-DNA, and shows task transfer across chromosomes and species.
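
The embed-then-search idea and the role of a contrastive objective can be sketched in a few lines of plain Python. Everything below is an illustrative assumption rather than the authors' implementation: `embed` is a hand-made k-mer count vector standing in for the RDE Transformer, the "vector store" is a brute-force list rather than a scalable index, and `info_nce` is a generic InfoNCE-style loss that may differ from the loss used in the paper. The names, fragment length, stride, and temperature are all hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

K = 3  # k-mer size for the stand-in encoder (assumption)
KMER_TO_DIM = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def embed(seq: str) -> List[float]:
    """Stand-in encoder: L2-normalised k-mer counts (NOT the RDE Transformer)."""
    vec = [0.0] * len(KMER_TO_DIM)
    for i in range(len(seq) - K + 1):
        dim = KMER_TO_DIM.get(seq[i:i + K])
        if dim is not None:
            vec[dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_store(reference: str, frag_len: int, stride: int) -> List[Tuple[int, List[float]]]:
    """Embed overlapping reference fragments and keep (start, vector) pairs."""
    starts = range(0, len(reference) - frag_len + 1, stride)
    return [(s, embed(reference[s:s + frag_len])) for s in starts]


def search(read: str, store: List[Tuple[int, List[float]]], top_k: int = 1) -> List[Tuple[float, int]]:
    """Rank fragments by cosine similarity to the read; returns (score, start)."""
    query = embed(read)
    scored = [(dot(query, vec), start) for start, vec in store]
    return sorted(scored, reverse=True)[:top_k]


def info_nce(anchors: List[List[float]], positives: List[List[float]],
             temperature: float = 0.1) -> float:
    """Generic InfoNCE loss: positives[i] is the positive for anchors[i];
    every other row in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy data (hypothetical): a 67-base reference and a read cut from it.
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGGATCCGTACGATTACAGGCTTAACCGGTTAGCATCG"
store = build_store(reference, frag_len=24, stride=8)
read = reference[16:32]
for score, start in search(read, store, top_k=3):
    print(f"fragment at {start}: cosine similarity {score:.3f}")

# Contrastive pairs: a fragment and a shorter sub-sequence cut from it.
anchors = [embed(reference[s:s + 24]) for s in (0, 24)]
positives = [embed(reference[s + 4:s + 20]) for s in (0, 24)]
print("InfoNCE on matched pairs:", info_nce(anchors, positives))
```

In this sketch, the candidate fragments returned by `search` would still need a final fine-grained alignment step to pin down the exact read position, which corresponds to the "Align" stage in the framework's name. The loss rewards encoders that place a read close to the fragment it was cut from and far from the other fragments in the batch, which is the property the search step relies on.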

Availability and implementation: Please see https://anonymous.4open.science/r/dna2vec-7E4E/readme.md.


Supplementary information: Please see the attachment.