Nucleotide-resolution DNA foundation models of prokaryotic genomes

IF 31.7 | Zone 1, Biology | Q1 GENETICS & HEREDITY | Nature Genetics | Pub Date: 2025-01-15 | DOI: 10.1038/s41588-024-02062-5

Michael Fletcher



Abstract

Machine learning models have recently generated excitement for their potential in a broad range of domain applications, including genomics. However, owing to their complexity, they are prohibitively expensive to train for the large genomic contexts of DNA language models, resulting in limited receptive fields and/or n-mer sequence tokenization. Nguyen et al. present a step forward for the field with Evo, a foundation model that applies the efficient, hybrid StripedHyena architecture trained on 80,000 prokaryotic and millions of phage and plasmid sequences at single-nucleotide resolution. In benchmarking, Evo shows equivalent or improved performance against state-of-the-art nucleotide and language models for variant fitness, promoter activity and protein expression prediction. Impressively, Evo can be used to generate novel, experimentally validated CRISPR–Cas and transposon systems and predict gene essentiality by premature stop codon insertion; it also shows some promise for generating synthetic whole genomes. Foundation DNA language models that are applicable to many tasks would be of broad utility and Evo underlines their promise. However, it must be noted that the training datasets are small compared to the genomes of eukaryotes, and the still-limited 131-kb context and next-token prediction will need to be further adapted for the increased complexity of multicellular life, showing there is still much to do.

Original reference: Science 386, eado9336 (2024)
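The abstract contrasts single-nucleotide resolution with n-mer sequence tokenization. The sketch below illustrates that contrast on a toy sequence; it is not Evo's tokenizer, and the function names (`tokenize_nucleotides`, `tokenize_kmers`, `vocab_size`) and the example sequence are hypothetical, chosen only for illustration. It assumes a plain A/C/G/T alphabet and non-overlapping k-mers.

```python
# Illustrative only: toy tokenizers, not the tokenizer used by Evo.

def tokenize_nucleotides(seq: str) -> list[str]:
    """One token per base (single-nucleotide resolution)."""
    return list(seq.upper())


def tokenize_kmers(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a trailing chunk shorter than k is kept as-is."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def vocab_size(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


seq = "ATGGCGTAA"  # hypothetical 9-base example
single = tokenize_nucleotides(seq)
kmers = tokenize_kmers(seq, k=3)

# Fewer tokens per sequence with k-mers, at the cost of a larger vocabulary.
print(len(single), vocab_size(1))
print(len(kmers), vocab_size(3))
```

The trade-off this shows is general: k-mer tokens shorten the input by a factor of about k but enlarge the vocabulary to 4^k, and a single-base change swaps an entire k-mer token rather than one base token.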


