Nucleotide-resolution DNA foundation models of prokaryotic genomes
Michael Fletcher
Nature Genetics, published 2025-01-15
DOI: https://doi.org/10.1038/s41588-024-02062-5
Original reference: Science 386, eado9336 (2024)
Citations: 0
Abstract
Machine learning models have recently generated excitement for their potential in a broad range of domain applications, including genomics. However, owing to their complexity, they are prohibitively expensive to train for the large genomic contexts of DNA language models, resulting in limited receptive fields and/or n-mer sequence tokenization. Nguyen et al. present a step forward for the field with Evo, a foundation model that applies the efficient, hybrid StripedHyena architecture trained on 80,000 prokaryotic and millions of phage and plasmid sequences at single-nucleotide resolution. In benchmarking, Evo shows equivalent or improved performance against state-of-the-art nucleotide and language models for variant fitness, promoter activity and protein expression prediction. Impressively, Evo can be used to generate novel, experimentally validated CRISPR–Cas and transposon systems and predict gene essentiality by premature stop codon insertion; it also shows some promise for generating synthetic whole genomes. Foundation DNA language models that are applicable to many tasks would be of broad utility and Evo underlines their promise. However, it must be noted that the training datasets are small compared to the genomes of eukaryotes, and the still-limited 131-kb context and next-token prediction will need to be further adapted for the increased complexity of multicellular life, showing there is still much to do.
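To make the contrast between single-nucleotide and n-mer tokenization concrete, here is a minimal Python sketch. It is not Evo's code: the function names, the toy sequence and the choice of `TAA` are illustrative assumptions. The last helper shows one simple way to place an in-frame premature stop codon in silico (here by substituting an existing codon, which is an assumption about the procedure).

```python
def tokenize_nucleotides(seq):
    """One token per base (single-nucleotide resolution)."""
    return list(seq.upper())


def tokenize_nmers(seq, n=3):
    """Non-overlapping n-mer tokens; a trailing partial n-mer is kept as-is."""
    seq = seq.upper()
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def insert_premature_stop(cds, codon_index, stop="TAA"):
    """Replace the codon at codon_index (0-based) with a stop codon.

    Illustrative assumption: the coding sequence starts in frame at position 0.
    """
    cds = cds.upper()
    start = codon_index * 3
    if start < 0 or start + 3 > len(cds):
        raise ValueError("codon_index is outside the coding sequence")
    return cds[:start] + stop + cds[start + 3:]


toy_cds = "ATGGCTAAAGGTTGA"  # hypothetical 5-codon sequence
single_tokens = tokenize_nucleotides(toy_cds)   # 15 tokens, one per base
trimer_tokens = tokenize_nmers(toy_cds, n=3)    # 5 tokens, one per codon
mutant_cds = insert_premature_stop(toy_cds, codon_index=2)
```

The same 15-base sequence yields three times as many tokens at single-nucleotide resolution as with 3-mers, which is why context length becomes the limiting factor. How a model would score the mutated sequence is outside this sketch.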
About the journal:
Nature Genetics publishes the very highest quality research in genetics. It encompasses genetic and functional genomic studies on human and plant traits and on other model organisms. Current emphasis is on the genetic basis for common and complex diseases and on the functional mechanism, architecture and evolution of gene networks, studied by experimental perturbation.
Integrative genetic topics comprise, but are not limited to:
-Genes in the pathology of human disease
-Molecular analysis of simple and complex genetic traits
-Cancer genetics
-Agricultural genomics
-Developmental genetics
-Regulatory variation in gene expression
-Strategies and technologies for extracting function from genomic data
-Pharmacological genomics
-Genome evolution