DNA language model GROVER learns sequence context in the human genome

Nature Machine Intelligence. Impact factor: 23.9. Zone 1 (Computer Science). JCR Q1 (Computer Science, Artificial Intelligence). Published: 2024-07-23. DOI: 10.1038/s42256-024-00872-0

Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert, Anna R. Poetsch

Nature Machine Intelligence, volume 6, issue 8, pages 911-923. Journal article. Full text: https://www.nature.com/articles/s42256-024-00872-0


Abstract

Deep-learning models that learn a sense of language on DNA have achieved a high level of performance on genome biological tasks. Genome sequences follow rules similar to natural language but are distinct in the absence of a concept of words. We established byte-pair encoding on the human genome and trained a foundation language model called GROVER (Genome Rules Obtained Via Extracted Representations), with the vocabulary selected via a custom task, next-k-mer prediction. The resulting dictionary of tokens in the human genome best carries the information content for GROVER. Analysing learned representations, we observed that trained token embeddings primarily encode information related to frequency, sequence content and length. Some tokens are primarily localized in repeats, whereas the majority are widely distributed over the genome. GROVER also learns context and lexical ambiguity. Average trained embeddings of genomic regions relate to functional genomics annotation and thus indicate learning of these structures purely from the contextual relationships of tokens. This highlights the extent of information content encoded by the sequence that can be grasped by GROVER. On fine-tuning tasks addressing genome biology with questions of genome element identification and protein–DNA binding, GROVER exceeds other models’ performance. GROVER learns sequence context, a sense for structure and language rules. Extracting this knowledge can be used to compose a grammar book for the code of life.

Genomes can be modelled with language approaches by treating nucleotide bases A, C, G and T like text, but there is no natural concept of what the words would be and whether there is even a ‘language’ to be learned this way. Sanabria et al. have developed a language model called GROVER that learns with a ‘vocabulary’ of genome sequences built with byte-pair encoding, a method from text compression, and shows good performance on genome biological tasks.
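To make the byte-pair encoding step mentioned in the abstract concrete, here is a minimal sketch of generic byte-pair encoding applied to a DNA string. It is not the authors' implementation: the function names (`train_bpe`, `merge_pair`, `tokenize`), the tie-breaking rule and the toy sequence are assumptions made for illustration, and the selection of the vocabulary size via next-k-mer prediction is not shown.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequence, num_merges):
    """Learn up to `num_merges` merge rules, starting from single bases.

    Illustrative only: adjacent pairs are counted with overlap, ties are
    broken by lexicographic order of the pair (an arbitrary assumption),
    and training stops once no pair occurs at least twice.
    """
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break
        merges.append(best_pair)
        tokens = merge_pair(tokens, best_pair)
    return merges, tokens


def tokenize(sequence, merges):
    """Apply learned merge rules, in training order, to a new sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```

On the toy input `"ATATATGC"`, `train_bpe("ATATATGC", 10)` first merges `A` and `T` into `AT`, then `AT` and `AT` into `ATAT`, and then stops because no remaining pair repeats. Each merge adds one variable-length token to the vocabulary, which is how byte-pair encoding supplies "words" for a sequence that has no natural word boundaries.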




Source journal

Nature Machine Intelligence Multiple-

CiteScore: 36.90

Self-citation rate: 2.10%

Articles published: 127

About the journal: Nature Machine Intelligence is a distinguished publication that presents original research and reviews on various topics in machine learning, robotics, and AI. Our focus extends beyond these fields, exploring their profound impact on other scientific disciplines, as well as societal and industrial aspects. We recognize limitless possibilities wherein machine intelligence can augment human capabilities and knowledge in domains like scientific exploration, healthcare, medical diagnostics, and the creation of safe and sustainable cities, transportation, and agriculture. Simultaneously, we acknowledge the emergence of ethical, social, and legal concerns due to the rapid pace of advancements. To foster interdisciplinary discussions on these far-reaching implications, Nature Machine Intelligence serves as a platform for dialogue facilitated through Comments, News Features, News & Views articles, and Correspondence. Our goal is to encourage a comprehensive examination of these subjects. Similar to all Nature-branded journals, Nature Machine Intelligence operates under the guidance of a team of skilled editors. We adhere to a fair and rigorous peer-review process, ensuring high standards of copy-editing and production, swift publication, and editorial independence.