Gonzalo Benegas, Chengzhong Ye, Carlos Albors, Jianan Canal Li, Yun S Song
Genomic language models: opportunities and challenges
Trends in Genetics, Journal Article, published 2025-01-02
DOI: 10.1016/j.tig.2024.11.013 (https://doi.org/10.1016/j.tig.2024.11.013)
Citation count: 0
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
About the journal:
Launched in 1985, Trends in Genetics swiftly established itself as a "must-read" for geneticists, offering concise, accessible articles covering a spectrum of topics from developmental biology to evolution. This reputation endures, making TiG a cherished resource in the genetic research community. While evolving with the field, the journal now embraces new areas like genomics, epigenetics, and computational genetics, alongside its continued coverage of traditional subjects such as transcriptional regulation, population genetics, and chromosome biology.
Despite expanding its scope, the core objective of TiG remains steadfast: to furnish researchers and students with high-quality, innovative reviews, commentaries, and discussions, fostering an appreciation for advances in genetic research. Each issue of TiG presents lively and up-to-date Reviews and Opinions, alongside shorter articles like Science & Society and Spotlight pieces. Invited from leading researchers, Reviews objectively chronicle recent developments, Opinions provide a forum for debate and hypothesis, and shorter articles explore the intersection of genetics with science and policy, as well as emerging ideas in the field. All articles undergo rigorous peer review.