Deciphering genomic codes using advanced NLP techniques: a scoping review.

ArXiv. Pub Date: 2024-11-25
Shuyan Cheng, Yishu Wei, Yiliang Zhou, Zihan Xu, Drew N Wright, Jinze Liu, Yifan Peng
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623714/pdf/
Citations: 0

Abstract

Objectives: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review investigates the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. It also assesses data and model accessibility in the most recent literature, to better understand the existing capabilities and constraints of these tools in processing genomic sequencing data.

Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

Results: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations such as transcription-factor binding sites and chromatin accessibility.

Discussion: The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while providing a better understanding of its complex structures. It can potentially drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is needed to address the limitations of these tools and to improve model transparency and applicability.
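
Illustrative sketch (not taken from the reviewed studies): the abstract names tokenization as a core step but does not say which schemes the 26 studies use. One scheme commonly seen in genomic language models is overlapping k-mer tokenization, where a DNA sequence is split into fixed-length substrings that are then mapped to integer IDs for a transformer. The function names `kmer_tokenize`, `build_vocab`, and `encode`, the `[UNK]` token, and the choice of k = 3 below are assumptions made for illustration only.

```python
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct k-mer to an integer ID; ID 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for token in sorted(set(tokens)):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Convert k-mer tokens to IDs, falling back to [UNK] for unseen k-mers."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


if __name__ == "__main__":
    tokens = kmer_tokenize("ACGTAC", k=3)
    vocab = build_vocab(tokens)
    print(tokens)
    print(encode(tokens, vocab))
```

The resulting ID sequence is what a transformer's embedding layer would consume. Choosing a larger k grows the vocabulary (up to 4^k for the four standard bases) while shortening the token sequence when combined with a larger stride.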
