mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
arXiv - QuanBio - Quantitative Methods. Published 2024-08-16. DOI: arxiv-2408.09048 (https://doi.org/arxiv-2408.09048)
Citation count: 0

Abstract

Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and the coding sequence (CDS) as the input sequence. We adapt our LM-based approach specifically to mRNA by 1) accounting for the importance of location on the mRNA sequence through probabilistic masking, and 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements on translation efficiency (TE) and expression level (EL) prediction tasks for the UTR compared to state-of-the-art (SOTA) methods such as UTR-LM. It also performs competitively on mRNA stability and protein production level tasks for the CDS compared to methods such as CodonBERT.
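
The two mechanisms named in the abstract, location-aware probabilistic masking and the data2vec teacher-student setup, can be illustrated with a small sketch. The abstract does not state which masking distribution is used, so the weighting below (higher masking probability near the UTR-CDS junction) is purely an assumption for illustration, as are the function names, the single-nucleotide tokenization, the toy sequence, and all parameter values. The teacher update is the exponential moving average generally used in data2vec-style training; this is not the authors' implementation.

```python
import random

MASK = "<mask>"

def position_weights(length, junction, width=10.0, floor=0.2):
    """Hypothetical weighting (an assumption, not from the paper): tokens at the
    UTR-CDS junction get weight 1.0, decaying linearly to `floor` with distance."""
    weights = []
    for i in range(length):
        closeness = max(0.0, 1.0 - abs(i - junction) / width)
        weights.append(floor + (1.0 - floor) * closeness)
    return weights

def probabilistic_mask(tokens, junction, mask_ratio=0.15, seed=None):
    """Mask about mask_ratio of the tokens, sampling positions without
    replacement with probability proportional to their position weight."""
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    weights = position_weights(len(tokens), junction)
    candidates = list(range(len(tokens)))
    chosen = set()
    for _ in range(n_mask):
        idx = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        chosen.add(idx)
        candidates.remove(idx)
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, sorted(chosen)

def ema_update(teacher, student, tau=0.999):
    """Teacher weights track the student as an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

# Toy 5'UTR and CDS strings (made up for illustration), joined into one input.
utr = "GGGAAAUAAGAGAGAAAAGAAGAGUAAGAAGAAAUAUAAGAGCCACC"
cds = "AUGGCCAAGCUGACCUAA"
tokens = list(utr + cds)  # single-nucleotide tokens: an assumed tokenization

masked, positions = probabilistic_mask(tokens, junction=len(utr), seed=0)
print("masked positions:", positions)

# One teacher step on a toy two-parameter "model".
teacher = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
print("teacher after update:", teacher)
```

In a full pipeline the student would see `masked` and regress the teacher's representations of the unmasked `tokens` at `positions`; the MFE and secondary-structure pretext tasks mentioned in the abstract would add further prediction heads and are not sketched here.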