Latent Diffusion Models for Controllable RNA Sequence Generation

Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
arXiv - QuanBio - Quantitative Methods, published 2024-09-15. DOI: arxiv-2409.09828 (https://doi.org/arxiv-2409.09828)

Abstract

This paper presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile molecule in biological processes. RNA sequences are highly variable and diverse: they differ in length, fold into flexible three-dimensional structures, and serve many functions. We use pretrained BERT-type models to encode raw RNAs into token-level, biologically meaningful representations. A Q-Former compresses these representations into a fixed-length set of latent vectors, and an autoregressive decoder is trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we train reward networks to estimate functional properties of RNA from the latent variables. We apply gradient-based guidance during the backward diffusion process, aiming to generate RNA sequences that are optimized for higher rewards. Empirical experiments confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological indicators. We fine-tune the diffusion model on untranslated regions (UTRs) of mRNA and optimize sample sequences for protein translation efficiency. Our guided diffusion model generates diverse UTR sequences with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), surpassing baselines. These results hold promise for studies on RNA sequence-function relationships, protein synthesis, and therapeutic RNA design.
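
The gradient-based guidance step mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a one-dimensional latent, replaces the trained score network with the analytic score of a standard normal, and replaces the trained reward network with a hypothetical quadratic reward whose gradient is known in closed form. The function names, the linear noise schedule, and the guidance scale are all illustrative assumptions. The only idea carried over from the abstract is that the reward gradient is added to the score at each backward diffusion step.

```python
import math
import random


def make_betas(num_steps, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    step = (beta_max - beta_min) / (num_steps - 1)
    return [beta_min + step * i for i in range(num_steps)]


def toy_score(z):
    """Score of N(0, 1); a stand-in for a trained latent score network."""
    return -z


def toy_reward_grad(z, target):
    """Gradient of the hypothetical reward r(z) = -0.5 * (z - target)**2."""
    return -(z - target)


def sample_latent(rng, betas, guidance_scale=0.0, target=2.0):
    """Run the backward diffusion process with optional reward guidance."""
    z = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(len(betas))):
        beta = betas[t]
        # Guidance: shift the score by the scaled reward gradient.
        guided = toy_score(z) + guidance_scale * toy_reward_grad(z, target)
        mean = (z + beta * guided) / math.sqrt(1.0 - beta)
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the last step
        z = mean + math.sqrt(beta) * noise
    return z


def mean_latent(num_samples, guidance_scale, seed=0):
    """Average final latent over many samples, for comparing settings."""
    rng = random.Random(seed)
    betas = make_betas(200)
    samples = [sample_latent(rng, betas, guidance_scale) for _ in range(num_samples)]
    return sum(samples) / num_samples


if __name__ == "__main__":
    print("unguided mean:", mean_latent(500, guidance_scale=0.0))
    print("guided mean:  ", mean_latent(500, guidance_scale=1.0))
```

With `guidance_scale=0.0` the samples stay centered near zero, while a positive scale pulls them toward `target`, at the cost of reduced spread. In the setting the abstract describes, the final latent vectors would then be passed to the autoregressive decoder to obtain RNA sequences; that step is omitted here.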