Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
arXiv - QuanBio - Quantitative Methods, published 2024-09-15. DOI: arxiv-2409.09828 (https://doi.org/arxiv-2409.09828)
Latent Diffusion Models for Controllable RNA Sequence Generation
This paper presents RNAdiffusion, a latent diffusion model for generating and
optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile
molecule in biological processes. RNA sequences exhibit high variability and
diversity, characterized by their variable lengths, flexible three-dimensional
structures, and diverse functions. We utilize pretrained BERT-type models to
encode raw RNAs into token-level biologically meaningful representations. A
Q-Former is employed to compress these representations into a fixed-length set
of latent vectors, with an autoregressive decoder trained to reconstruct RNA
sequences from these latent variables. We then develop a continuous diffusion
model within this latent space. To enable optimization, we train reward
networks to estimate functional properties of RNA from the latent variables. We
employ gradient-based guidance during the backward diffusion process, aiming to
generate RNA sequences that are optimized for higher rewards. Empirical
experiments confirm that RNAdiffusion generates non-coding RNAs that align with
natural distributions across various biological indicators. We fine-tune the
diffusion model on untranslated regions (UTRs) of mRNA and optimize sample
sequences for protein translation efficiencies. Our guided diffusion model
effectively generates diverse UTR sequences with high Mean Ribosome Loading
(MRL) and Translation Efficiency (TE), surpassing baselines. These results hold
promise for studies on RNA sequence-function relationships, protein synthesis,
and enhancing therapeutic RNA design.
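
The gradient-based guidance described above can be illustrated with a small, self-contained sketch. The abstract does not give the exact update rule, so everything here is an assumption made for illustration: the 4-dimensional latent, the linear noise schedule, the analytic Gaussian score standing in for the trained latent diffusion network (`toy_score`), and the linear reward standing in for a trained reward network (`toy_reward`, `toy_reward_grad`). The update is a generic classifier-guidance-style reverse step, not necessarily the one used in RNAdiffusion.

```python
import math
import random

# Hypothetical reward direction; a real reward network would be learned.
REWARD_WEIGHTS = [1.0, -0.5, 0.25, 0.0]


def make_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common choice, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def toy_score(z):
    """Score of a standard Gaussian latent prior: grad log p(z) = -z.
    Stand-in for the trained denoising network."""
    return [-v for v in z]


def toy_reward(z):
    """Linear stand-in for a reward network evaluated on a latent."""
    return sum(w * v for w, v in zip(REWARD_WEIGHTS, z))


def toy_reward_grad(z):
    """Gradient of the linear reward with respect to the latent."""
    return list(REWARD_WEIGHTS)


def guided_sample(num_steps=200, guidance_scale=0.0, rng=None):
    """Run the reverse process, adding a scaled reward gradient to the score."""
    rng = rng or random.Random(0)
    betas = make_betas(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in REWARD_WEIGHTS]  # start from pure noise
    for t in reversed(range(num_steps)):
        beta = betas[t]
        drift = [s + guidance_scale * g
                 for s, g in zip(toy_score(z), toy_reward_grad(z))]
        # Ancestral-style reverse step driven by the guided score.
        z = [(v + beta * d) / math.sqrt(1.0 - beta) for v, d in zip(z, drift)]
        if t > 0:  # no noise is added on the final step
            z = [v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in z]
    return z


def mean_reward(guidance_scale, num_samples=200, seed=0):
    """Average toy reward over latents sampled with a given guidance scale."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += toy_reward(guided_sample(guidance_scale=guidance_scale, rng=rng))
    return total / num_samples


if __name__ == "__main__":
    print("unguided mean reward:", mean_reward(0.0))
    print("guided mean reward:  ", mean_reward(1.0))
```

In this toy setting, raising `guidance_scale` shifts the sampled latents along the reward gradient, so the guided mean reward comes out higher than the unguided one. In the pipeline the abstract describes, the resulting latent would then be passed to the autoregressive decoder to produce an RNA sequence; that step is omitted here.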