Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk, Tim Mitchison, Stephani Otte, Jay Shendure, Nicholas J. Sofroniew, Fabian Theis, Christina V. Theodoris, Srigokul Upadhyayula, Marc Valer, Bo Wang, Eric Xing, Serena Yeung-Levy, Marinka Zitnik, Theofanis Karaletsos, Aviv Regev, Emma Lundberg, Jure Leskovec, Stephen R. Quake
The cell is arguably the smallest unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of AI-powered Virtual Cells, where robust representations of cells and cellular systems under different conditions are directly learned from growing biological data across measurements and scales. We discuss desired capabilities of AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using Virtual Instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions is within reach.
{"title":"How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities","authors":"Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk, Tim Mitchison, Stephani Otte, Jay Shendure, Nicholas J. Sofroniew, Fabian Theis, Christina V. Theodoris, Srigokul Upadhyayula, Marc Valer, Bo Wang, Eric Xing, Serena Yeung-Levy, Marinka Zitnik, Theofanis Karaletsos, Aviv Regev, Emma Lundberg, Jure Leskovec, Stephen R. Quake","doi":"arxiv-2409.11654","DOIUrl":"https://doi.org/arxiv-2409.11654","url":null,"abstract":"The cell is arguably the smallest unit of life and is central to\u0000understanding biology. Accurate modeling of cells is important for this\u0000understanding as well as for determining the root causes of disease. Recent\u0000advances in artificial intelligence (AI), combined with the ability to generate\u0000large-scale experimental data, present novel opportunities to model cells. Here\u0000we propose a vision of AI-powered Virtual Cells, where robust representations\u0000of cells and cellular systems under different conditions are directly learned\u0000from growing biological data across measurements and scales. We discuss desired\u0000capabilities of AI Virtual Cells, including generating universal\u0000representations of biological entities across scales, and facilitating\u0000interpretable in silico experiments to predict and understand their behavior\u0000using Virtual Instruments. We further address the challenges, opportunities and\u0000requirements to realize this vision including data needs, evaluation\u0000strategies, and community standards and engagement to ensure biological\u0000accuracy and broad utility. We envision a future where AI Virtual Cells help\u0000identify new drug targets, predict cellular responses to perturbations, as well\u0000as scale hypothesis exploration. With open science collaborations across the\u0000biomedical ecosystem that includes academia, philanthropy, and the biopharma\u0000and AI industries, a comprehensive predictive understanding of cell mechanisms\u0000and interactions is within reach.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proton pencil beam scanning (PBS) treatment planning for head and neck (H&N) cancers is a time-consuming and experience-demanding task where a large number of planning objectives are involved. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics, suffering from poor scalability and flexibility and only capable of adjusting a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of H&N cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton H&N treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for H&N cancers.
{"title":"Automating proton PBS treatment planning for head and neck cancers using policy gradient-based deep reinforcement learning","authors":"Qingqing Wang, Chang Chang","doi":"arxiv-2409.11576","DOIUrl":"https://doi.org/arxiv-2409.11576","url":null,"abstract":"Proton pencil beam scanning (PBS) treatment planning for head and neck (H&N)\u0000cancers is a time-consuming and experience-demanding task where a large number\u0000of planning objectives are involved. Deep reinforcement learning (DRL) has\u0000recently been introduced to the planning processes of intensity-modulated\u0000radiation therapy and brachytherapy for prostate, lung, and cervical cancers.\u0000However, existing approaches are built upon the Q-learning framework and\u0000weighted linear combinations of clinical metrics, suffering from poor\u0000scalability and flexibility and only capable of adjusting a limited number of\u0000planning objectives in discrete action spaces. We propose an automatic\u0000treatment planning model using the proximal policy optimization (PPO) algorithm\u0000and a dose distribution-based reward function for proton PBS treatment planning\u0000of H&N cancers. Specifically, a set of empirical rules is used to create\u0000auxiliary planning structures from target volumes and organs-at-risk (OARs),\u0000along with their associated planning objectives. These planning objectives are\u0000fed into an in-house optimization engine to generate the spot monitor unit (MU)\u0000values. A decision-making policy network trained using PPO is developed to\u0000iteratively adjust the involved planning objective parameters in a continuous\u0000action space and refine the PBS treatment plans using a novel dose\u0000distribution-based reward function. Proton H&N treatment plans generated by the\u0000model show improved OAR sparing with equal or superior target coverage when\u0000compared with human-generated plans. Moreover, additional experiments on liver\u0000cancer demonstrate that the proposed method can be successfully generalized to\u0000other treatment sites. To the best of our knowledge, this is the first\u0000DRL-based automatic treatment planning model capable of achieving human-level\u0000performance for H&N cancers.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate prediction and optimization of protein-protein binding affinity is crucial for therapeutic antibody development. Although machine learning-based $\Delta\Delta G$ prediction methods are suitable for large-scale mutant screening, they struggle to predict the effects of multiple mutations for targets without existing binders. Energy function-based methods, though more accurate, are time-consuming and not ideal for large-scale screening. To address this, we propose an active learning workflow that efficiently trains a deep learning model to learn energy functions for specific targets, combining the advantages of both approaches. Our method integrates the RDE-Network deep learning model with Rosetta's energy function-based Flex ddG to efficiently explore mutants with favorable Flex ddG scores. In a case study targeting HER2-binding Trastuzumab mutants, our approach significantly improved screening performance over random selection and demonstrated the ability to identify mutants with better binding properties without experimental $\Delta\Delta G$ data. This workflow advances computational antibody design by combining machine learning, physics-based computations, and active learning to achieve more efficient antibody development.
{"title":"Active learning for energy-based antibody optimization and enhanced screening","authors":"Kairi Furui, Masahito Ohue","doi":"arxiv-2409.10964","DOIUrl":"https://doi.org/arxiv-2409.10964","url":null,"abstract":"Accurate prediction and optimization of protein-protein binding affinity is\u0000crucial for therapeutic antibody development. Although machine learning-based\u0000prediction methods $DeltaDelta G$ are suitable for large-scale mutant\u0000screening, they struggle to predict the effects of multiple mutations for\u0000targets without existing binders. Energy function-based methods, though more\u0000accurate, are time consuming and not ideal for large-scale screening. To\u0000address this, we propose an active learning workflow that efficiently trains a\u0000deep learning model to learn energy functions for specific targets, combining\u0000the advantages of both approaches. Our method integrates the RDE-Network deep\u0000learning model with Rosetta's energy function-based Flex ddG to efficiently\u0000explore mutants that bind to Flex ddG. In a case study targeting HER2-binding\u0000Trastuzumab mutants, our approach significantly improved the screening\u0000performance over random selection and demonstrated the ability to identify\u0000mutants with better binding properties without experimental $DeltaDelta G$\u0000data. This workflow advances computational antibody design by combining machine\u0000learning, physics-based computations, and active learning to achieve more\u0000efficient antibody development.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Engineering biology requires precise control of biomolecular circuits, and Cybergenetics is the field dedicated to achieving this goal. A significant challenge in developing controllers for cellular functions is designing systems that can effectively manage molecular noise. To address this, there has been increasing effort to develop model-based controllers for stochastic biomolecular systems, where a major difficulty lies in accurately solving the chemical master equation. In this work we develop a framework for optimal and Model Predictive Control of stochastic gene regulatory networks with three key advantageous features: high computational efficiency, the capacity to control the overall probability density function enabling the fine-tuning of the cell population to obtain complex shapes and behaviors (including bimodality and other emergent properties), and the capacity to handle high levels of intrinsic molecular noise. Our method exploits an efficient approximation of the Chemical Master Equation using Partial Integro-Differential Equations, which additionally enables the development of an effective adjoint-based optimization. We illustrate the performance of the methods presented through two relevant studies in Synthetic Biology: shaping bimodal cell populations and tracking moving target distributions via inducible gene regulatory circuits.
{"title":"A computational framework for optimal and Model Predictive Control of stochastic gene regulatory networks","authors":"Hamza Faquir, Manuel Pájaro, Irene Otero-Muras","doi":"arxiv-2409.11036","DOIUrl":"https://doi.org/arxiv-2409.11036","url":null,"abstract":"Engineering biology requires precise control of biomolecular circuits, and\u0000Cybergenetics is the field dedicated to achieving this goal. A significant\u0000challenge in developing controllers for cellular functions is designing systems\u0000that can effectively manage molecular noise. To address this, there has been\u0000increasing effort to develop model-based controllers for stochastic\u0000biomolecular systems, where a major difficulty lies in accurately solving the\u0000chemical master equation. In this work we develop a framework for optimal and\u0000Model Predictive Control of stochastic gene regulatory networks with three key\u0000advantageous features: high computational efficiency, the capacity to control\u0000the overall probability density function enabling the fine-tuning of the cell\u0000population to obtain complex shapes and behaviors (including bimodality and\u0000other emergent properties), and the capacity to handle high levels of intrinsic\u0000molecular noise. Our method exploits an efficient approximation of the Chemical\u0000Master Equation using Partial Integro-Differential Equations, which\u0000additionally enables the development of an effective adjoint-based\u0000optimization. We illustrate the performance of the methods presented through\u0000two relevant studies in Synthetic Biology: shaping bimodal cell populations and\u0000tracking moving target distributions via inducible gene regulatory circuits.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Morgan B. Talbot, Omar Costilla-Reyes, Jessica M. Lipschitz
Comorbid anxiety disorders are common among patients with major depressive disorder (MDD), and numerous studies have identified an association between comorbid anxiety and resistance to pharmacological depression treatment. However, less is known regarding the effect of anxiety on non-pharmacological therapies for MDD. We apply machine learning techniques to analyze MDD treatment responses in a large-scale clinical trial (n=754), in which participants with MDD were recruited online and randomized to different smartphone-based depression treatments. We find that a baseline GAD-7 questionnaire score in the "moderate" to "severe" range (>10) predicts greatly reduced probability of responding to treatment across treatment groups. Our findings suggest that depressed individuals with comorbid anxiety face lower odds of substantial improvement in the context of smartphone-based therapeutic interventions for MDD. Our work highlights a simple methodology for identifying clinically useful "rules of thumb" in treatment response prediction using interpretable machine learning models and a forward variable selection process.
{"title":"Comorbid anxiety symptoms predict lower odds of improvement in depression symptoms during smartphone-delivered psychotherapy","authors":"Morgan B. Talbot, Omar Costilla-Reyes, Jessica M. Lipschitz","doi":"arxiv-2409.11183","DOIUrl":"https://doi.org/arxiv-2409.11183","url":null,"abstract":"Comorbid anxiety disorders are common among patients with major depressive\u0000disorder (MDD), and numerous studies have identified an association between\u0000comorbid anxiety and resistance to pharmacological depression treatment.\u0000However, less is known regarding the effect of anxiety on non-pharmacological\u0000therapies for MDD. We apply machine learning techniques to analyze MDD\u0000treatment responses in a large-scale clinical trial (n=754), in which\u0000participants with MDD were recruited online and randomized to different\u0000smartphone-based depression treatments. We find that a baseline GAD-7\u0000questionnaire score in the \"moderate\" to \"severe\" range (>10) predicts greatly\u0000reduced probability of responding to treatment across treatment groups. Our\u0000findings suggest that depressed individuals with comorbid anxiety face lower\u0000odds of substantial improvement in the context of smartphone-based therapeutic\u0000interventions for MDD. Our work highlights a simple methodology for identifying\u0000clinically useful \"rules of thumb\" in treatment response prediction using\u0000interpretable machine learning models and a forward variable selection process.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anna Gamża (The Roslin Institute, University of Edinburgh, Edinburgh, UK), Samantha Lycett (The Roslin Institute, University of Edinburgh, Edinburgh, UK), Will Harvey (The Roslin Institute, University of Edinburgh, Edinburgh, UK), Joseph Hughes (MRC-University of Glasgow Centre for Virus Research, Glasgow, UK), Sema Nickbakhsh (Public Health Scotland, Glasgow, UK), David L Robertson (MRC-University of Glasgow Centre for Virus Research, Glasgow, UK), Alison Smith Palmer (Public Health Scotland, Glasgow, UK), Anthony Wood (The Roslin Institute, University of Edinburgh, Edinburgh, UK), Rowland Kao (The Roslin Institute and School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK)
Characterising the drivers of SARS-CoV-2 circulation is crucial for understanding COVID-19, particularly given the severity of the control measures adopted during the pandemic. Whole genome sequence data augmented with demographic metadata provide the best opportunity to do this. We use Random Forest decision tree models to analyse over 4,000 SARS-CoV-2 sequences from a densely sampled, mixed urban and rural population (Tayside) in Scotland between August 2020 and July 2021, together with fine-scale geographical and socio-demographic metadata. Comparing periods in versus out of "lockdown" restrictions, we use genetic distance relationships to show that individuals from more deprived areas were more likely to become infected during lockdown but less likely to spread the infection further. As disadvantaged communities were the most affected by both COVID-19 and its restrictions, our findings have important implications for informing approaches to controlling future pandemics driven by similar respiratory infections.
{"title":"Infector characteristics exposed by spatial analysis of SARS-CoV-2 sequence and demographic data analysed at fine geographical scales","authors":"Anna GamżaThe Roslin Institute- University of Edinburgh- Edinburgh- UK, Samantha LycettThe Roslin Institute- University of Edinburgh- Edinburgh- UK, Will HarveyThe Roslin Institute- University of Edinburgh- Edinburgh- UK, Joseph HughesMRC-University of Glasgow Centre for Virus Research- Glasgow- UK, Sema NickbakhshPublic Health Scotland- Glasgow- UK, David L RobertsonMRC-University of Glasgow Centre for Virus Research- Glasgow- UK, Alison Smith PalmerPublic Health Scotland- Glasgow- UK, Anthony WoodThe Roslin Institute- University of Edinburgh- Edinburgh- UK, Rowland KaoThe Roslin Institute- University of Edinburgh- Edinburgh- UKSchool of Physics and Astronomy- University of Edinburgh- Edinburgh- UK","doi":"arxiv-2409.10436","DOIUrl":"https://doi.org/arxiv-2409.10436","url":null,"abstract":"Characterising drivers of SARS-CoV-2 circulation is crucial for understanding\u0000COVID-19 because of the severity of control measures adopted during the\u0000pandemic. Whole genome sequence data augmented with demographic metadata\u0000provides the best opportunity to do this. We use Random Forest Decision Tree\u0000models to analyse a combination of over 4000 SARS-CoV2 sequences from a densely\u0000sampled, mixed urban and rural population (Tayside) in Scotland in the period\u0000from August 2020 to July 2021, with fine scale geographical and\u0000socio-demographic metadata. Comparing periods in versus out of \"lockdown\"\u0000restrictions, we show using genetic distance relationships that individuals\u0000from more deprived areas are more likely to get infected during lockdown but\u0000less likely to spread the infection further. As disadvantaged communities were\u0000the most affected by both COVID-19 and its restrictions, our finding has\u0000important implications for informing future approaches to control future\u0000pandemics driven by similar respiratory infections.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Per- and polyfluoroalkyl substances (PFAS) are persistent environmental pollutants with known toxicity and bioaccumulation issues. Their widespread industrial use and resistance to degradation have led to global environmental contamination and significant health concerns. While a minority of PFAS have been extensively studied, the toxicity of many PFAS remains poorly understood due to limited direct toxicological data. This study advances the predictive modeling of PFAS toxicity by combining semi-supervised graph convolutional networks (GCNs) with molecular descriptors and fingerprints. We propose a novel approach to enhance the prediction of PFAS binding affinities by isolating molecular fingerprints to construct graphs in which molecular descriptors serve as the node features. This approach captures the structural, physicochemical, and topological features of PFAS without overfitting due to an abundance of features. Unsupervised clustering then identifies representative compounds for detailed binding studies. Our results enable more accurate estimation of PFAS hepatotoxicity, providing guidance for the discovery of new PFAS chemicals and the development of new safety regulations.
{"title":"Uncovering the Mechanism of Hepatotoxiciy of PFAS Targeting L-FABP Using GCN and Computational Modeling","authors":"Lucas Jividen, Tibo Duran, Xi-Zhi Niu, Jun Bai","doi":"arxiv-2409.10370","DOIUrl":"https://doi.org/arxiv-2409.10370","url":null,"abstract":"Per- and polyfluoroalkyl substances (PFAS) are persistent environmental\u0000pollutants with known toxicity and bioaccumulation issues. Their widespread\u0000industrial use and resistance to degradation have led to global environmental\u0000contamination and significant health concerns. While a minority of PFAS have\u0000been extensively studied, the toxicity of many PFAS remains poorly understood\u0000due to limited direct toxicological data. This study advances the predictive\u0000modeling of PFAS toxicity by combining semi-supervised graph convolutional\u0000networks (GCNs) with molecular descriptors and fingerprints. We propose a novel\u0000approach to enhance the prediction of PFAS binding affinities by isolating\u0000molecular fingerprints to construct graphs where then descriptors are set as\u0000the node features. This approach specifically captures the structural,\u0000physicochemical, and topological features of PFAS without overfitting due to an\u0000abundance of features. Unsupervised clustering then identifies representative\u0000compounds for detailed binding studies. Our results provide a more accurate\u0000ability to estimate PFAS hepatotoxicity to provide guidance in chemical\u0000discovery of new PFAS and the development of new safety regulations.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binghao Yan, Yunbi Nam, Lingyao Li, Rebecca A. Deek, Hongzhe Li, Siyuan Ma
Recent advances in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language models and their contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.
{"title":"Recent advances in deep learning and language models for studying the microbiome","authors":"Binghao Yan, Yunbi Nam, Lingyao Li, Rebecca A. Deek, Hongzhe Li, Siyuan Ma","doi":"arxiv-2409.10579","DOIUrl":"https://doi.org/arxiv-2409.10579","url":null,"abstract":"Recent advancements in deep learning, particularly large language models\u0000(LLMs), made a significant impact on how researchers study microbiome and\u0000metagenomics data. Microbial protein and genomic sequences, like natural\u0000languages, form a language of life, enabling the adoption of LLMs to extract\u0000useful insights from complex microbial ecologies. In this paper, we review\u0000applications of deep learning and language models in analyzing microbiome and\u0000metagenomics data. We focus on problem formulations, necessary datasets, and\u0000the integration of language modeling techniques. We provide an extensive\u0000overview of protein/genomic language modeling and their contributions to\u0000microbiome studies. We also discuss applications such as novel viromics\u0000language modeling, biosynthetic gene cluster prediction, and knowledge\u0000integration for metagenomics studies.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
This paper presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile molecule in biological processes. RNA sequences exhibit high variability and diversity, characterized by their variable lengths, flexible three-dimensional structures, and diverse functions. We utilize pretrained BERT-type models to encode raw RNAs into token-level, biologically meaningful representations. A Q-Former is employed to compress these representations into a fixed-length set of latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we train reward networks to estimate functional properties of RNA from the latent variables. We employ gradient-based guidance during the backward diffusion process, aiming to generate RNA sequences that are optimized for higher rewards. Empirical experiments confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological indicators. We further fine-tune the diffusion model on the untranslated regions (UTRs) of mRNA and optimize sampled sequences for protein translation efficiency. Our guided diffusion model effectively generates diverse UTR sequences with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), surpassing baselines. These results hold promise for studies on RNA sequence-function relationships, protein synthesis, and enhancing therapeutic RNA design.
{"title":"Latent Diffusion Models for Controllable RNA Sequence Generation","authors":"Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang","doi":"arxiv-2409.09828","DOIUrl":"https://doi.org/arxiv-2409.09828","url":null,"abstract":"This paper presents RNAdiffusion, a latent diffusion model for generating and\u0000optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile\u0000molecule in biological processes. RNA sequences exhibit high variability and\u0000diversity, characterized by their variable lengths, flexible three-dimensional\u0000structures, and diverse functions. We utilize pretrained BERT-type models to\u0000encode raw RNAs into token-level biologically meaningful representations. A\u0000Q-Former is employed to compress these representations into a fixed-length set\u0000of latent vectors, with an autoregressive decoder trained to reconstruct RNA\u0000sequences from these latent variables. We then develop a continuous diffusion\u0000model within this latent space. To enable optimization, we train reward\u0000networks to estimate functional properties of RNA from the latent variables. We\u0000employ gradient-based guidance during the backward diffusion process, aiming to\u0000generate RNA sequences that are optimized for higher rewards. Empirical\u0000experiments confirm that RNAdiffusion generates non-coding RNAs that align with\u0000natural distributions across various biological indicators. We fine-tuned the\u0000diffusion model on untranslated regions (UTRs) of mRNA and optimize sample\u0000sequences for protein translation efficiencies. Our guided diffusion model\u0000effectively generates diverse UTR sequences with high Mean Ribosome Loading\u0000(MRL) and Translation Efficiency (TE), surpassing baselines. These results hold\u0000promise for studies on RNA sequence-function relationships, protein synthesis,\u0000and enhancing therapeutic RNA design.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sudam Surasinghe, Swathi Nachiar Manivannan, Samuel V. Scarpino, Lorin Crawford, C. Brandon Ogbunugafor
Mathematical modelling has served a central role in studying how infectious disease transmission manifests at the population level. These models have demonstrated the importance of population-level factors like social network heterogeneity on structuring epidemic risk and are now routinely used in public health for decision support. One barrier to broader utility of mathematical models is that the existing canon does not readily accommodate the social determinants of health as distinct, formal drivers of transmission dynamics. Given the decades of empirical support for the organizational effect of social determinants on health burden more generally and infectious disease risk more specifically, addressing this modelling gap is of critical importance. In this study, we build on prior efforts to integrate social forces into mathematical epidemiology by introducing several new metrics, principally structural causal influence (SCI). Here, SCI leverages causal analysis to provide a measure of the relative vulnerability of subgroups within a susceptible population, which are crafted by differences in healthcare, exposure to disease, and other determinants. We develop our metrics using a general case and apply them to a specific case of public health importance: Hepatitis C virus in a population of persons who inject drugs. Our use of the SCI reveals that, under specific parameters in a multi-community model, the "less vulnerable" community may sustain a basic reproduction number below one when isolated, ensuring disease extinction. However, even minimal transmission between less and more vulnerable communities can elevate this number, leading to sustained epidemics within both communities. In summary, we reflect on our findings in light of conversations surrounding the importance of social inequalities and how their consideration can influence the study and practice of mathematical epidemiology.
{"title":"Structural causal influence (SCI) captures the forces of social inequality in models of disease dynamics","authors":"Sudam Surasinghe, Swathi Nachiar Manivannan, Samuel V. Scarpino, Lorin Crawford, C. Brandon Ogbunugafor","doi":"arxiv-2409.09096","DOIUrl":"https://doi.org/arxiv-2409.09096","url":null,"abstract":"Mathematical modelling has served a central role in studying how infectious\u0000disease transmission manifests at the population level. These models have\u0000demonstrated the importance of population-level factors like social network\u0000heterogeneity on structuring epidemic risk and are now routinely used in public\u0000health for decision support. One barrier to broader utility of mathematical\u0000models is that the existing canon does not readily accommodate the social\u0000determinants of health as distinct, formal drivers of transmission dynamics.\u0000Given the decades of empirical support for the organizational effect of social\u0000determinants on health burden more generally and infectious disease risk more\u0000specially, addressing this modelling gap is of critical importance. In this\u0000study, we build on prior efforts to integrate social forces into mathematical\u0000epidemiology by introducing several new metrics, principally structural causal\u0000influence (SCI). Here, SCI leverages causal analysis to provide a measure of\u0000the relative vulnerability of subgroups within a susceptible population, which\u0000are crafted by differences in healthcare, exposure to disease, and other\u0000determinants. We develop our metrics using a general case and apply it to\u0000specific one of public health importance: Hepatitis C virus in a population of\u0000persons who inject drugs. Our use of the SCI reveals that, under specific\u0000parameters in a multi-community model, the \"less vulnerable\" community may\u0000sustain a basic reproduction number below one when isolated, ensuring disease\u0000extinction. However, even minimal transmission between less and more vulnerable\u0000communities can elevate this number, leading to sustained epidemics within both\u0000communities. Summarizing, we reflect on our findings in light of conversations\u0000surrounding the importance of social inequalities and how their consideration\u0000can influence the study and practice of mathematical epidemiology.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}