The rapid advance of large-scale, atlas-level single-cell RNA sequencing and single-cell chromatin accessibility data provides extraordinary avenues to broad and deep insight into complex biological mechanisms. Leveraging these datasets and transferring labels from scRNA-seq to scATAC-seq will empower the exploration of single-cell omics data. However, current label transfer methods have limited performance, largely because they struggle to preserve fine-grained cell populations and to handle intrinsic or extrinsic heterogeneity between datasets. Here, we present a robust deep transfer model based on a graph convolutional network, scTGCN, which achieves versatile performance in preserving biological variation while integrating hundreds of thousands of cells in minutes with low memory consumption. We show that scTGCN is powerful for the integration of mouse atlas data and multimodal data generated from ASAP-seq and CITE-seq. Thus, scTGCN shows high label transfer accuracy and effective knowledge transfer across different modalities.
Title: Integration of unpaired single cell omics data by deep transfer graph convolutional network. Authors: Yulong Kan, Yunjing Qi, Zhongxiao Zhang, Xikeng Liang, Weihao Wang, Shuilin Jin. PLoS Computational Biology, volume 21, issue 1, e1012625. Pub Date: 2025-01-16; DOI: 10.1371/journal.pcbi.1012625. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11778791/pdf/
Pub Date: 2025-01-14; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012138
Luis L Fonseca, Lucas Böttcher, Borna Mehrad, Reinhard C Laubenbacher
This paper describes and validates an algorithm to solve optimal control problems for agent-based models (ABMs). For a given ABM and a given optimal control problem, the algorithm derives a surrogate model, typically lower-dimensional, in the form of a system of ordinary differential equations (ODEs), solves the control problem for the surrogate model, and then transfers the solution back to the original ABM. It applies to quite general ABMs and offers several options for the ODE structure, depending on what information about the ABM is to be used. There is a broad range of applications for such an algorithm, since ABMs are used widely in the life sciences, such as ecology, epidemiology, and biomedicine and healthcare, areas where optimal control is an important purpose for modeling, such as for medical digital twin technology.
Title: Optimal control of agent-based models via surrogate modeling. PLoS Computational Biology, volume 21, issue 1, e1012138. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11790234/pdf/
Pub Date: 2025-01-14; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012234
Alexander G Ginsberg, Scott F Lempka, Bo Duan, Victoria Booth, Jennifer Crodelle
Chronic pain is a widespread condition that is debilitating and expensive to manage, costing the United States alone around $600 billion in 2010. In a common symptom of chronic pain called allodynia, non-painful stimuli produce painful responses with highly variable presentations across individuals. While the specific mechanisms remain unclear, allodynia is hypothesized to be caused by the dysregulation of excitatory-inhibitory (E-I) balance in pain-processing neural circuitry in the dorsal horn of the spinal cord. In this work, we analyze biophysically-motivated subcircuit structures that represent common motifs in neural circuits in laminae I-II of the dorsal horn. These circuits are hypothesized to be part of the neural pathways that mediate two different types of allodynia: static and dynamic. We use neural firing rate models to describe the activity of populations of excitatory and inhibitory interneurons within each subcircuit. By accounting for experimentally-observed responses under healthy conditions, we specify model parameters defining populations of subcircuits that yield typical behavior under normal conditions. Then, we implement a sensitivity analysis approach to identify the mechanisms most likely to cause allodynia-producing dysregulation of the subcircuit's E-I signaling. We find that disruption of E-I balance generally occurs either due to downregulation of inhibitory signaling so that excitatory neurons are "released" from inhibitory control, or due to upregulation of excitatory neuron responses so that excitatory neurons "escape" their inhibitory control. Which of these mechanisms is most likely to occur, the subcircuit components involved in the mechanism, and the proportion of subcircuits exhibiting the mechanism can vary depending on the subcircuit structure. These results suggest specific hypotheses about diverse mechanisms that may be most likely responsible for allodynia, thus offering predictions for the high interindividual variability observed in allodynia and identifying targets for further experimental studies on the underlying mechanisms of this chronic pain symptom.
Title: Mechanisms for dysregulation of excitatory-inhibitory balance underlying allodynia in dorsal horn neural subcircuits. PLoS Computational Biology, volume 21, issue 1, e1012234. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771949/pdf/
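The "release" and "escape" mechanisms described in the abstract above can be illustrated with a toy firing-rate model. The sketch below is not the authors' model: it assumes a single hypothetical feedforward-inhibition motif (a stimulus drives one excitatory and one inhibitory population, and the inhibitory population suppresses the excitatory one), a sigmoid rate function, and made-up weights and time constants chosen only for illustration.

```python
import math


def rate(drive, gain=1.0, threshold=1.0):
    """Sigmoid firing-rate function (hypothetical choice), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gain * (drive - threshold)))


def simulate(stim, w_inh_to_exc, w_stim_to_exc=2.0, w_stim_to_inh=2.0,
             tau_exc=10.0, tau_inh=10.0, dt=0.1, t_end=500.0):
    """Integrate the two-population model with forward Euler.

    Returns the (excitatory, inhibitory) rates at t_end, which is long
    enough here (50 time constants) to be close to steady state.
    """
    exc = 0.0
    inh = 0.0
    for _ in range(int(round(t_end / dt))):
        d_exc = (-exc + rate(w_stim_to_exc * stim - w_inh_to_exc * inh)) / tau_exc
        d_inh = (-inh + rate(w_stim_to_inh * stim)) / tau_inh
        exc += dt * d_exc
        inh += dt * d_inh
    return exc, inh


# Baseline: strong inhibition keeps the excitatory response low.
exc_healthy, inh_healthy = simulate(stim=1.0, w_inh_to_exc=3.0)

# "Release": inhibitory signaling onto the excitatory population is downregulated.
exc_released, _ = simulate(stim=1.0, w_inh_to_exc=0.5)

# "Escape": the excitatory response to the stimulus is upregulated instead.
exc_escaped, _ = simulate(stim=1.0, w_inh_to_exc=3.0, w_stim_to_exc=4.0)

print(f"healthy:  {exc_healthy:.3f}")
print(f"released: {exc_released:.3f}")
print(f"escaped:  {exc_escaped:.3f}")
```

Both perturbations raise the excitatory steady-state rate for the same stimulus, which is the qualitative signature of disrupted E-I balance; a sensitivity analysis of the kind the abstract mentions would vary such parameters systematically rather than one at a time.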
Pub Date: 2025-01-13; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012683
Obaï Bin Ka'b Ali, Alexandre Vidal, Christophe Grova, Habib Benali
Astrocytes critically shape whole-brain structure and function by forming extensive gap junctional networks that intimately and actively interact with neurons. Despite their importance, existing computational models of whole-brain activity ignore the roles of astrocytes while primarily focusing on neurons. Addressing this oversight, we introduce a biophysical neural mass network model, designed to capture the dynamic interplay between astrocytes and neurons via glutamatergic and GABAergic transmission pathways. This network model proposes that neural dynamics are constrained by a two-layered structural network interconnecting both astrocytic and neuronal populations, allowing us to investigate astrocytes' modulatory influences on whole-brain activity and emerging functional connectivity patterns. By developing a simulation methodology, informed by bifurcation and multilayer network theories, we demonstrate that the dialogue between astrocytic and neuronal networks manifests over fast-slow fluctuation mechanisms as well as through phase-amplitude connectivity processes. The findings from our research represent a significant leap forward in the modeling of glial-neuronal collaboration, promising deeper insights into their collaborative roles across health and disease states.
Title: Dialogue mechanisms between astrocytic and neuronal networks: A whole-brain modelling approach. PLoS Computational Biology, volume 21, issue 1, e1012683. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730384/pdf/
Pub Date: 2025-01-13; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012143
Evan Gorstein, Rosa Aghdam, Claudia Solís-Lemus
High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm's behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
Title: HighDimMixedModels.jl: Robust high-dimensional mixed-effects models across omics data. PLoS Computational Biology, volume 21, issue 1, e1012143. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11761659/pdf/
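To make the comparison between the two penalties in the abstract above concrete, the sketch below evaluates both on a single coefficient. It uses the standard textbook form of the SCAD penalty with the conventional shape parameter a = 3.7; it is written in Python for illustration and is not taken from HighDimMixedModels.jl, whose internals the abstract does not describe. The function names are hypothetical.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty: grows linearly with |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty in its standard three-piece form (requires a > 2).

    - |beta| <= lam:          identical to LASSO
    - lam < |beta| <= a*lam:  quadratic blend that flattens the slope
    - |beta| > a*lam:         constant, so large effects are not shrunk further
    """
    if a <= 2:
        raise ValueError("SCAD requires a > 2")
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


# Compare the two penalties on small, medium, and large coefficients.
for coef in (0.5, 2.0, 10.0):
    print(coef, lasso_penalty(coef, 1.0), scad_penalty(coef, 1.0))
```

For small coefficients the two penalties coincide, while for large coefficients SCAD levels off at lam^2 (a + 1) / 2. This is the usual explanation for why SCAD tends to bias large effects less than LASSO, which is consistent with the estimation-accuracy advantage the abstract reports.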
Pub Date: 2025-01-13; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012737
Alejandro Feito, Ignacio Sanchez-Burgos, Ignacio Tejero, Eduardo Sanz, Antonio Rey, Rosana Collepardo-Guevara, Andrés R Tejedor, Jorge R Espinosa
Intracellular liquid-liquid phase separation (LLPS) of proteins and nucleic acids is a fundamental mechanism by which cells compartmentalize their components and perform essential biological functions. Molecular simulations play a crucial role in providing microscopic insights into the physicochemical processes driving this phenomenon. In this study, we systematically compare six state-of-the-art sequence-dependent residue-resolution models to evaluate their performance in reproducing the phase behaviour and material properties of condensates formed by seven variants of the low-complexity domain (LCD) of the hnRNPA1 protein (A1-LCD)-a protein implicated in the pathological liquid-to-solid transition of stress granules. Specifically, we assess the HPS, HPS-cation-π, HPS-Urry, CALVADOS2, Mpipi, and Mpipi-Recharged models in their predictions of the condensate saturation concentration, critical solution temperature, and condensate viscosity of the A1-LCD variants. Our analyses demonstrate that, among the tested models, Mpipi, Mpipi-Recharged, and CALVADOS2 provide accurate descriptions of the critical solution temperatures and saturation concentrations for the multiple A1-LCD variants tested. Regarding the prediction of material properties for condensates of A1-LCD and its variants, Mpipi-Recharged stands out as the most reliable model. Overall, this study benchmarks a range of residue-resolution coarse-grained models for the study of the thermodynamic stability and material properties of condensates and establishes a direct link between their performance and the ranking of intermolecular interactions these models consider.
Title: Benchmarking residue-resolution protein coarse-grained models for simulations of biomolecular condensates. PLoS Computational Biology, volume 21, issue 1, e1012737. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11844903/pdf/
Pub Date: 2025-01-13; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012680
Arif Badrou, Crystal A Mariano, Gustavo O Ramirez, Matthew Shankel, Nuno Rebelo, Mona Eskandari
Respiratory diseases represent a significant healthcare burden, as evidenced by the devastating impact of COVID-19. Biophysical models offer the possibility to anticipate system behavior and provide insights into physiological functions, advancements which are comparatively and notably nascent when it comes to pulmonary mechanics research. In this context, an Inverse Finite Element Analysis (IFEA) pipeline is developed to construct the first continuously ventilated three-dimensional structurally representative pulmonary model informed by both organ- and tissue-level breathing experiments from a cadaveric human lung. Here we construct a generalizable computational framework directly validated by pressure, volume, and strain measurements using a novel inflating apparatus interfaced with adapted, lung-specific, digital image correlation techniques. The parenchyma, pleura, and airways are represented with a poroelastic formulation to simulate pressure flows within the lung lobes, calibrating the model's material properties with the global pressure-volume response and local tissue deformation strains. The optimization yielded the following shear moduli: parenchyma (2.8 kPa), airways (0.2 kPa), and pleura (1.7 Pa). The proposed complex multi-material model with multi-experimental inputs was successfully developed using human lung data, and reproduced the shape of the inflating pressure-volume curve and strain distribution values associated with pulmonary deformation. This advancement marks a significant step towards creating a generalizable human lung model for broad applications across animal models, such as porcine, mouse, and rat lungs, to reproduce pathological states and improve performance investigations regarding medical therapeutics and intervention.
Title: Towards constructing a generalized structural 3D breathing human lung model based on experimental volumes, pressures, and strains. PLoS Computational Biology, volume 21, issue 1, e1012680. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729960/pdf/
Pub Date: 2025-01-10; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012702
Ismael Kherroubi Garcia, Christopher Erdmann, Sandra Gesing, Michael Barton, Lauren Cadwallader, Geerten Hengeveld, Christine R Kirkpatrick, Kathryn Knight, Carsten Lemmen, Rebecca Ringuette, Qing Zhan, Melissa Harrison, Feilim Mac Gabhann, Natalie Meyers, Cailean Osborne, Charlotte Till, Paul Brenner, Matt Buys, Min Chen, Allen Lee, Jason Papin, Yuhan Rao
Computational models are complex scientific constructs that have become essential for us to better understand the world. Many models are valuable for peers within and beyond disciplinary boundaries. However, there are no widely agreed-upon standards for sharing models. This paper suggests 10 simple rules for you to both (i) ensure you share models in a way that is at least "good enough," and (ii) enable others to lead the change towards better model-sharing practices.
Title: Ten simple rules for good model-sharing practices. PLoS Computational Biology, volume 21, issue 1, e1012702. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11723533/pdf/
Pub Date: 2025-01-10; eCollection Date: 2025-01-01; DOI: 10.1371/journal.pcbi.1012755
Ahmed Daoud, Asa Ben-Hur
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published custom model developed for this purpose.
Title: The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models. PLoS Computational Biology, volume 21, issue 1, e1012755. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11756788/pdf/
Transfer learning aims to integrate useful information from multi-source datasets to improve learning performance on target data. It can be applied effectively in genomics: when learning gene associations in a target tissue, data from other tissues can be integrated. However, heavy-tailed distributions and outliers are common in genomics data, which poses challenges to the effectiveness of current transfer learning approaches. In this paper, we study the transfer learning problem under high-dimensional linear models with t-distributed error (Trans-PtLR), which aims to improve estimation and prediction for target data by borrowing information from useful source data while remaining robust to complex data with heavy tails and outliers. In the oracle case with known transferable source datasets, we establish a transfer learning algorithm based on penalized maximum likelihood and the expectation-maximization algorithm. To avoid including non-informative sources, we propose selecting the transferable sources by cross-validation. Extensive simulation experiments and an application show that, when heavy tails and outliers are present, Trans-PtLR is more robust and achieves better estimation and prediction performance than transfer learning for linear regression models with normal error distribution. Keywords: data integration, variable selection, t distribution, expectation-maximization algorithm, Genotype-Tissue Expression, cross-validation.
{"title":"A robust transfer learning approach for high-dimensional linear regression to support integration of multi-source gene expression data.","authors":"Lulu Pan, Qian Gao, Kecheng Wei, Yongfu Yu, Guoyou Qin, Tong Wang","doi":"10.1371/journal.pcbi.1012739","DOIUrl":"10.1371/journal.pcbi.1012739","url":null,"abstract":"<p><p>Transfer learning aims to integrate useful information from multi-source datasets to improve learning performance on target data. It can be applied effectively in genomics: when learning gene associations in a target tissue, data from other tissues can be integrated. However, heavy-tailed distributions and outliers are common in genomics data, which poses challenges to the effectiveness of current transfer learning approaches. In this paper, we study the transfer learning problem under high-dimensional linear models with t-distributed error (Trans-PtLR), which aims to improve estimation and prediction for target data by borrowing information from useful source data while remaining robust to complex data with heavy tails and outliers. In the oracle case with known transferable source datasets, we establish a transfer learning algorithm based on penalized maximum likelihood and the expectation-maximization algorithm. To avoid including non-informative sources, we propose selecting the transferable sources by cross-validation. Extensive simulation experiments and an application show that, when heavy tails and outliers are present, Trans-PtLR is more robust and achieves better estimation and prediction performance than transfer learning for linear regression models with normal error distribution. 
Keywords: data integration, variable selection, t distribution, expectation-maximization algorithm, Genotype-Tissue Expression, cross-validation.</p>","PeriodicalId":20241,"journal":{"name":"PLoS Computational Biology","volume":"21 1","pages":"e1012739"},"PeriodicalIF":3.8,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11756795/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142962504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}