Xiaowen Chen, Kyle Crocker, Seppe Kuehn, Aleksandra M Walczak, Thierry Mora
The competition for resources is a defining feature of microbial communities. In many contexts, from soils to host-associated communities, highly diverse microbes are organized into metabolic groups or guilds with similar resource preferences. The resource preferences of individual taxa that give rise to these guilds are critical for understanding fluxes of resources through the community and the structure of diversity in the system. However, inferring the metabolic capabilities of individual taxa, and their competition with other taxa, within a community is challenging and unresolved. Here we address this gap in knowledge by leveraging dynamic measurements of abundances in communities. We show that simple correlations are often misleading in predicting resource competition. We show that spectral methods such as the cross-power spectral density (CPSD) and coherence that account for time-delayed effects are superior metrics for inferring the structure of resource competition in communities. We first demonstrate this fact on synthetic data generated from consumer-resource models with time-dependent resource availability, where taxa are organized into groups or guilds with similar resource preferences. By applying spectral methods to oceanic plankton time-series data, we demonstrate that these methods detect interaction structures among species with similar genomic sequences. Our results indicate that analyzing temporal data across multiple timescales can reveal the underlying structure of resource competition within communities.
{"title":"Inferring resource competition in microbial communities from time series.","authors":"Xiaowen Chen, Kyle Crocker, Seppe Kuehn, Aleksandra M Walczak, Thierry Mora","journal":{"name":"ArXiv"},"publicationDate":"2025-01-17","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759850/pdf/","platform":"Semanticscholar","paperid":"143049191"}
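As a rough illustration of why the abstract above favors spectral measures over simple correlations, the following sketch builds two toy abundance series driven by the same periodic signal with a quarter-period delay. It is not the authors' pipeline: the series, noise level, driving frequency `f0`, and segment length `nperseg` are all assumptions chosen for the example. The zero-lag Pearson correlation is close to zero, while the coherence at the driving frequency stays high and the CPSD phase recovers the delay.

```python
import numpy as np
from scipy.signal import coherence, csd

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions, not from the paper).
n_samples = 4096
fs = 1.0          # one sample per time unit
f0 = 1.0 / 32.0   # frequency of the shared periodic driver
t = np.arange(n_samples)

# Two "taxa" follow the same driver, the second with a quarter-period delay.
x = np.sin(2 * np.pi * f0 * t) + 0.5 * rng.standard_normal(n_samples)
y = np.sin(2 * np.pi * f0 * t - np.pi / 2) + 0.5 * rng.standard_normal(n_samples)

# Zero-lag correlation misses the delayed relationship.
pearson = np.corrcoef(x, y)[0, 1]

# Welch estimates of the cross-power spectral density and the coherence.
freqs, pxy = csd(x, y, fs=fs, nperseg=256)
_, cxy = coherence(x, y, fs=fs, nperseg=256)

# Read both quantities at the frequency bin closest to the driver.
k = int(np.argmin(np.abs(freqs - f0)))
peak_coherence = cxy[k]
phase_lag = np.angle(pxy[k])  # magnitude is close to pi/2 for this delay

print(f"Pearson correlation: {pearson:.3f}")
print(f"Coherence at f0:     {peak_coherence:.3f}")
print(f"|CPSD phase| at f0:  {abs(phase_lag):.3f} rad")
```

The segment length is chosen so that `f0` falls exactly on a frequency bin; with real data the frequencies of interest are not known in advance, and the whole coherence spectrum would be inspected instead.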
Stability in recurrent neural models poses a significant challenge, particularly in developing biologically plausible neurodynamical models that can be seamlessly trained. Traditional cortical circuit models are notoriously difficult to train due to expansive nonlinearities in the dynamical system, leading to an optimization problem with nonlinear stability constraints that are difficult to impose. Conversely, recurrent neural networks (RNNs) excel in tasks involving sequential data but lack biological plausibility and interpretability. In this work, we address these challenges by linking dynamic divisive normalization (DN) to the stability of "oscillatory recurrent gated neural integrator circuits" (ORGaNICs), a biologically plausible recurrent cortical circuit model that dynamically achieves DN and that has been shown to simulate a wide range of neurophysiological phenomena. By using the indirect method of Lyapunov, we prove the remarkable property of unconditional local stability for an arbitrary-dimensional ORGaNICs circuit when the recurrent weight matrix is the identity. We thus connect ORGaNICs to a system of coupled damped harmonic oscillators, which enables us to derive the circuit's energy function, providing a normative principle of what the circuit, and individual neurons, aim to accomplish. Further, for a generic recurrent weight matrix, we prove the stability of the 2D model and demonstrate empirically that stability holds in higher dimensions. Finally, we show that ORGaNICs can be trained by backpropagation through time without gradient clipping/scaling, thanks to its intrinsic stability property and adaptive time constants, which address the problems of exploding, vanishing, and oscillating gradients. By evaluating the model's performance on RNN benchmarks, we find that ORGaNICs outperform alternative neurodynamical models on static image classification tasks and perform comparably to LSTMs on sequential tasks.
{"title":"Unconditional stability of a recurrent neural circuit implementing divisive normalization.","authors":"Shivang Rawat, David J Heeger, Stefano Martiniani","journal":{"name":"ArXiv"},"publicationDate":"2025-01-15","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11469413/pdf/","platform":"Semanticscholar","paperid":"142482904"}
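The indirect method of Lyapunov mentioned in the abstract above can be sketched on a toy system: linearize the dynamics at a fixed point and check that every eigenvalue of the Jacobian has a negative real part. The example below uses a damped pendulum, which is an assumption made purely for illustration and is not the ORGaNICs circuit; the helper names `numerical_jacobian` and `is_locally_stable` are likewise hypothetical.

```python
import numpy as np

def damped_pendulum(state, damping=0.5):
    """Toy dynamics (illustrative only): theta'' = -sin(theta) - damping * theta'."""
    theta, omega = state
    return np.array([omega, -np.sin(theta) - damping * omega])

def numerical_jacobian(f, x0, eps=1e-6):
    """Central-difference Jacobian of f evaluated at x0."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(x0 + step) - f(x0 - step)) / (2 * eps)
    return jac

def is_locally_stable(f, fixed_point):
    """Indirect method: stable if all Jacobian eigenvalues have negative real part."""
    eigvals = np.linalg.eigvals(numerical_jacobian(f, fixed_point))
    return bool(np.all(eigvals.real < 0)), eigvals

# Hanging equilibrium (theta = 0) versus inverted equilibrium (theta = pi).
stable_down, eig_down = is_locally_stable(damped_pendulum, [0.0, 0.0])
stable_up, eig_up = is_locally_stable(damped_pendulum, [np.pi, 0.0])

print("theta = 0 :", stable_down, eig_down)
print("theta = pi:", stable_up, eig_up)
```

The test is only local: it says nothing about the basin of attraction, and it is inconclusive when an eigenvalue has zero real part.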
Alexandra N Busch, Roberto C Budzinski, Federico W Pasini, Ján Mináč, Jonathan A Michaels, Megan Roussy, Roberto A Gulli, Benjamin W Corrigan, J Andrew Pruszynski, Julio Martinez-Trujillo, Lyle E Muller
Recent advances in neural recording technology allow simultaneously recording action potentials from hundreds to thousands of neurons in awake, behaving animals. However, characterizing spike patterns in the resulting data, and linking these patterns to behaviour, remains a challenging task. The lack of a rigorous mathematical language for variable numbers of events (spikes) emitted by multiple agents (neurons) is an important limiting factor. We introduce a new mathematical operation to decompose complex spike patterns into a set of simple, structured elements. This creates a mathematical language that allows comparing spike patterns across trials, detecting sub-patterns, and making links to behaviour via a clear distance measure. We first demonstrate the method using Neuropixel recordings from macaque motor cortex. We then apply the method to dual Utah array recordings from macaque prefrontal cortex, where this technique reveals previously unseen structure that can predict both memory-guided decisions and errors in a virtual-reality working memory task. These results demonstrate that this technique provides a powerful new approach to understand structure in the spike times of neural populations, at a scale that will continue to grow more and more rapidly in upcoming years.
{"title":"A mathematical language for linking fine-scale structure in spikes from hundreds to thousands of neurons with behaviour.","authors":"Alexandra N Busch, Roberto C Budzinski, Federico W Pasini, Ján Mináč, Jonathan A Michaels, Megan Roussy, Roberto A Gulli, Benjamin W Corrigan, J Andrew Pruszynski, Julio Martinez-Trujillo, Lyle E Muller","journal":{"name":"ArXiv"},"publicationDate":"2025-01-15","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11643227/pdf/","platform":"Semanticscholar","paperid":"142831101"}
Franklin Y Ruan, Aiwei Zhang, Jenny Y Oh, SouYoung Jin, Nicholas C Jacobson
Pretrained foundation models and transformer architectures have driven the success of large language models (LLMs) and other modern AI breakthroughs. However, similar advancements in health data modeling remain limited due to the need for innovative adaptations. Wearable movement data offers a valuable avenue for exploration, as it's a core feature in nearly all commercial smartwatches, well established in clinical and mental health research, and the sequential nature of the data shares similarities to language. We introduce the Pretrained Actigraphy Transformer (PAT), the first open source foundation model designed for time-series wearable movement data. Leveraging transformer-based architectures and novel techniques, such as patch embeddings, and pretraining on data from 29,307 participants in a national U.S. sample, PAT achieves state-of-the-art performance in several mental health prediction tasks. PAT is also lightweight and easily interpretable, making it a robust tool for mental health research. GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/.
{"title":"AI Foundation Models for Wearable Movement Data in Mental Health Research.","authors":"Franklin Y Ruan, Aiwei Zhang, Jenny Y Oh, SouYoung Jin, Nicholas C Jacobson","journal":{"name":"ArXiv"},"publicationDate":"2025-01-14","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11623705/pdf/","platform":"Semanticscholar","paperid":"142804036"}
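The patch embeddings mentioned in the abstract above can be pictured as cutting a long movement recording into short fixed-length windows and projecting each window to a token vector. The sketch below is a minimal illustration, not PAT's implementation: the patch size, embedding width, simulated activity counts, and the random (rather than learned) projection are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

minutes_per_week = 7 * 24 * 60  # 10080 minute-level samples
patch_size = 18                 # hypothetical patch length in minutes
embed_dim = 32                  # hypothetical token width

# Simulated minute-level activity counts standing in for real actigraphy.
activity = rng.poisson(3.0, size=minutes_per_week).astype(float)

# Split the series into non-overlapping patches: one row per patch.
patches = activity.reshape(-1, patch_size)

# A linear projection maps each patch to a token; a trained model would
# learn these weights, here they are random placeholders.
projection = rng.standard_normal((patch_size, embed_dim)) / np.sqrt(patch_size)
tokens = patches @ projection

print(patches.shape, tokens.shape)
```

Patching shortens the sequence a transformer has to attend over, here from 10080 time steps to 560 tokens.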
Roman Bushuiev, Anton Bushuiev, Niek F de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D Mak, Soha Hassoun, Florian Huber, Justin J J van der Hooft, Michael A Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
{"title":"MassSpecGym: A benchmark for the discovery and identification of molecules.","authors":"Roman Bushuiev, Anton Bushuiev, Niek F de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D Mak, Soha Hassoun, Florian Huber, Justin J J van der Hooft, Michael A Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal","journal":{"name":"ArXiv"},"publicationDate":"2025-01-14","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11581121/pdf/","platform":"Semanticscholar","paperid":"142689948"}
Practical identifiability is a critical concern in data-driven modeling of mathematical systems. In this paper, we propose a novel framework for practical identifiability analysis to evaluate parameter identifiability in mathematical models of biological systems. Starting with a rigorous mathematical definition of practical identifiability, we demonstrate its equivalence to the invertibility of the Fisher Information Matrix. Our framework establishes the relationship between practical identifiability and coordinate identifiability, introducing a novel metric that simplifies and accelerates the evaluation of parameter identifiability compared to the profile likelihood method. Additionally, we introduce new regularization terms to address non-identifiable parameters, enabling uncertainty quantification and improving model reliability. To guide experimental design, we present an optimal data collection algorithm that ensures all model parameters are practically identifiable. Applications to Hill functions, neural networks, and dynamic biological models demonstrate the feasibility and efficiency of the proposed computational framework in uncovering critical biological processes and identifying key observable variables.
{"title":"A Systematic Computational Method for Practical Identifiability Analysis in Mathematical Models Arising from Biology.","authors":"Shun Wang, Wenrui Hao","journal":{"name":"ArXiv"},"publicationDate":"2025-01-12","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11722522/pdf/","platform":"Semanticscholar","paperid":"142973702"}
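The link between practical identifiability and the Fisher Information Matrix (FIM) described in the abstract above can be illustrated on a Hill function. The sketch below is an assumption-laden toy, not the authors' framework or their proposed metric: it assumes independent Gaussian noise with standard deviation `sigma`, builds the FIM from finite-difference sensitivities, and uses the smallest eigenvalue as a rough proxy for how close the matrix is to singular. The parameter values and the two sampling designs are hypothetical.

```python
import numpy as np

def hill(x, vmax, k, n):
    """Hill function y = vmax * x^n / (k^n + x^n)."""
    return vmax * x**n / (k**n + x**n)

def fisher_information(x, theta, sigma=0.1, eps=1e-6):
    """FIM = S^T S / sigma^2, with S the sensitivity matrix dy/dtheta."""
    theta = np.asarray(theta, dtype=float)
    sens = np.zeros((x.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        # Central finite difference with respect to parameter j.
        sens[:, j] = (hill(x, *(theta + step)) - hill(x, *(theta - step))) / (2 * eps)
    return sens.T @ sens / sigma**2

theta = [1.0, 2.0, 3.0]  # hypothetical (vmax, k, n)

# Design A samples around the half-saturation point k; design B samples
# only deep in the saturated regime, where k and n barely affect the output.
x_good = np.linspace(0.5, 6.0, 12)
x_poor = np.linspace(40.0, 60.0, 12)

eig_good = np.linalg.eigvalsh(fisher_information(x_good, theta))
eig_poor = np.linalg.eigvalsh(fisher_information(x_poor, theta))

print("smallest FIM eigenvalue, design A:", eig_good[0])
print("smallest FIM eigenvalue, design B:", eig_poor[0])
```

A much smaller leading eigenvalue for design B signals that some parameter combination is nearly unconstrained by those measurements, which is the situation a data collection strategy would try to avoid.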
Jakub R Kaczmarzyk, Rishul Sharma, Peter K Koo, Joel H Saltz
Foundation models for computational pathology have shown great promise for specimen-level tasks and are increasingly accessible to researchers. However, specimen-level models built on these foundation models remain largely unavailable, hindering their broader utility and impact. To address this gap, we developed SpinPath, a toolkit designed to democratize specimen-level deep learning by providing a zoo of pretrained specimen-level models, a Python-based inference engine, and a JavaScript-based inference platform. We demonstrate the utility of SpinPath in metastasis detection tasks across nine foundation models. SpinPath may foster reproducibility, simplify experimentation, and accelerate the adoption of specimen-level deep learning in computational pathology research.
{"title":"Reusable specimen-level inference in computational pathology.","authors":"Jakub R Kaczmarzyk, Rishul Sharma, Peter K Koo, Joel H Saltz","journal":{"name":"ArXiv"},"publicationDate":"2025-01-10","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759856/pdf/","platform":"Semanticscholar","paperid":"143049216"}
Maxwell Sanderford, Sudip Sharma, Glen Stecher, Jun Liu, Jieping Ye, Sudhir Kumar
Evolutionary sparse learning (ESL) uses a supervised machine learning approach, Least Absolute Shrinkage and Selection Operator (LASSO), to build models explaining the relationship between a hypothesis and the variation across genomic features (e.g., sites) in sequence alignments. ESL employs sparsity between and within the groups of genomic features (e.g., genomic loci) by using sparse-group LASSO. Although some software packages are available for performing sparse-group LASSO, we found them less well-suited for processing and analyzing genome-scale data containing millions of features, such as bases. MyESL software fills the need for open-source software for conducting ESL analyses with facilities to pre-process the input hypotheses and large alignments, make LASSO flexible and computationally efficient, and post-process the output model to produce different metrics useful in functional or evolutionary genomics. MyESL can take phylogenetic trees and sequence alignments as input and transform them into numeric responses and features, respectively. The model outputs are processed into user-friendly text and graphical files. The computational core of MyESL is written in C++, which offers model building with or without group sparsity, while the pre- and post-processing of inputs and model outputs is performed using customized functions written in Python. One of its applications in phylogenomics showcases the utility of MyESL. Our analysis of empirical genome-scale datasets shows that MyESL can build evolutionary models quickly and efficiently on a personal desktop, while other computational packages were unable to do so because of their prohibitive requirements of computational resources and time. MyESL is available for Python environments on Linux and distributed as a standalone application for Windows and macOS. It is available from https://github.com/kumarlabgit/MyESL.
{"title":"MyESL: Sparse learning in molecular evolution and phylogenetic analysis.","authors":"Maxwell Sanderford, Sudip Sharma, Glen Stecher, Jun Liu, Jieping Ye, Sudhir Kumar","journal":{"name":"ArXiv"},"publicationDate":"2025-01-09","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760232/pdf/","platform":"Semanticscholar","paperid":"143049196"}
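To make the role of LASSO in the abstract above concrete, the sketch below fits a plain LASSO model by proximal gradient descent (ISTA) on simulated binary features, loosely mimicking encoded alignment sites. It is not MyESL's code and it omits the group penalty of sparse-group LASSO; the data, the penalty `lam`, the iteration count, and the function names are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) * ||Xw - y||^2 + lam * ||w||_1 with ISTA."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
n_samples, n_features = 200, 50

# Simulated binary features; only features 3 and 17 carry signal.
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.1 * rng.standard_normal(n_samples)

# Center the features and the response so that no intercept is needed.
X -= X.mean(axis=0)
y -= y.mean()

w = lasso_ista(X, y, lam=0.1)
support = {int(j) for j in np.flatnonzero(w)}
print("selected features:", sorted(support))
```

The L1 penalty drives most coefficients exactly to zero, which is what makes the fitted model readable as a short list of informative features.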
We introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% of the graphs correspond to approximately 200,000 known RNA atomic fragments (collected in 2021) using the RNA-as-Graphs (RAG) mapping method. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters, an RNA-like cluster and a non-RNA-like cluster. The distance of each dual graph to the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. From validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs. Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features. These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, and perhaps anti-viral therapeutics, by subgraph assembly.
{"title":"How Large is the Universe of RNA-Like Motifs? A Clustering Analysis of RNA Graph Motifs Using Topological Descriptors.","authors":"Rui Wang, Tamar Schlick","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% graphs correspond to approximately 200,000 known RNA atomic fragments (collected in 2021) using the RNA-as-Graphs (RAG) mapping method. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters, RNA-like cluster and non-RNA-like cluster. The distance of each dual graph to the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. From validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs. Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features. 
These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, perhaps anti-viral therapeutics by subgraph assembly.</p>","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11760235/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pradip K Bera, Molly McCord, Jun Zhang, Jacob Notbohm
In confluent cell monolayers, patterns of cell forces and motion are systematically altered near topological defects in cell shape. In turn, defects have been proposed to alter cell density, extrusion, and invasion, but it remains unclear how the defects form and how they affect cell forces and motion. Here, we studied +1/2 defects, and, in contrast to prior studies, we observed both tail-to-head and head-to-tail defect motion occurring at the same time in the same cell monolayer. We quantified the cell velocities, the tractions at the cell-substrate interface, and stresses within the cell layer near +1/2 defects. Results revealed that both traction and stress are sources of activity within the epithelial cell monolayer, with their competition defining whether the cells inject or dissipate energy and determining the direction of motion of +1/2 defects. Interestingly, patterns of motion, traction, stress, and energy injection near +1/2 defects existed before defect formation, suggesting that defects form as a result of spatially coordinated patterns in cell forces and motion. These findings reverse the current picture, from one in which defects define the cell forces and motion to one in which coordinated patterns of cell forces and motion cause defects to form and move.
{"title":"Energy Dynamics Powered by Traction and Stress Control Formation and Motion of +1/2 Topological Defects in Epithelial Cell Monolayers.","authors":"Pradip K Bera, Molly McCord, Jun Zhang, Jacob Notbohm","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In confluent cell monolayers, patterns of cell forces and motion are systematically altered near topological defects in cell shape. In turn, defects have been proposed to alter cell density, extrusion, and invasion, but it remains unclear how the defects form and how they affect cell forces and motion. Here, we studied +1/2 defects, and, in contrast to prior studies, we observed both tail-to-head and head-to-tail defect motion occurring at the same time in the same cell monolayer. We quantified the cell velocities, the tractions at the cell-substrate interface, and stresses within the cell layer near +1/2 defects. Results revealed that both traction and stress are sources of activity within the epithelial cell monolayer, with their competition defining whether the cells inject or dissipate energy and determining the direction of motion of +1/2 defects. Interestingly, patterns of motion, traction, stress, and energy injection near +1/2 defects existed before defect formation, suggesting that defects form as a result of spatially coordinated patterns in cell forces and motion. 
These findings reverse the current picture, from one in which defects define the cell forces and motion to one in which coordinated patterns of cell forces and motion cause defects to form and move.</p>","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759851/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}