Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag019
sRACIPE 2.0: a systems biology circuit modeling toolkit for random circuit perturbation
Aidan Tillman, Daniel Ramirez, Mingyang Lu
Summary: The Random Circuit Perturbation (RACIPE) algorithm enables the exploration of the dynamical behaviors of gene regulatory circuits (GRCs) by simulating an ensemble of differential equation models via randomization of kinetic parameters. Here, we release sRACIPE 2.0, a major update to the R/Bioconductor package, as a unified platform for modeling GRCs with diverse interaction types using both deterministic and stochastic simulations. The update also introduces new features for modeling perturbation, extrinsic signaling, and time-corrected noise, and a new diagnostic tool to ensure proper simulations. We hope that this release will serve as a versatile modeling toolkit for the systems biology community.
Availability and implementation: The package is available on GitHub at https://github.com/lusystemsbio/sRACIPE under the MIT license. It is also available on Bioconductor at https://www.bioconductor.org/packages/release/bioc/html/sRACIPE.html.
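The core RACIPE idea summarized above (an ensemble of ODE models whose kinetic parameters are randomized) can be sketched in a few lines. sRACIPE itself is an R/Bioconductor package, so the Python toy below is not its API; the two-gene toggle-switch circuit, the shifted-Hill form, and all parameter ranges are illustrative assumptions only.

```python
import random
import math

def shifted_hill(x, x0, n, lam):
    """Shifted Hill function: ~1 when x << x0, ~lam when x >> x0 (lam < 1 inhibits)."""
    h = 1.0 / (1.0 + (x / x0) ** n)
    return h + lam * (1.0 - h)

def simulate_toggle(params, t_end=200.0, dt=0.05):
    """Euler-integrate a two-gene mutual-inhibition circuit to a (near) steady state."""
    gA, gB, kA, kB, x0A, x0B, n, lam = params
    a, b = random.uniform(0, 50), random.uniform(0, 50)  # random initial condition
    steps = int(t_end / dt)
    for _ in range(steps):
        da = gA * shifted_hill(b, x0B, n, lam) - kA * a
        db = gB * shifted_hill(a, x0A, n, lam) - kB * b
        a, b = a + dt * da, b + dt * db
    return a, b

def racipe_like_ensemble(n_models=200, seed=1):
    """Randomize kinetic parameters per model, in the spirit of RACIPE."""
    random.seed(seed)
    states = []
    for _ in range(n_models):
        params = (
            random.uniform(1, 100), random.uniform(1, 100),  # production rates
            random.uniform(0.1, 1), random.uniform(0.1, 1),  # degradation rates
            random.uniform(1, 60), random.uniform(1, 60),    # Hill thresholds
            random.choice([2, 3, 4]),                        # Hill coefficient
            random.uniform(0.01, 0.2),                       # inhibitory fold change
        )
        states.append(simulate_toggle(params))
    return states
```

Clustering the resulting steady-state expression vectors across the ensemble is what reveals the circuit's robust dynamical behaviors.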
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag004
sedimix: a workflow for the analysis of hominin nuclear DNA sequences from sediments
Jierui Xu, Elena I Zavala, Priya Moorjani
Summary: Sediment DNA, the recovery of genetic material from archaeological sediments, is an exciting new frontier in ancient DNA research, offering the potential to study individuals at a given archaeological site without destructive sampling. In recent years, several studies have demonstrated the promise of this approach by extracting hominin DNA from prehistoric sediments, including those dating back to the Middle or Late Pleistocene. However, a lack of open-source workflows for the analysis of hominin sediment DNA samples poses a challenge for data processing and reproducibility of findings across studies. Here, we introduce a Snakemake workflow, sedimix, for processing genomic sequences from archaeological sediment DNA samples to identify hominin sequences and generate relevant summary statistics to assess the reliability of the pipeline. By performing simulations and comparing our results to two published studies with human DNA from ∼25,000 years ago (including shotgun data from a sediment sample and capture data from touch DNA recovered from a deer tooth pendant), we demonstrate that sedimix yields accurate and reliable inferences. sedimix offers a reliable and adaptable framework to aid in the analysis of sediment DNA datasets and improve reproducibility across studies.
Availability and implementation: sedimix is available as open-source software, with the associated code, example data, and user manual with installation instructions available at https://github.com/jierui-cell/sedimix. A permanent archived version of this release is available via Zenodo: https://doi.org/10.5281/zenodo.17244854.
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag014
GAMMA: gap-aware motif mining under incomplete labeling with applications to MHC motifs
Xinyi Tang, Ran Liu
Motivation: Sequence motif identification is crucial for understanding molecular recognition, particularly in immune responses involving peptide binding to major histocompatibility complex (MHC) Class I molecules for antigen presentation to T cells. Traditionally, MHC Class I binding motifs are assumed to be contiguous and span nine amino acids. However, structural evidence suggests that binding may involve nonadjacent residues, challenging the assumptions of existing methods.
Results: In this study, we propose the Gap-Aware Motif Mining Algorithm (GAMMA), a probabilistic framework designed to identify noncontiguous motifs under conditions of incomplete labeling. GAMMA employs Bayesian inference with Markov chain Monte Carlo sampling to jointly estimate motif parameters, binding locations, and the relative spacing between binding positions. Through extensive simulations and real-world applications to MHC Class I peptide datasets, GAMMA outperforms existing motif discovery tools such as GLAM2 in accurately localizing binding residues and identifying the underlying motifs. Notably, our results suggest that the true number of binding residues may be eight, fewer than the commonly assumed nine. In addition, for longer peptides, the model captures increased flexibility in the central region, consistent with structural observations that peptides may bulge in the middle.
Availability and implementation: The raw data and source code are available on GitHub (https://github.com/RanLIUaca/GAMMAmotif).
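GAMMA's actual inference is Bayesian MCMC over motif parameters, binding locations, and spacing. As a much simpler illustration of what a noncontiguous ("gapped") motif placement means, the sketch below scores a fixed position weight matrix against peptides while enumerating a single gap position and length; the function names and scoring scheme are hypothetical, not taken from the package.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pwm_from_counts(columns, pseudo=1.0):
    """Build a log-probability PWM from per-column residue counts (with pseudocounts)."""
    pwm = []
    for col in columns:
        total = sum(col.values()) + pseudo * len(ALPHABET)
        pwm.append({aa: math.log((col.get(aa, 0) + pseudo) / total) for aa in ALPHABET})
    return pwm

def gapped_score(peptide, pwm, gap_after, gap_len):
    """Score a peptide against a motif whose columns skip `gap_len` residues
    just before motif column `gap_after` (a noncontiguous placement)."""
    score, j = 0.0, 0
    for i, col in enumerate(pwm):
        if i == gap_after:
            j += gap_len
        score += col[peptide[j]]
        j += 1
    return score

def best_gap(peptides, pwm, max_gap=3):
    """Enumerate gap placements; return (avg log-likelihood, gap_after, gap_len)."""
    best = None
    for gap_after in range(1, len(pwm)):
        for gap_len in range(0, max_gap + 1):
            usable = [p for p in peptides if len(p) >= len(pwm) + gap_len]
            if not usable:
                continue
            ll = sum(gapped_score(p, pwm, gap_after, gap_len) for p in usable) / len(usable)
            if best is None or ll > best[0]:
                best = (ll, gap_after, gap_len)
    return best
```

On peptides whose conserved residues are interrupted by one variable position, the enumeration recovers the gap, which is the flavor of spacing estimation that GAMMA performs probabilistically.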
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag017
Unsupervised synchronization of molecular dynamics trajectories via graph embedding and time warping
Manuel Mangoni, Salvatore Daniele Bianco, Francesco Petrizzelli, Michele Pieroni, Pietro Hiram Guzzi, Viviana Caputo, Tommaso Biagini, Tommaso Mazza
Motivation: Molecular dynamics (MD) simulations provide detailed atomistic insights into biomolecular processes, but comparing independent trajectories remains challenging due to stochastic divergence. Misaligned simulations can obscure shared mechanisms or exaggerate differences, limiting reproducibility and mechanistic interpretation. A generalizable, unsupervised method for synchronizing and comparing MD trajectories across systems and conditions is therefore needed.
Results: We introduce NetMD, a computational framework that synchronizes and analyzes MD trajectories by integrating graph-based representations with dynamic time warping. Trajectory frames are converted into residue-contact graphs, entropy-filtered to retain variable interactions, and embedded as low-dimensional vectors. NetMD aligns these vectorized trajectories through time-warping barycenter averaging, generating a consensus trajectory while pruning outlier simulations. Applied to transporters, demethylases, and large protein complexes relevant to neurological disease pathways and cancer, NetMD revealed shared multiphase dynamics and identified mutation- or ligand-specific deviations. This unsupervised, time-resolved approach enables direct comparison of MD ensembles across heterogeneous conditions. NetMD is robust and broadly applicable, providing a tool for uncovering conserved patterns and critical divergences in biomolecular dynamics.
Availability and implementation: NetMD is freely available at https://github.com/mazzalab/NetMD.
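The pipeline described above (frames to residue-contact graphs to vectors, then alignment by dynamic time warping) can be illustrated with a toy version. NetMD additionally applies entropy filtering, low-dimensional embedding, and barycenter averaging, none of which are shown here; the contact cutoff and data shapes are arbitrary assumptions.

```python
import math

def contact_vector(coords, cutoff=1.5):
    """Flatten one frame's residue-contact graph (pairwise distance < cutoff) to a 0/1 vector."""
    n = len(coords)
    vec = []
    for i in range(n):
        for j in range(i + 1, n):
            vec.append(1.0 if math.dist(coords[i], coords[j]) < cutoff else 0.0)
    return vec

def dtw_align(seq_a, seq_b):
    """Classic dynamic-time-warping distance between two sequences of frame vectors."""
    na, nb = len(seq_a), len(seq_b)
    cost = [[math.inf] * (nb + 1) for _ in range(na + 1)]
    cost[0][0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[na][nb]
```

Because DTW permits many-to-one frame matching, two trajectories that pass through the same conformational phases at different speeds can still be aligned, which is the synchronization problem NetMD addresses.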
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag021
ChemGenXplore: an interactive tool for exploring and analysing chemical genomic data
Huda Ahmad, Hannah M Doherty, Sam T Benedict, James R J Haycocks, Ge Zhou, Patrick J Moynihan, Danesh Moradigaravand, Manuel Banzhaf
Motivation: Chemical genomics is a powerful high-throughput approach to systematically link phenotypes to genotypes. However, the vast datasets generated remain challenging to explore due to the lack of integrated, interactive tools for visualization and analysis. Existing workflows often require multiple independent software tools, limiting data accessibility and collaboration. Therefore, we created a user-friendly platform that enables efficient exploration and sharing of chemical genomics data.
Results: We developed ChemGenXplore, a web-based Shiny application designed to streamline the visualization and analysis of chemical genomic screens. It offers two primary functionalities: one for exploring pre-implemented datasets and another for analysing user-uploaded datasets. ChemGenXplore enables users to visualize phenotypic profiles, assess gene-gene and condition-condition correlations, perform GO and KEGG enrichment analysis, and generate customizable, interactive heatmaps. To further support collaborative research, ChemGenXplore also facilitates the comparative analysis of chemical genomic and other omics datasets. By consolidating these features into a single interactive and accessible tool, ChemGenXplore facilitates data sharing, enhances reproducibility, and promotes collaboration within the research community.
Availability and implementation: ChemGenXplore is freely accessible as a web application at https://chemgenxplore.kaust.edu.sa/. Source code and documentation, including instructions for local installation, are provided on GitHub (https://github.com/Hudaahmadd/ChemGenXplore). A Docker image is also available on DockerHub (https://hub.docker.com/r/hudaahmad/chemgenxplore) to ensure reproducibility and simplify installation.
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag011
Improving biomedical entity linking with generative relevance feedback
Darya Shlyk, Lawrence Hunter
Motivation: Biomedical Entity Linking (BEL) maps mentions in biomedical text to standardized identifiers, enabling structured data integration and downstream knowledge discovery. However, current BEL systems remain fundamentally constrained by the recall of the initial candidate pool, where suboptimal retrieval limits the overall effectiveness of the normalization pipeline.
Results: We present the first systematic evaluation of Generative Relevance Feedback (GRF) for enhancing candidate retrieval in state-of-the-art BEL systems. GRF leverages large language models (LLMs) to enrich the expressiveness of the mention in a zero-shot fashion. We assess GRF's impact under two scenarios (direct linking prediction and candidate generation in cascading normalization pipelines) and analyze its sensitivity to different LLMs, feedback types, and integration strategies. Experiments across eight corpora and four biomedical knowledge bases demonstrate that integrating GRF significantly improves both accuracy and recall, thereby increasing the upper bound on normalization performance. Our findings highlight GRF as an efficient, model-agnostic solution and underscore its potential as a key component for advancing BEL.
Availability and implementation: The code to reproduce our experiments can be found at https://doi.org/10.5281/zenodo.17853541.
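As a schematic of generative relevance feedback, the sketch below appends model-generated context to a mention before candidate retrieval. Real GRF systems query an LLM and use learned retrievers; here the `feedback` string and the token-overlap ranking are stand-ins, and the knowledge-base layout is assumed.

```python
def expand_mention(mention, feedback):
    """Generative relevance feedback, schematically: append generated context
    terms to the mention before retrieval (`feedback` stands in for LLM output)."""
    return mention + " " + feedback

def retrieve(query, kb, top_k=2):
    """Rank knowledge-base entries (id -> name string) by token overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        kb.items(),
        key=lambda kv: -len(q & set(kv[1].lower().split())),
    )
    return [entry_id for entry_id, _ in scored[:top_k]]
```

The point of the expansion step is that a terse mention gains discriminative tokens, so the correct concept enters the candidate pool more often, raising the recall ceiling that the abstract identifies as the bottleneck.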
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag023
scSNViz: visualization and analysis of cell-specific expressed SNVs
Siera Martinez, Tushar Sharma, Luke Johnson, Allen Kim, Vania Ballesteros Prieto, Hovhannes Arestakesyan, Sunisha Harish, Jewel Dias, Joseph Goldfrank, Nathan Edwards, Anelia Horvath
Motivation: Accurately characterizing expressed genetic variation at the single-cell level is essential for understanding transcriptional heterogeneity, allelic regulation, and mutational dynamics within complex tissues. However, few tools enable comprehensive visualization and quantitative analysis of expressed variants across individual cells.
Results: scSNViz is an R package for the exploration, quantification, and visualization of expressed single-nucleotide variants (SNVs) from cell-barcoded single-cell RNA sequencing (scRNA-seq) data. The software supports estimation of variant allele fractions, clustering of SNV expression profiles, and 2D and 3D visualization of individual SNVs or user-defined SNV groups. Beyond visualization, scSNViz facilitates investigation of cell-, cluster-, or lineage-specific variant expression patterns, as well as allelic dynamics including imprinting, random allele inactivation, and transcriptional bursting. It interoperates seamlessly with established single-cell frameworks (Seurat for clustering, Slingshot for trajectory inference, scType for cell-type annotation, and CopyKat for copy-number profiling), enabling integrative multi-omic analyses of expressed variation.
Availability and implementation: scSNViz is implemented in R and freely available at https://github.com/HorvathLab/scSNViz (DOI: 10.5281/zenodo.17307516). The package includes comprehensive documentation and example workflows designed for users with limited bioinformatics experience.
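One quantity central to the description above, the per-cell variant allele fraction, is simple to state concretely. The sketch below is a generic illustration, not scSNViz's R interface; the nested-dict input layout and function names are assumptions.

```python
def variant_allele_fraction(ref_count, alt_count):
    """VAF = alt / (ref + alt); None when the site has no coverage in this cell."""
    total = ref_count + alt_count
    return None if total == 0 else alt_count / total

def per_cell_vaf(counts):
    """counts: {cell_barcode: {snv_id: (ref_reads, alt_reads)}} -> {cell: {snv: vaf}}."""
    return {
        cell: {snv: variant_allele_fraction(r, a) for snv, (r, a) in snvs.items()}
        for cell, snvs in counts.items()
    }
```

Distinguishing zero coverage (no information) from a VAF of 0 (reference-only expression) matters for the allelic-dynamics analyses the abstract mentions, such as random allele inactivation.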
Pub Date: 2026-01-03 | DOI: 10.1093/bioinformatics/btag012
mimicDetector: a pipeline for protein motif mimicry detection in host-pathogen interactions
Kaylee D Rich, James D Wasmuth
Motivation: Molecular mimicry is used by pathogens to evade the host immune system and manipulate other host cellular processes. It is often mediated by short motifs in non-homologous proteins, whose detection challenges the sensitivity and specificity of existing bioinformatics tools.
Results: We present mimicDetector, a k-mer-based pipeline for identifying protein-level molecular mimicry between pathogens and their hosts. Applied to 17 globally important pathogens, mimicDetector identified a broad and biologically plausible set of mimicry candidates, including helminth proteins mimicking components of the human complement system and a Leishmania infantum mimic of Reticulon-4, a regulator of immune cell recruitment.
Availability and implementation: mimicDetector is freely available at https://github.com/kayleerich/mimicDetector/, implemented in Python and Snakemake, and compatible with Unix-based systems.
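The k-mer matching at the heart of such a pipeline can be sketched directly. Note that the real tool also screens out homologous proteins (mimicry is only interesting between non-homologs) and scales to whole proteomes; this toy omits both, and the function names are invented for illustration.

```python
def kmers(seq, k):
    """All length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mimicry_candidates(pathogen_prots, host_prots, k=5):
    """Report (pathogen_id, host_id, kmer) triples for exact shared k-mers,
    a rough stand-in for the matching stage of a k-mer mimicry pipeline."""
    host_index = {}
    for hid, hseq in host_prots.items():
        for km in kmers(hseq, k):
            host_index.setdefault(km, set()).add(hid)
    hits = []
    for pid, pseq in pathogen_prots.items():
        for km in sorted(kmers(pseq, k)):
            for hid in sorted(host_index.get(km, ())):
                hits.append((pid, hid, km))
    return hits
```

Indexing the host proteome once and streaming pathogen k-mers against it keeps the comparison linear in total sequence length rather than quadratic in the number of protein pairs.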
Pub Date : 2026-01-02DOI: 10.1093/bioinformatics/btaf658
Jennifer Fouquier, Maggie Stanislawski, John O'Connor, Ashley Scadden, Catherine Lozupone
Motivation: Longitudinal microbiome studies (LMS) are increasingly common but have analytic challenges including nonindependent data requiring mixed-effects models. Furthermore, large amounts of data motivate exploratory analysis to identify factors related to outcome variables. Although change analysis (i.e. calculating feature changes between timepoints) can be powerful, how to best conduct these analyses is often unclear. For example, observational LMS measurements show natural fluctuations, so baseline might not be a reference of primary interest, whereas for interventional LMS, baseline is typically a key reference point, often indicating the start of treatment.
Results: To address these challenges, a feature selection workflow for LMS, called EXPLANA (EXPLoratory ANAlysis), was developed; it supports numerical and categorical data and also accommodates cross-sectional studies. Machine learning methods were combined with different types of change calculations and downstream interpretation methods to identify statistically meaningful variables and explain their relationship to outcomes. EXPLANA generates an interactive report that textually and graphically summarizes methods and results. EXPLANA had good performance on simulated longitudinal data, with a balanced accuracy score of 0.91 (range: 0.79-1.00, SD = 0.05), outperformed an existing tool, QIIME 2 feature-volatility (balanced accuracy: 0.95 versus 0.56), and identified novel order-dependent categorical feature changes (e.g. different effects for A_B versus B_A). EXPLANA is broadly applicable and simplifies analytics for identifying features related to outcomes of interest.
Availability and implementation: Software is available at https://github.com/JTFouquier/explana and https://zenodo.org/records/17478745 (10.5281/zenodo.17478744). Documentation and demos are available at www.explana.io.
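The two reference-point choices discussed in the Motivation (baseline versus previous timepoint), and the order-dependent categorical transitions from the Results, can be sketched as follows. This is an illustrative sketch only, not EXPLANA's API; all function names are hypothetical.

```python
# Sketch of the change calculations discussed above (illustrative; not
# EXPLANA's implementation). For interventional studies, changes are often
# referenced to baseline (start of treatment); for observational studies,
# change from the previous timepoint may be more informative.

def changes_from_baseline(values):
    """Change of each later timepoint relative to the first measurement."""
    baseline = values[0]
    return [v - baseline for v in values[1:]]

def changes_from_previous(values):
    """Change of each timepoint relative to the preceding one."""
    return [b - a for a, b in zip(values, values[1:])]

def categorical_transitions(states):
    """Encode ordered categorical changes, so A_B and B_A stay distinct."""
    return [f"{a}_{b}" for a, b in zip(states, states[1:])]

abundance = [0.10, 0.15, 0.12, 0.30]   # one microbial feature over 4 timepoints
print([round(x, 2) for x in changes_from_baseline(abundance)])  # [0.05, 0.02, 0.2]
print([round(x, 2) for x in changes_from_previous(abundance)])  # [0.05, -0.03, 0.18]
print(categorical_transitions(["A", "B", "A"]))                 # ['A_B', 'B_A']
```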
{"title":"EXPLANA: a user-friendly workflow for EXPLoratory ANAlysis and feature selection in cross-sectional and longitudinal microbiome studies.","authors":"Jennifer Fouquier, Maggie Stanislawski, John O'Connor, Ashley Scadden, Catherine Lozupone","doi":"10.1093/bioinformatics/btaf658","DOIUrl":"10.1093/bioinformatics/btaf658","url":null,"abstract":"<p><strong>Motivation: </strong>Longitudinal microbiome studies (LMS) are increasingly common but have analytic challenges including nonindependent data requiring mixed-effects models. Furthermore, large amounts of data motivate exploratory analysis to identify factors related to outcome variables. Although change analysis (i.e. calculating feature changes between timepoints) can be powerful, how to best conduct these analyses is often unclear. For example, observational LMS measurements show natural fluctuations, so baseline might not be a reference of primary interest, whereas for interventional LMS, baseline is typically a key reference point, often indicating the start of treatment.</p><p><strong>Results: </strong>To address these challenges, a feature selection workflow, called EXPLANA (EXPLoratory ANAlysis), was developed for LMS that supports numerical and categorical data, and also accommodates cross-sectional studies. Machine learning methods were combined with different types of change calculations and downstream interpretation methods to identify statistically meaningful variables and explain their relationship to outcomes. EXPLANA generates an interactive report that textually and graphically summarizes methods and results. EXPLANA had good performance on simulated longitudinal data, with a balanced accuracy score of 0.91 (range: 0.79-1.00, SD = 0.05), outperformed an existing tool, QIIME 2 feature-volatility (balanced accuracy: 0.95 versus 0.56) and identified novel order-dependent categorical feature changes (e.g. different effect for A_B versus B_A). EXPLANA is broadly applicable and simplifies analytics for identifying features related to outcomes of interest.</p><p><strong>Availability and implementation: </strong>Software is available at https://github.com/JTFouquier/explana and https://zenodo.org/records/17478745 (10.5281/zenodo.17478744). Documentation and demos are available at www.explana.io.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12766912/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145795433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivation: Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning methods for nanopores are jeopardized by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.
Results: We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pre-training and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pretraining each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification during fine-tuning. In this study, we retested a publicly available nanopore multiplexed protein sensing dataset for model iteration, and subsequently measured the Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated that NanoSSL achieved unprecedented performance across four metrics (accuracy, precision, recall, and F1 score) when classifying two Aβ1-42 mutants, E22G and G37R. The self-supervised learning and attention mechanism were verified as the sources of the performance gains.
Availability and implementation: The main program is available at https://doi.org/10.5281/zenodo.17172822.
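The fragmentation-and-masking pretext task described in the Results can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not NanoSSL's code: the function names, mask ratio, and zero-masking convention are hypothetical, and the masked-autoencoder reconstruction itself is not shown.

```python
import random

# Sketch of the masked-fragment pretext task described above (illustrative;
# not NanoSSL's implementation). A translocation-event signal is split into
# equal non-overlapping fragments; a random subset is masked (zeroed here),
# and a masked autoencoder would be trained to reconstruct those fragments.

def fragment(signal, n_fragments):
    """Split a signal into n equal non-overlapping fragments (remainder truncated)."""
    size = len(signal) // n_fragments
    return [signal[i * size:(i + 1) * size] for i in range(n_fragments)]

def mask_fragments(fragments, mask_ratio=0.5, rng=None):
    """Zero out a random subset of fragments; return masked copy and mask indices."""
    rng = rng or random.Random()
    n_mask = int(len(fragments) * mask_ratio)
    masked_idx = set(rng.sample(range(len(fragments)), n_mask))
    masked = [[0.0] * len(f) if i in masked_idx else list(f)
              for i, f in enumerate(fragments)]
    return masked, masked_idx

event = [0.1, 0.4, 0.3, 0.2, 0.5, 0.6, 0.2, 0.1]   # toy nanopore current trace
frags = fragment(event, 4)                          # four fragments of length 2
masked, idx = mask_fragments(frags, mask_ratio=0.5, rng=random.Random(0))
print(frags)
print(masked, sorted(idx))
```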
{"title":"NanoSSL: attention mechanism-based self-supervised learning method for protein identification using nanopores.","authors":"Yong Xie, Jindong Li, Ziyan Zhang, Bin Meng, Shuaijian Dai, Yuchen Zhou, Eamonn Kennedy, Niandong Jiao, Haobin Chen, Zhuxin Dong","doi":"10.1093/bioinformatics/btaf657","DOIUrl":"10.1093/bioinformatics/btaf657","url":null,"abstract":"<p><strong>Motivation: </strong>Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning for nanopore is jeopardized by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.</p><p><strong>Results: </strong>We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pre-training and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pretraining each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification in fine-tuning. In this study, we retested a publicly available nanopore multiplexed protein sensing dataset for model iteration, and subsequently measured Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated NanoSSL achieved an unprecedented performance across four metrics: accuracy, precision, recall, and F1 score, when classifying two mutated Aβ1-42, E22G and G37R. The self-supervised learning and attention mechanism were verified as the source of performance gains.</p><p><strong>Availability and implementation: </strong>The main program is available at https://doi.org/10.5281/zenodo.17172822.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":"42 1","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777981/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145919221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}