Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf562
Jiyuan Yang, Nana Wei, Yang Qu, Congcong Hu, Weiwei Zhang, Lin Liu, Hua-Jun Wu, Xiaoqi Zheng
Motivation: Spatial transcriptomics (ST) technologies provide valuable insights into cellular heterogeneity by simultaneously acquiring gene expression profiles and cellular location information. However, the limited diversity and accuracy of "gold standard" datasets hinder the effective and fair benchmarking of the rapidly growing number of ST analysis tools.
Results: To address this issue, we propose Spider, a flexible and comprehensive framework for simulating ST data without requiring real ST data as a reference. By characterizing spatial patterns using cell type proportions and a transition matrix between adjacent cells, Spider can produce more realistic and diverse simulated data and offer greater modeling flexibility than existing simulation methods. Additionally, Spider provides interactive features for customizing the spatial domain, such as zone segmentation and integration of histology imaging data. Benchmark analyses demonstrate that Spider outperforms other simulation tools in preserving the spatial characteristics of real ST data and facilitating the evaluation of downstream analysis methods. Spider is implemented in Python and available at https://github.com/YANG-ERA/Spider.
Availability and implementation: All code and simulated ST data used in this paper are publicly available at https://github.com/YANG-ERA/Spider.
Title: Spider: a flexible and unified framework for simulating spatial transcriptomics data. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790819/pdf/
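The idea of describing a spatial pattern by cell type proportions plus a transition matrix between adjacent cells can be illustrated with a toy sketch. This is not Spider's algorithm or API: the function name `simulate_labels`, the raster-scan order, the example cell types, and the rule that each cell is drawn conditional on a single already-placed neighbour are all assumptions made for illustration.

```python
import random


def simulate_labels(n_rows, n_cols, proportions, transition, seed=0):
    """Toy label simulation on a grid (hypothetical helper, not Spider's API).

    proportions: dict cell_type -> probability, used only for the first cell.
    transition:  dict cell_type -> dict cell_type -> probability that a cell
                 next to the outer type takes the inner type.
    """
    rng = random.Random(seed)
    types = list(proportions)
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            if r == 0 and c == 0:
                weights = proportions                 # no neighbour placed yet
            elif c > 0:
                weights = transition[row[c - 1]]      # condition on left neighbour
            else:
                weights = transition[grid[r - 1][c]]  # condition on upper neighbour
            label = rng.choices(types, weights=[weights[t] for t in types])[0]
            row.append(label)
        grid.append(row)
    return grid


# Illustrative inputs (made up): a strong diagonal makes neighbours share a type.
proportions = {"tumor": 0.5, "immune": 0.3, "stroma": 0.2}
transition = {
    "tumor": {"tumor": 0.8, "immune": 0.1, "stroma": 0.1},
    "immune": {"tumor": 0.1, "immune": 0.8, "stroma": 0.1},
    "stroma": {"tumor": 0.1, "immune": 0.1, "stroma": 0.8},
}

grid = simulate_labels(6, 12, proportions, transition, seed=1)
for row in grid:
    print(" ".join(label[0] for label in row))  # first letter of each cell type
```

Lowering the diagonal entries of `transition` produces a more intermixed layout, which is the kind of control over spatial pattern that a proportions-plus-transitions description allows.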
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf665
Massimiliano S Tagliamonte, Abhinav Sharma, Alberto Riva, Monika Moir, Marco Salemi, Cheryl Baxter, Tulio de Oliveira, Carla N Mavian, Eduan Wilkinson
Summary: Next Generation Sequencing is widely deployed in cholera-endemic regions, yet an end-to-end reproducible pipeline has been lacking that unifies read QC, filtering, reference mapping, variant calling and annotation, recombination screening, extraction of parsimony-informative sites and variant codons, and phylogenetic inference for downstream phylodynamic and epidemiological analyses. This gap slows outbreak investigation and public health response. CholeraSeq is a high-throughput genomics pipeline for cholera genomic surveillance. It ingests consensus genomes, short-read sequence data, and draft assemblies, and scales seamlessly from local to cloud environments. To accelerate epidemiological context placement of new outbreak strains, we provide a curated, ready-to-use core genome alignment compiled from public data, enabling flexible and fast integration of new samples for outbreak investigations.
Availability and implementation: CholeraSeq is freely available on GitHub at https://github.com/CERI-KRISP/CholeraSeq. CholeraSeq is implemented in Nextflow with a modular design that builds upon the nf-core community standards.
Title: CholeraSeq: a comprehensive genomic pipeline for cholera surveillance and near real-time outbreak investigation. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790814/pdf/
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf585
Nir Nitskansky, Kessem Clein, Barak Raveh
Motivation: Biomolecules undergo dynamic transitions among metastable states to carry out their biological functions. Markov State Models (MSMs) effectively capture these metastable states and transitions at a defined temporal scale. However, biomolecular dynamics typically span multiple temporal scales, ranging from fast atomic vibrations to slower conformational changes and folding events.
Results: We introduce multiscale Markov State Models (mMSMs), which capture biomolecular dynamics across multiple temporal resolutions simultaneously via a hierarchy of MSMs, and mMSM-explore, an unsupervised algorithm for generating mMSMs through multiscale adaptive sampling with on-the-fly identification of temporally metastable states. We benchmark our method on a toy system with nested energy minima; on alanine dipeptide, first with and then without assuming prior knowledge of its two reaction coordinates; and finally, on a fast-folding 35-residue miniprotein, where we map folding pathways across scales. We demonstrate efficient mapping of energy landscapes, correct representation of multiscale hierarchies and transition states, accurate inference of stationary probabilities and transition kinetics, as well as de novo identification of underlying slow, intermediate, and fast reaction coordinates. mMSMs reveal how dynamic processes at different scales contribute collectively to the functional mechanisms of biomolecular machines.
Availability and implementation: Python code and instructions are available at https://github.com/ravehlab/mMSM.
Title: Building multiscale Markov state models by systematic mapping of temporal communities. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797069/pdf/
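For readers unfamiliar with Markov State Models, the following minimal sketch shows how a single-scale MSM can be estimated from a discretized trajectory at one lag time, and how its stationary probabilities follow from the transition matrix. It is a textbook-style illustration only, not the mMSM or mMSM-explore method; the function names and the tiny trajectory are hypothetical.

```python
def estimate_transition_matrix(traj, n_states, lag=1):
    """Row-normalized transition counts at a fixed lag (maximum-likelihood MSM)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(traj) - lag):
        counts[traj[t]][traj[t + lag]] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # State never left in the data: treat it as absorbing so that
            # every row still sums to 1.
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(matrix, n_iter=1000):
    """Approximate the stationary distribution by repeated left multiplication."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical discretized trajectory over two metastable states (0 and 1).
traj = [0, 0, 1, 1, 0, 0, 1, 1]
T = estimate_transition_matrix(traj, n_states=2, lag=1)
pi = stationary_distribution(T)
print(T)
print(pi)
```

The `lag` argument fixes the temporal scale of the model, which is why a single MSM captures dynamics at only one resolution.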
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf675
Tim Stohn, Roderick A P M van Eijl, Klaas W Mulder, Lodewyk F A Wessels, Evert Bosdriesz
Motivation: Signal transduction networks regulate many essential biological processes and are frequently aberrant in diseases such as cancer. A mechanistic understanding of such networks, and how they differ between cell populations, is essential to design effective treatment strategies. Typically, such networks are computationally reconstructed based on systematic perturbation experiments, followed by quantification of signaling protein activity. Recent technological advances now allow for the quantification of the activity of many (signaling) proteins simultaneously in single cells. This makes it feasible to reconstruct or quantify signaling networks without performing systematic perturbations.
Results: Here, we introduce single-cell modular response analysis (scMRA) and single-cell comparative network reconstruction (scCNR) to derive signal transduction networks by exploiting the heterogeneity of single-cell (phospho-)protein measurements. The methods treat stochastic variation in total protein abundances as natural perturbation experiments, whose effects propagate through the network and hence facilitate the reconstruction and quantification of the underlying signaling network. scCNR reconstructs cell population-specific networks, where cells from different populations have the same underlying topology, but the interaction strengths can differ between populations. We extensively validated scMRA and scCNR on simulated data, and applied them to unpublished data of (phospho-)protein measurements of EGFR-inhibitor-treated keratinocytes to recover signaling differences downstream of EGFR. scCNR will help to unravel the mechanistic signaling differences between cell populations, and will subsequently guide the development of well-informed treatment strategies.
Availability and implementation: The code used for scCNR in this study has been deposited on Zenodo (https://doi.org/10.5281/zenodo.17600937) and is also available as a Python module at https://github.com/ibivu/scmra. Additionally, data and code to reproduce all figures are available at https://github.com/tstohn/scmra_analysis.
Title: Reconstructing and comparing signal transduction networks from single-cell protein quantification data. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797212/pdf/
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btag005
Alan J S Beavan, Maria Rosa Domingo-Sananes, James O McInerney
Motivation: The presence or absence of some genes in a genome can influence whether other genes are likely to be present or absent. Understanding these gene co-occurrence and avoidance patterns reveals fundamental principles of genome organization, with applications ranging from evolutionary reconstruction to rational design of synthetic genomes.
Results: PanForest, presented here, uses random forest classifiers to predict the presence and absence of genes in genomes from the set of other genes present. Performance statistics output by PanForest reveal how predictable each gene's presence or absence is, based on the presence or absence of other genes in the genome. Further, PanForest produces statistics indicating the importance of each gene in predicting the presence or absence of each other gene. The PanForest software can run serially or in parallel, thereby facilitating the analysis of pangenomes at Network of Life scale. A pangenome of 12 741 accessory genes in 1000 Escherichia coli genomes was analysed in around 5 h using eight processors. To demonstrate PanForest's utility, we present a case study and show that certain genes associated with resistance to antimicrobial drugs reliably predict the presence or absence of other genes associated with resistance to the same drug. Further, we highlight several associations between those genes and others not known to be associated with antimicrobial resistance (AMR), or associated with resistance to other drugs. We envisage PanForest's use in studies from multiple disciplines concerning the dynamics of gene distributions in pangenomes, ranging from biomedical science and synthetic biology to molecular ecology.
Availability and implementation: The software is freely available with a full manual and can be found at www.github.com/alanbeavan/PanForest. DOI: https://doi.org/10.5281/zenodo.17865482.
Title: PanForest: predicting genes in genomes using random forests. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12857576/pdf/
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf631
Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln
Motivation: Protein language models (PLMs) have revolutionized computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black-box nature limits biological interpretation and translation to actionable insights. Bridging this gap requires approaches that maintain predictive performance while providing interpretable explanations of model behaviour.
Results: We present PLM-eXplain (PLM-X), an explainable adapter layer that bridges this gap by factoring PLM embeddings into two complementary components: an interpretable subspace based on established biochemical features, and a residual subspace that retains predictive, non-interpretable information. Using embeddings from ESM2 and ProtBert, PLM-X incorporates well-established properties, including secondary structure and hydropathy, while maintaining high predictive performance. We demonstrate the effectiveness of our approach across three biologically relevant classification tasks: extracellular vesicle association, transmembrane helix prediction, and aggregation propensity prediction. PLM-X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalizable solution for enhancing PLM interpretability across various downstream applications.
Availability and implementation: Source code and models are available at https://github.com/AIT4LIFE-UU/PLM-eXplain/.
Title: PLM-eXplain: divide and conquer the protein embedding space. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790820/pdf/
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf680
Christine S Liu, Jerold Chun
Motivation: Long-read sequencing has made RNA isoform detection and characterization more accessible. While several bioinformatics tools have been developed to examine the data generated by these approaches, a major challenge in the field has been comparing isoform profiles across several samples.
Results: We developed isoSeQL, a tool for compiling long-read transcriptomic data, identifying common and unique isoforms across multiple samples, and extracting and visualizing various metrics. isoSeQL will augment approaches that utilize long-read sequencing to discover novel isoforms and to examine how isoforms vary across different experimental and biological conditions and cell types. We demonstrate how to use isoSeQL with publicly available datasets.
Availability and implementation: isoSeQL is available on GitHub: https://github.com/christine-liu/isoSeQL and Zenodo: https://doi.org/10.5281/zenodo.15717809.
Title: isoSeQL: comparing long-read isoforms across multiple datasets. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790818/pdf/
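One generic way to find isoforms that are common to all samples or unique to one sample is to store per-sample observations in a relational table and query it. The sketch below uses Python's built-in `sqlite3` module with a hypothetical table layout and made-up rows; it does not reproduce isoSeQL's schema, commands, or output, and keying an isoform by its exon-chain string is an assumption for illustration.

```python
import sqlite3

# In-memory database with a hypothetical one-table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (sample TEXT, isoform TEXT, read_count INTEGER)"
)

# Made-up example rows: each isoform is keyed by its exon chain.
rows = [
    ("sampleA", "chr1:100-200,300-400", 12),
    ("sampleA", "chr1:100-200,350-400", 3),
    ("sampleB", "chr1:100-200,300-400", 7),
    ("sampleB", "chr2:50-90,120-180", 5),
]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)

n_samples = conn.execute(
    "SELECT COUNT(DISTINCT sample) FROM observations"
).fetchone()[0]

# Isoforms seen in every sample.
common = [
    r[0]
    for r in conn.execute(
        "SELECT isoform FROM observations GROUP BY isoform "
        "HAVING COUNT(DISTINCT sample) = ? ORDER BY isoform",
        (n_samples,),
    )
]

# Isoforms seen in exactly one sample, with that sample's name.
unique = conn.execute(
    "SELECT isoform, MIN(sample) FROM observations GROUP BY isoform "
    "HAVING COUNT(DISTINCT sample) = 1 ORDER BY isoform"
).fetchall()

print("common:", common)
print("unique:", unique)
```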
Pub Date: 2026-01-02 DOI: 10.1093/bioinformatics/btaf686
Elya Wygoda, Asher Moshe, Nimrod Serok, Edo Dotan, Noa Ecker, Naiel Jabareen, Omer Israeli, Itsik Pe'er, Tal Pupko
Motivation: Sequence simulations along phylogenetic trees play an important role in numerous molecular evolution studies such as benchmarking algorithms for ancestral sequence reconstruction, multiple sequence alignment, and phylogeny inference. They are also used in phylogenetic model-selection tasks, including the inference of selective forces. Recently, Approximate Bayesian Computation (ABC)-based approaches have been developed for inferring parameters of complex evolutionary models, which rely on massive generation of simulated data. For all these applications, computationally efficient sequence simulators are essential.
Results: In this study, we investigate fast algorithms for simulating sequences along a phylogenetic tree, focusing on accelerating the speed-limiting component of the simulation process: handling insertion and deletion (indel) events. We demonstrate that data structures which efficiently store indel events along a tree can substantially accelerate the simulation process compared to a naive approach. To illustrate the utility of this efficient simulator, we integrated it into an ABC-based algorithm for inferring indel model parameters and applied it to study indel dynamics within Chiroptera.
Availability and implementation: The source code for the different simulation algorithms, alongside the data used, is available at: https://github.com/nimrodSerokTAU/evo-sim. The simulator has also been integrated into SpartaABC, a website for the inference of indel parameters, accessible at: https://spartaabc.tau.ac.il/.
Title: Efficient algorithms for simulating sequences along a phylogenetic tree. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797210/pdf/
Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf632
Jens Zentgraf, Johanna Elena Schmitz, Sven Rahmann
Motivation: When working with DNA data from human-derived microbiomes, the first step is to remove human contamination, for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data that partly contain human sequences cannot easily be processed further or published. Second, human contamination may cause problems in downstream analyses such as metagenomic binning or genome assembly. For large-scale metagenomics projects, fast and accurate removal of human contamination is therefore critical.
Results: We introduce Cleanifier, a fast and memory-frugal alignment-free tool for detecting and removing human contamination based on gapped k-mers, or spaced seeds. Cleanifier uses a pangenome index of known human gapped k-mers; alternative references can also be created and used. Reads are classified and filtered according to their gapped k-mer content. Cleanifier supports two filtering modes: one that queries all gapped k-mers and one that queries only a sample of them. A comparison of Cleanifier with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method with comparable accuracy. When using a probabilistic Cuckoo filter to store the complete k-mer set, Cleanifier has similar memory requirements to methods that use a sampled minimizer index. At the same time, Cleanifier is more flexible, because it can use different sampling methods on the same index.
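The classification idea described above (extract gapped k-mers from a read, look them up in an index of human gapped k-mers, and decide from the fraction of hits) can be illustrated with a minimal Python sketch. This is not Cleanifier's implementation: the mask notation (`#` keeps a base, `_` skips it), the plain Python `set` used as the index, the `threshold` value, and the `sample_every` parameter standing in for the sampling mode are all assumptions made for illustration.

```python
def gapped_kmers(seq, mask):
    """Yield gapped k-mers: keep bases under '#', skip bases under '_'."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def build_index(references, mask):
    """Collect all gapped k-mers of the reference sequences in a set.

    Assumption: an exact set stands in for the real index structure.
    """
    index = set()
    for ref in references:
        index.update(gapped_kmers(ref, mask))
    return index


def human_fraction(read, index, mask, sample_every=1):
    """Fraction of queried gapped k-mers of the read found in the index.

    sample_every=1 queries every gapped k-mer; larger values query only
    every n-th one (a hypothetical stand-in for a sampling mode).
    """
    kmers = list(gapped_kmers(read, mask))[::sample_every]
    if not kmers:
        return 0.0
    hits = sum(1 for kmer in kmers if kmer in index)
    return hits / len(kmers)


def is_contaminated(read, index, mask, threshold=0.5, sample_every=1):
    """Flag a read when enough of its gapped k-mers match the index."""
    return human_fraction(read, index, mask, sample_every) >= threshold


if __name__ == "__main__":
    mask = "##_##"  # hypothetical spaced-seed shape
    index = build_index(["ACGTACGTAC"], mask)
    for read in ["ACGTACG", "TTTTTTT"]:
        print(read, is_contaminated(read, index, mask))
```

Querying only every n-th gapped k-mer reduces the number of lookups per read at the cost of a coarser hit fraction, which is the trade-off the two filtering modes expose.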
Availability and implementation: Cleanifier is available via GitLab (https://gitlab.com/rahmannlab/cleanifier), PyPI (https://pypi.org/project/cleanifier/), and Bioconda (https://anaconda.org/bioconda/cleanifier). The pre-computed human pangenome index is available at Zenodo (https://doi.org/10.5281/zenodo.15639519).
Title: Cleanifier: contamination removal from microbial sequences using spaced seeds of a human pangenome index. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758600/pdf/
Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf668
Weimin Guo, Yadong Liu, Yadong Wang, Tao Jiang
Summary: Nanopore sequencing technology enables real-time sequencing and is widely used in rapid detection applications. However, existing structural variant (SV) detection tools typically separate sequencing from computation, which limits their timeliness in clinical scenarios. To address this, we introduce cuteSV-OL, a novel framework designed for real-time SV discovery, which can be embedded within nanopore sequencing instruments to analyze data concurrently with its generation. Additionally, cuteSV-OL features a real-time SV detection rate evaluation module, allowing users to terminate sequencing early when appropriate, thereby reducing time and cost. Experimental results show that on a standard desktop computer, cuteSV-OL can perform real-time analysis during sequencing and complete SV calling within min after sequencing ends, achieving performance comparable to offline methods. This approach has the potential to enhance rapid clinical diagnostics.
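The summary does not say how the detection rate evaluation module decides that sequencing can be terminated early. One simple way to picture such a decision is a plateau rule: track how many previously unseen SV calls each new batch of reads contributes, and suggest stopping once the gain stays small for several consecutive batches. The Python sketch below illustrates only that generic rule. The function names, the `min_gain` and `patience` parameters, and the representation of SV calls as hashable identifiers are hypothetical and are not taken from cuteSV-OL.

```python
def should_stop(new_counts, total_counts, min_gain=0.01, patience=3):
    """True when each of the last `patience` batches added fewer than
    `min_gain` (as a fraction of the running total) new SV calls."""
    if len(new_counts) < patience:
        return False
    recent = zip(new_counts[-patience:], total_counts[-patience:])
    return all(total > 0 and new / total < min_gain for new, total in recent)


def monitor(batches, min_gain=0.01, patience=3):
    """Consume batches of SV calls and report when a plateau is reached.

    `batches` is an iterable of iterables of hashable SV identifiers
    (a hypothetical representation). Returns (batch_number, seen_calls),
    with batch_number set to None if no plateau was reached.
    """
    seen = set()
    new_counts, total_counts = [], []
    for batch_number, calls in enumerate(batches, start=1):
        fresh = set(calls) - seen          # calls not seen in earlier batches
        seen |= fresh
        new_counts.append(len(fresh))
        total_counts.append(len(seen))
        if should_stop(new_counts, total_counts, min_gain, patience):
            return batch_number, seen
    return None, seen


if __name__ == "__main__":
    toy_batches = [["sv1", "sv2", "sv3"], ["sv1", "sv4"], ["sv2"], ["sv3"], ["sv1"]]
    stop_at, calls = monitor(toy_batches, min_gain=0.1, patience=2)
    print("suggest stopping after batch:", stop_at, "with", len(calls), "SVs")
```

A larger `patience` makes the rule more conservative, since a single uninformative batch is then not enough to trigger a stop.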
Availability and implementation: cuteSV-OL is released under the MIT license and is available at https://github.com/gwmHIT/cuteSV-OL. It can also be installed via Bioconda or accessed through https://doi.org/10.5281/zenodo.17777436.
Title: cuteSV-OL: a real-time structural variation detection framework for nanopore sequencing devices. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777969/pdf/