Motivation: Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning for nanopores are hampered by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.
Results: We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pretraining and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pretraining each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification in fine-tuning. In this study, we re-evaluated a publicly available nanopore multiplexed protein sensing dataset to iterate on the model, and subsequently measured the Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated that NanoSSL achieved unprecedented performance across four metrics (accuracy, precision, recall, and F1 score) when classifying two Aβ1-42 mutants, E22G and G37R. The self-supervised learning and attention mechanism were verified as the source of the performance gains.
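The fragment-and-mask step described in the Results can be sketched in a few lines. This is a minimal illustration, not the NanoSSL implementation: the function names, the fragment size, the mask ratio, the constant mask value, and the choice to drop trailing samples are all assumptions made for the example, and the masked autoencoder itself is not shown.

```python
import random

def split_into_fragments(event, fragment_size):
    """Split a 1-D signal into non-overlapping fragments of equal size.

    Trailing samples that do not fill a whole fragment are dropped
    (an assumption; padding would be an equally valid choice).
    """
    n_fragments = len(event) // fragment_size
    return [event[i * fragment_size:(i + 1) * fragment_size]
            for i in range(n_fragments)]

def mask_fragments(fragments, mask_ratio, rng, mask_value=0.0):
    """Randomly replace a fraction of the fragments with a constant value.

    Returns the masked fragments and the sorted indices that were masked;
    those indices are what a masked autoencoder would try to reconstruct.
    """
    n_masked = int(round(mask_ratio * len(fragments)))
    masked_idx = sorted(rng.sample(range(len(fragments)), n_masked))
    masked_set = set(masked_idx)
    masked = [[mask_value] * len(frag) if i in masked_set else list(frag)
              for i, frag in enumerate(fragments)]
    return masked, masked_idx

rng = random.Random(0)                      # fixed seed for repeatability
event = [float(i) for i in range(1, 23)]    # toy "translocation event", 22 samples
fragments = split_into_fragments(event, fragment_size=4)   # 5 fragments, 2 samples dropped
masked, masked_idx = mask_fragments(fragments, mask_ratio=0.4, rng=rng)
```

During pretraining, the reconstruction loss would typically be computed only on the fragments listed in `masked_idx`; whether NanoSSL does so is not stated in the abstract.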
Availability and implementation: The main program is available at https://doi.org/10.5281/zenodo.17172822.
Citation: Yong Xie, Jindong Li, Ziyan Zhang, Bin Meng, Shuaijian Dai, Yuchen Zhou, Eamonn Kennedy, Niandong Jiao, Haobin Chen, Zhuxin Dong. NanoSSL: attention mechanism-based self-supervised learning method for protein identification using nanopores. Bioinformatics (Oxford, England) 42(1), 2026-01-02. DOI: 10.1093/bioinformatics/btaf657. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777981/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf677
Sylvère Bastien, Pauline François, Sara Moussadeq, Jérôme Lemoine, Karen Moreau, François Vandenesch
Motivation: Sequence variability can be extremely high, particularly in bacteria due to the rapid accumulation of mutations linked to their high replication rate and environmental selection pressure, which often favors diversifying selection. For most species, there are no automated, computationally efficient tools available for constructing a nonredundant database covering the allelic variability of target proteins.
Results: We have thus developed Bacterial Peptide Sequence Selection, a Nextflow pipeline to define a minimal list of peptide sequences for detecting all variants of a protein of interest.
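The abstract does not say how the minimal peptide list is computed. One common way to frame "a small set of peptides that detects every variant" is as a set-cover problem, and the sketch below uses a greedy heuristic for it. This is an illustration of that framing only, not the BPSS algorithm; the function name, the allele names, and the three-letter peptides are hypothetical.

```python
def greedy_peptide_cover(variant_peptides):
    """Pick peptides greedily until every variant contains at least one pick.

    variant_peptides: dict mapping variant name -> set of candidate peptides.
    Greedy set cover is an approximation and may not return the true minimum.
    """
    # Invert the mapping: peptide -> set of variants it can detect.
    detects = {}
    for variant, peptides in variant_peptides.items():
        for peptide in peptides:
            detects.setdefault(peptide, set()).add(variant)

    # Variants with no candidate peptide cannot be covered and are skipped.
    uncovered = {v for v, peps in variant_peptides.items() if peps}
    chosen = []
    while uncovered:
        # Take the peptide covering the most still-uncovered variants;
        # ties are broken alphabetically so the result is deterministic.
        best = max(sorted(detects), key=lambda p: len(detects[p] & uncovered))
        chosen.append(best)
        uncovered -= detects[best]
    return chosen

# Toy input: four hypothetical alleles and the peptides each one contains.
variants = {
    "allele_1": {"AAK", "CDR"},
    "allele_2": {"AAK", "EFK"},
    "allele_3": {"GHR", "EFK"},
    "allele_4": {"GHR"},
}
panel = greedy_peptide_cover(variants)   # ["AAK", "GHR"]
```

Here two peptides suffice: `AAK` detects alleles 1 and 2, and `GHR` detects alleles 3 and 4.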
Availability and implementation: All code and containers are freely available on GitLab at https://gitbio.ens-lyon.fr/ciri/stapath/bpss or on Zenodo (10.5281/zenodo.16894981) under the GPLv3 open-source license, and on DockerHub at https://hub.docker.com/u/stapath.
Title: BPSS: a Nextflow pipeline for Bacterial Peptide Sequence Selection to detect protein diversity. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797209/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf678
Genereux Akotenou, Asmaa H Hassan, Morad M Mokhtar, Achraf El Allali
Motivation: Understanding the role of transcription factors (TFs) in plants is essential for the study of gene regulation and various biological processes. However, both TF detection and classification remain challenging due to the great diversity and complexity of these proteins. Conventional approaches, such as BLAST, often suffer from high computational complexity and limited performance on less common TF families.
Results: We introduce MegaPlantTF, the first comprehensive machine learning and deep learning framework for the prediction (TF versus non-TF) and classification (family-level) of plant TFs. Our method employs k-mer-based protein representations and a two-stage architecture combining a deep feed-forward neural network with a stacking ensemble classifier. To ensure robust performance assessment, we report micro-, macro-, and weighted-average performance metrics, providing a holistic evaluation of both frequent and underrepresented TF families. Additionally, we employ threshold-based evaluation to calibrate confidence in TF detection. The results show that MegaPlantTF achieves strong accuracy and precision, particularly with a k-mer size of 3 and a classification threshold of 0.5, and maintains stable performance even under stringent thresholds. In addition to the standard cross-validation tests, a use case study on Sorghum bicolor confirms that our method performs strongly in the genome-wide analysis, making it highly suitable for large-scale TF identification and classification tasks. MegaPlantTF represents a novel contribution by integrating k-mer encoding, binary family-specific classifiers, and a two-stage stacking ensemble into a unified, reproducible framework for large-scale plant TF identification and classification.
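Two ideas from the Results lend themselves to a short sketch: a k-mer representation of a protein sequence (here with k = 3) and the difference between micro-, macro-, and weighted-average metrics. The code below is an assumption-laden illustration, not MegaPlantTF code: the function names are hypothetical, the counts are toy numbers rather than reported results, and precision is used as the example metric.

```python
from collections import Counter

def kmer_frequencies(sequence, k=3):
    """Relative frequencies of overlapping k-mers in a protein sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def is_tf(probability, threshold=0.5):
    """Call a protein a TF when the model's probability reaches the threshold."""
    return probability >= threshold

def averaged_precision(per_class):
    """per_class: family -> (true_positives, false_positives, support).

    micro pools all counts, macro averages per-family precision equally,
    weighted averages per-family precision by support (number of true members).
    """
    precision = {f: tp / (tp + fp) if tp + fp else 0.0
                 for f, (tp, fp, _) in per_class.items()}
    tp_all = sum(tp for tp, _, _ in per_class.values())
    fp_all = sum(fp for _, fp, _ in per_class.values())
    support_all = sum(s for _, _, s in per_class.values())
    return {
        "micro": tp_all / (tp_all + fp_all),
        "macro": sum(precision.values()) / len(precision),
        "weighted": sum(precision[f] * s for f, (_, _, s) in per_class.items()) / support_all,
    }

features = kmer_frequencies("MKVMKV", k=3)   # {"MKV": 0.5, "KVM": 0.25, "VMK": 0.25}
# Toy counts: one frequent family and one rare family (made-up numbers).
scores = averaged_precision({"bHLH": (90, 10, 100), "MYB": (1, 1, 4)})
```

With these toy counts the macro average (0.70) is pulled down by the rare family far more than the micro (about 0.89) or weighted (about 0.88) averages, which is why reporting all three gives a fuller picture of underrepresented families.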
Availability and implementation: MegaPlantTF is freely accessible through a public web server available at https://bioinformatics.um6p.ma/MegaPlantTF. The complete source code, including pretrained models and example datasets, is available at https://github.com/Bioinformatics-UM6P/MegaPlantTF.
Title: MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12803907/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf441
Aimin Li, Haotian Zhou, Rong Fei, Juntao Zou, Xiguo Yuan, Yajun Liu, Saurav Mallik, Xinhong Hei, Lei Wang
Motivation: Gene expression plays a crucial role in cell function, and enhancers can regulate gene expression precisely. Therefore, accurate prediction of enhancers is particularly critical. However, existing prediction methods have low accuracy or rely on fixed multiple epigenetic signals, which may not always be available.
Results: We propose a two-stage framework that accurately predicts enhancers by flexibly combining multiple epigenetic signals. In the first stage, we designed a Blending-KAN model, which integrates the results of various base classifiers and employs Kolmogorov-Arnold Networks (KAN) as a meta-classifier to predict enhancers based on flexible combinations of multiple epigenetic signals. In the second stage, we developed a Stacking-Auto model, which extracts sequence features using DNABERT-2 and locates the enhancers based on the Stacking strategy and the AutoGluon framework. The accuracy of the Blending-KAN model reached 99.69 ± 0.11% when five epigenetic signals were used. In cross-cell line prediction, the accuracy was greater than or equal to 93.72%. With added Gaussian noise, the model still maintained an accuracy of 98.74 ± 0.03%. In the second stage, the accuracy of the Stacking-Auto model was 80.50%, which is better than that of 17 existing methods. The results show that our models can be flexibly used to predict and locate enhancers utilizing a combination of multiple epigenetic signals.
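The blending idea in the first stage (base classifiers whose outputs feed a meta-classifier) can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the one-signal soft-threshold base models, the plain logistic regression standing in for the KAN meta-classifier, the signal names `H3K27ac` and `H3K4me1`, and the toy values. It is not the Hi-Enhancer implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_base(samples, signal):
    """One-signal base classifier: soft threshold at the midpoint of the class means."""
    pos = [x[signal] for x, y in samples if y == 1]
    neg = [x[signal] for x, y in samples if y == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: sigmoid(10.0 * (x[signal] - midpoint))

def fit_meta(features, labels, lr=0.5, epochs=500):
    """Logistic regression by batch gradient descent (stand-in for a KAN)."""
    w, b = [0.0] * len(features[0]), 0.0
    n = len(labels)
    for _ in range(epochs):
        errors = [sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
                  for f, y in zip(features, labels)]
        w = [wi - lr * sum(e * f[j] for e, f in zip(errors, features)) / n
             for j, wi in enumerate(w)]
        b -= lr * sum(errors) / n
    return lambda f: sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)

def fit_blending(train, holdout, signals):
    """Fit base models on `train`, then fit the meta-model on their hold-out outputs."""
    base = [fit_base(train, s) for s in signals]
    meta_features = [[m(x) for m in base] for x, _ in holdout]
    meta = fit_meta(meta_features, [y for _, y in holdout])
    return lambda x: meta([m(x) for m in base])

def toy_samples(offset):
    """Deterministic toy data: label 1 has high signal values, label 0 has low ones."""
    samples = []
    for i in range(10):
        step = 0.01 * i + offset
        samples.append(({"H3K27ac": 0.8 + step, "H3K4me1": 0.7 + step}, 1))
        samples.append(({"H3K27ac": 0.1 + step, "H3K4me1": 0.2 + step}, 0))
    return samples

train, holdout = toy_samples(0.0), toy_samples(0.005)
both_signals = fit_blending(train, holdout, ["H3K27ac", "H3K4me1"])
one_signal = fit_blending(train, holdout, ["H3K27ac"])   # a different signal combination
```

Because the meta-model only sees base-model outputs, the same code can be refit for whichever subset of signals is available, which is the sense of "flexible combinations" illustrated here.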
Availability and implementation: The source code is available at https://github.com/emanlee/Hi-Enhancer and https://doi.org/10.6084/m9.figshare.29262158.v1.
Title: Hi-Enhancer: a two-stage framework for prediction and localization of enhancers based on Blending-KAN and Stacking-Auto models. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758598/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf585
Nir Nitskansky, Kessem Clein, Barak Raveh
Motivation: Biomolecules undergo dynamic transitions among metastable states to carry out their biological functions. Markov State Models (MSMs) effectively capture these metastable states and transitions at a defined temporal scale. However, biomolecular dynamics typically span multiple temporal scales, ranging from fast atomic vibrations to slower conformational changes and folding events.
Results: We introduce multiscale Markov State Models (mMSMs), which capture biomolecular dynamics across multiple temporal resolutions simultaneously via a hierarchy of MSMs, and mMSM-explore, an unsupervised algorithm for generating mMSMs through multiscale adaptive sampling with on-the-fly identification of temporally metastable states. We benchmark our method on a toy system with nested energy minima; on alanine dipeptide, first with and then without assuming prior knowledge of its two reaction coordinates; and finally, on a fast-folding 35-residue miniprotein, where we map folding pathways across scales. We demonstrate efficient mapping of energy landscapes, correct representation of multiscale hierarchies and transition states, accurate inference of stationary probabilities and transition kinetics, as well as de novo identification of underlying slow, intermediate, and fast reaction coordinates. mMSMs reveal how dynamic processes at different scales contribute collectively to the functional mechanisms of biomolecular machines.
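For readers unfamiliar with the building block, a single-scale MSM can be estimated from a discretized trajectory by counting transitions at a fixed lag time and row-normalizing. The sketch below shows only that basic step plus a power-iteration estimate of the stationary probabilities. It does not implement the mMSM hierarchy or mMSM-explore; the function names and the toy two-state trajectory are assumptions.

```python
def transition_matrix(trajectory, n_states, lag=1):
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(trajectory[:-lag], trajectory[lag:]):
        counts[i][j] += 1
    matrix = []
    for state, row in enumerate(counts):
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # A state with no outgoing counts gets a self-loop so the row sums to 1.
            matrix.append([1.0 if k == state else 0.0 for k in range(n_states)])
    return matrix

def stationary_distribution(matrix, iterations=1000):
    """Approximate the stationary distribution by repeated left-multiplication."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi

# Toy trajectory over two states (0 and 1), one frame per entry.
trajectory = [0, 0, 0, 0, 1, 1, 0]
T = transition_matrix(trajectory, n_states=2, lag=1)   # [[0.75, 0.25], [0.5, 0.5]]
pi = stationary_distribution(T)                        # approximately [2/3, 1/3]
```

Re-estimating with a larger `lag` coarsens the time resolution of the model, which is the single knob that a multiscale hierarchy replaces with several coupled levels.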
Availability and implementation: Python code and instructions are available at https://github.com/ravehlab/mMSM.
Title: Building multiscale Markov state models by systematic mapping of temporal communities. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797069/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf675
Tim Stohn, Roderick A P M van Eijl, Klaas W Mulder, Lodewyk F A Wessels, Evert Bosdriesz
Motivation: Signal transduction networks regulate many essential biological processes and are frequently aberrant in diseases such as cancer. A mechanistic understanding of such networks, and how they differ between cell populations, is essential to design effective treatment strategies. Typically, such networks are computationally reconstructed based on systematic perturbation experiments, followed by quantification of signaling protein activity. Recent technological advances now allow for the quantification of the activity of many (signaling) proteins simultaneously in single cells. This makes it feasible to reconstruct or quantify signaling networks without performing systematic perturbations.
Results: Here, we introduce single-cell modular response analysis (scMRA) and single-cell comparative network reconstruction (scCNR) to derive signal transduction networks by exploiting the heterogeneity of single-cell (phospho-)protein measurements. The methods treat stochastic variation in total protein abundances as natural perturbation experiments, whose effects propagate through the network and hence facilitate the reconstruction and quantification of the underlying signaling network. scCNR reconstructs cell population-specific networks, where cells from different populations have the same underlying topology, but the interaction strengths can differ between populations. We extensively validated scMRA and scCNR on simulated data, and applied them to unpublished data of (phospho-)protein measurements of EGFR-inhibitor-treated keratinocytes to recover signaling differences downstream of EGFR. scCNR will help to unravel the mechanistic signaling differences between cell populations, and will subsequently guide the development of well-informed treatment strategies.
Availability and implementation: The code used for scCNR in this study has been deposited on Zenodo (https://doi.org/10.5281/zenodo.17600937) and is also available as a Python module at https://github.com/ibivu/scmra. Additionally, data and code to reproduce all figures are available at https://github.com/tstohn/scmra_analysis.
Title: Reconstructing and comparing signal transduction networks from single-cell protein quantification data. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797212/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf562
Jiyuan Yang, Nana Wei, Yang Qu, Congcong Hu, Weiwei Zhang, Lin Liu, Hua-Jun Wu, Xiaoqi Zheng
Motivation: Spatial transcriptomics (ST) technologies provide valuable insights into cellular heterogeneity by simultaneously acquiring both gene expression profiles and cellular location information. However, the limited diversity and accuracy of "gold standard" datasets hinder the effective and fair benchmarking of the rapidly growing number of ST analysis tools.
Results: To address this issue, we propose Spider, a flexible and comprehensive framework for simulating ST data without requiring real ST data as a reference. By characterizing the spatial patterns using cell type proportions and a transition matrix between adjacent cells, Spider can produce more realistic and diverse simulated data and offer enhanced modeling flexibility compared to existing simulation methods. Additionally, Spider provides interactive features for customizing the spatial domain, such as zone segmentation and integration of histology imaging data. Benchmark analyses demonstrate that Spider outperforms other simulation tools in preserving the spatial characteristics of real ST data and facilitating the evaluation of downstream analysis methods. Spider is implemented in Python and available at https://github.com/YANG-ERA/Spider.
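To make "cell type proportions and a transition matrix between adjacent cells" concrete, the sketch below generates cell-type labels along a one-dimensional strip: the first cell is drawn from the proportions, and each subsequent cell is drawn from the transition row of its left neighbor. This Markov-chain toy is an assumption chosen for brevity, not Spider's algorithm, which operates on spatial layouts and is not described in detail in the abstract. The function name, the cell type names, and all numbers are hypothetical.

```python
import random

def simulate_strip(n_cells, proportions, transition, rng):
    """Sample cell-type labels along a 1-D strip of adjacent cells.

    proportions: cell type -> probability for the first cell.
    transition:  cell type -> {neighbor type -> probability}.
    """
    types = list(proportions)
    labels = rng.choices(types, weights=[proportions[t] for t in types], k=1)
    while len(labels) < n_cells:
        row = transition[labels[-1]]   # probabilities conditioned on the left neighbor
        labels.append(rng.choices(types, weights=[row[t] for t in types], k=1)[0])
    return labels

proportions = {"tumor": 0.3, "stroma": 0.7}
# High self-transition probabilities produce contiguous patches of one type.
transition = {
    "tumor": {"tumor": 0.9, "stroma": 0.1},
    "stroma": {"tumor": 0.05, "stroma": 0.95},
}
strip = simulate_strip(30, proportions, transition, random.Random(0))
```

Lowering the diagonal entries of `transition` mixes the cell types more finely, while raising them yields larger homogeneous domains.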
Availability and implementation: All code and simulated ST data in this paper are publicly available at https://github.com/YANG-ERA/Spider.
Title: Spider: a flexible and unified framework for simulating spatial transcriptomics data. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790819/pdf/
Pub Date: 2026-01-02; DOI: 10.1093/bioinformatics/btaf665
Massimiliano S Tagliamonte, Abhinav Sharma, Alberto Riva, Monika Moir, Marco Salemi, Cheryl Baxter, Tulio de Oliveira, Carla N Mavian, Eduan Wilkinson
Summary: Next Generation Sequencing is widely deployed in cholera-endemic regions, yet an end-to-end reproducible pipeline that unifies read QC, filtering, reference mapping, variant calling/annotation, recombination screening, extraction of parsimony-informative sites/variant codons, and phylogenetic inference for downstream phylodynamic and epidemiological analyses has been lacking, slowing outbreak investigation and public health response. CholeraSeq is a high-throughput genomics pipeline for cholera genomic surveillance. It ingests consensus genomes, short-read sequence data, and draft assemblies, and scales seamlessly from local to cloud environments. To accelerate epidemiological context placement of new outbreak strains, we provide a curated, ready-to-use core genome alignment compiled from public data, enabling flexible, fast integration of new samples for outbreak investigations.
Availability and implementation: CholeraSeq is freely available on GitHub at https://github.com/CERI-KRISP/CholeraSeq. It is implemented in Nextflow with a modular design that builds on the nf-core community standards.
Title: CholeraSeq: a comprehensive genomic pipeline for cholera surveillance and near real-time outbreak investigation. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790814/pdf/
Pub Date: 2026-01-02, DOI: 10.1093/bioinformatics/btaf632
Jens Zentgraf, Johanna Elena Schmitz, Sven Rahmann
Motivation: The first step when working with DNA data from human-derived microbiomes is to remove human contamination, for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data that partly contain human data cannot easily be processed further or published. Second, human contamination may cause problems in downstream analysis, such as metagenomic binning or genome assembly. For large-scale metagenomics projects, fast and accurate removal of human contamination is therefore critical.
Results: We introduce Cleanifier, a fast, memory-frugal, alignment-free tool for detecting and removing human contamination based on gapped k-mers, or spaced seeds. Cleanifier uses a pangenome index of known human gapped k-mers; alternative references can also be created and used. Reads are classified and filtered according to their gapped k-mer content. Cleanifier supports two filtering modes: one that queries all gapped k-mers and one that queries only a sample of them. A comparison of Cleanifier with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method with comparable accuracy. When using a probabilistic Cuckoo filter to store the complete k-mer set, Cleanifier has similar memory requirements to methods that use a sampled minimizer index. At the same time, Cleanifier is more flexible, because it can use different sampling methods on the same index.
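The idea of classifying reads by their gapped k-mer content can be illustrated with a small Python sketch. It is a toy model, not Cleanifier's implementation or API: the mask notation (`#` for a position that is kept, `_` for one that is skipped), the function names, the exact `set` used in place of a Cuckoo filter, the fixed-stride sampling, and the 0.5 threshold are all assumptions chosen for illustration. Reverse complements are ignored for brevity.

```python
from typing import Iterator, Set


def gapped_kmers(seq: str, mask: str, step: int = 1) -> Iterator[str]:
    """Yield gapped k-mers: slide the mask over seq, keep bases under '#'."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    span = len(mask)
    # step=1 queries every window; step>1 mimics a simple sampling mode.
    for start in range(0, len(seq) - span + 1, step):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def build_index(reference: str, mask: str) -> Set[str]:
    """Collect all gapped k-mers of a reference (exact set, no filter)."""
    return set(gapped_kmers(reference, mask))


def hit_fraction(read: str, index: Set[str], mask: str, step: int = 1) -> float:
    """Fraction of the read's queried gapped k-mers found in the index."""
    kmers = list(gapped_kmers(read, mask, step))
    if not kmers:
        return 0.0  # read shorter than the mask span
    return sum(k in index for k in kmers) / len(kmers)


def is_contaminant(read: str, index: Set[str], mask: str,
                   threshold: float = 0.5, step: int = 1) -> bool:
    """Flag a read when enough of its gapped k-mers match the index."""
    return hit_fraction(read, index, mask, step) >= threshold


mask = "##_#"
index = build_index("ACGTACGTTAGC", mask)
flag_match = is_contaminant("GTACGTTA", index, mask)            # substring of reference
flag_other = is_contaminant("AAAAAAAA", index, mask)            # unrelated read
flag_sampled = is_contaminant("GTACGTTA", index, mask, step=2)  # sampled query
```

Because the index is built once from all gapped k-mers, the same index serves both the full query (`step=1`) and the sampled query (`step=2`); only the read side changes.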
Availability and implementation: Cleanifier is available via GitLab (https://gitlab.com/rahmannlab/cleanifier), PyPI (https://pypi.org/project/cleanifier/), and Bioconda (https://anaconda.org/bioconda/cleanifier). The pre-computed human pangenome index is available at Zenodo (https://doi.org/10.5281/zenodo.15639519).
Title: Cleanifier: contamination removal from microbial sequences using spaced seeds of a human pangenome index. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758600/pdf/
Pub Date: 2026-01-02, DOI: 10.1093/bioinformatics/btaf668
Weimin Guo, Yadong Liu, Yadong Wang, Tao Jiang
Summary: Nanopore sequencing technology enables real-time sequencing and is widely used in rapid detection applications. However, in clinical scenarios, existing structural variant (SV) detection tools typically separate sequencing from computation, limiting their timeliness for clinical applications. To address this, we introduce cuteSV-OL, a novel framework designed for real-time SV discovery, which can be embedded within nanopore sequencing instruments to analyze data concurrently with its generation. Additionally, cuteSV-OL features a real-time SV detection rate evaluation module, allowing users to terminate sequencing early when appropriate, thereby reducing time and cost. Experimental results show that on a standard desktop computer, cuteSV-OL can perform real-time analysis during sequencing and complete SV calling within min after sequencing ends, achieving performance comparable to offline methods. This approach has the potential to enhance rapid clinical diagnostics.
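One way to picture an early-termination decision is a plateau rule on the cumulative number of SVs called so far. The Python sketch below is purely illustrative: the summary does not describe how cuteSV-OL evaluates its detection rate, so the function name `should_stop`, the batch window, and the 1% relative-gain threshold are all assumptions and not the tool's method.

```python
from typing import Sequence


def should_stop(cumulative_sv_counts: Sequence[int],
                window: int = 3, min_gain: float = 0.01) -> bool:
    """Suggest stopping when SV discovery has plateaued.

    `cumulative_sv_counts[i]` is the total number of SVs called after
    batch i. Stopping is suggested once the relative increase over the
    last `window` batches falls below `min_gain` (hypothetical rule).
    """
    if len(cumulative_sv_counts) <= window:
        return False  # not enough history to judge a plateau
    previous = cumulative_sv_counts[-window - 1]
    latest = cumulative_sv_counts[-1]
    if previous == 0:
        return False  # nothing discovered yet, keep sequencing
    return (latest - previous) / previous < min_gain


still_growing = should_stop([100, 500, 900, 1000])
plateaued = should_stop([100, 500, 900, 1000, 1004, 1006, 1008])
```

In the second series the count rises from 1000 to 1008 over the last three batches, a gain of 0.8%, so the rule suggests stopping; the first series is still growing steeply.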
Availability and implementation: cuteSV-OL is released under the MIT license and is available at https://github.com/gwmHIT/cuteSV-OL. It can also be installed via Bioconda or accessed through https://doi.org/10.5281/zenodo.17777436.
Title: cuteSV-OL: a real-time structural variation detection framework for nanopore sequencing devices. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777969/pdf/