Pub Date: 2026-01-03. DOI: 10.1093/bioinformatics/btag017
Manuel Mangoni, Salvatore Daniele Bianco, Francesco Petrizzelli, Michele Pieroni, Pietro Hiram Guzzi, Viviana Caputo, Tommaso Biagini, Tommaso Mazza
Motivation: Molecular dynamics (MD) simulations provide detailed atomistic insights into biomolecular processes, but comparing independent trajectories remains challenging due to stochastic divergence. Misaligned simulations can obscure shared mechanisms or exaggerate differences, limiting reproducibility and mechanistic interpretation. A generalizable, unsupervised method for synchronizing and comparing MD trajectories across systems and conditions is, therefore, needed.
Results: We introduce NetMD, a computational framework that synchronizes and analyzes MD trajectories by integrating graph-based representations with dynamic time warping. Trajectory frames are converted into residue-contact graphs, entropy-filtered to retain variable interactions, and embedded as low-dimensional vectors. NetMD aligns these vectorized trajectories through time-warping barycenter averaging, generating a consensus trajectory while pruning outlier simulations. Applied to transporters, demethylases, and large protein complexes relevant to neurological disease pathways and cancer, NetMD revealed shared multiphase dynamics and identified mutation- or ligand-specific deviations. This unsupervised, time-resolved approach enables direct comparison of MD ensembles across heterogeneous conditions. NetMD is robust and broadly applicable, providing a tool for uncovering conserved patterns and critical divergences in biomolecular dynamics.
Availability and implementation: NetMD is freely available at https://github.com/mazzalab/NetMD.
Title: Unsupervised synchronization of molecular dynamics trajectories via graph embedding and time warping. Journal: Bioinformatics (Oxford, England).
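The NetMD abstract above builds on dynamic time warping (DTW) to align embedded trajectories. The sketch below shows only plain pairwise DTW on toy frame vectors, as an illustration of why time warping tolerates trajectories that move through the same states at different speeds. It is not NetMD's code and does not implement the barycenter averaging step; the function name `dtw_distance` and the toy data are assumptions made for this example.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of frame vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of the first i frames of a with the first j of b
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


# Toy embedded trajectories: same path, but the second lingers in the first state.
traj_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
traj_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
print(dtw_distance(traj_a, traj_b))  # 0.0: the extra frame is absorbed by warping
```

A frame-by-frame comparison of these two toy trajectories would report a difference, whereas the warped alignment reports none, which is the kind of desynchronization the abstract describes.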
Pub Date: 2026-01-03. DOI: 10.1093/bioinformatics/btag021
Huda Ahmad, Hannah M Doherty, Sam T Benedict, James R J Haycocks, Ge Zhou, Patrick J Moynihan, Danesh Moradigaravand, Manuel Banzhaf
Motivation: Chemical genomics is a powerful high-throughput approach to systematically link phenotypes to genotypes. However, the vast datasets generated remain challenging to explore due to the lack of integrated, interactive tools for visualization and analysis. Existing workflows often require multiple independent software tools, limiting data accessibility and collaboration. Therefore, we created a user-friendly platform that enables efficient exploration and sharing of chemical genomics data.
Results: We developed ChemGenXplore, a web-based Shiny application designed to streamline the visualization and analysis of chemical genomic screens. It offers two primary functionalities: one for exploring pre-implemented datasets and another for analysing user-uploaded datasets. ChemGenXplore enables users to visualize phenotypic profiles, assess gene-gene and condition-condition correlations, perform GO and KEGG enrichment analysis, and generate customizable, interactive heatmaps. To further support collaborative research, ChemGenXplore also facilitates the comparative analysis of chemical genomic and other omics datasets. By consolidating these features into a single interactive and accessible tool, ChemGenXplore facilitates data sharing, enhances reproducibility, and promotes collaboration within the research community.
Availability and implementation: ChemGenXplore is freely accessible as a web application at https://chemgenxplore.kaust.edu.sa/. Source code and documentation, including instructions for local installation, are provided on GitHub (https://github.com/Hudaahmadd/ChemGenXplore). A Docker image is also available on DockerHub (https://hub.docker.com/r/hudaahmad/chemgenxplore) to ensure reproducibility and simplify installation.
Title: ChemGenXplore: an interactive tool for exploring and analysing chemical genomic data. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12872398/pdf/
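The gene-gene and condition-condition correlations mentioned in the ChemGenXplore abstract can be illustrated with a small matrix of fitness scores. This is a minimal sketch that assumes Pearson correlation and a genes-by-conditions layout; ChemGenXplore itself is a Shiny (R) application, and neither the correlation measure nor the toy values below are taken from it.

```python
import numpy as np

# Hypothetical fitness scores: rows are genes, columns are conditions.
scores = np.array([
    [1.0, 2.0, 3.0],   # gene 0
    [2.0, 4.0, 6.0],   # gene 1: same profile shape as gene 0
    [3.0, 2.0, 1.0],   # gene 2: opposite trend to gene 0
])

# np.corrcoef treats each row as one variable, so rows give gene-gene correlations.
gene_corr = np.corrcoef(scores)
# Transposing makes conditions the variables instead.
condition_corr = np.corrcoef(scores.T)

print(np.round(gene_corr, 2))
```

Genes with similar phenotypic profiles (genes 0 and 1 here) correlate near 1, while opposing profiles correlate near -1.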
Pub Date: 2026-01-03. DOI: 10.1093/bioinformatics/btag011
Darya Shlyk, Lawrence Hunter
Motivation: Biomedical Entity Linking (BEL) maps mentions in biomedical text to standardized identifiers, enabling structured data integration and downstream knowledge discovery. However, current BEL systems remain fundamentally constrained by the recall of the initial candidate pool, where suboptimal retrieval limits the overall effectiveness of the normalization pipeline.
Results: We present the first systematic evaluation of Generative Relevance Feedback (GRF) for enhancing candidate retrieval in state-of-the-art BEL systems. GRF leverages large language models (LLMs) to enrich the expressiveness of the mention in a zero-shot fashion. We assess GRF's impact under two scenarios (direct linking prediction and candidate generation in cascading normalization pipelines) and analyze its sensitivity to different LLMs, feedback types, and integration strategies. Experiments across eight corpora and four biomedical knowledge bases demonstrate that integrating GRF significantly improves both accuracy and recall, thereby increasing the upper bound on normalization performance. Our findings highlight GRF as an efficient, model-agnostic solution and underscore its potential as a key component for advancing BEL.
Availability and implementation: The code to reproduce our experiments can be found at: https://doi.org/10.5281/zenodo.17853541.
Title: Improving biomedical entity linking with generative relevance feedback. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866626/pdf/
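The BEL abstract above argues that the recall of the candidate pool caps what any downstream normalization step can achieve. A minimal sketch of recall@k makes that bound concrete. The function name, the identifiers, and the toy candidate lists are all hypothetical and do not come from the paper's code.

```python
def recall_at_k(candidates_per_mention, gold_ids, k):
    """Fraction of mentions whose gold identifier appears in the top-k candidates."""
    if not gold_ids:
        raise ValueError("at least one mention is required")
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(candidates_per_mention, gold_ids)
    )
    return hits / len(gold_ids)


# Hypothetical ranked candidate lists for three mentions, with made-up identifiers.
candidates = [["D001", "D002"], ["D010", "D011"], ["D020"]]
gold = ["D002", "D999", "D020"]

print(recall_at_k(candidates, gold, k=1))  # 1 of 3 mentions has its gold ID at rank 1
print(recall_at_k(candidates, gold, k=2))  # 2 of 3 once the top two are considered
```

The second mention's gold identifier is never retrieved, so no reranker operating on these candidates could link it correctly; improving retrieval is what raises that ceiling.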
Pub Date: 2026-01-03. DOI: 10.1093/bioinformatics/btag012
Kaylee D Rich, James D Wasmuth
Motivation: Molecular mimicry is used by pathogens to evade the host immune system and manipulate other host cellular processes. It is often mediated by short motifs in non-homologous proteins, whose detection challenges the sensitivity and specificity of existing bioinformatics tools.
Results: We present mimicDetector, a k-mer-based pipeline for identifying protein-level molecular mimicry between pathogens and their hosts. Applied to 17 globally important pathogens, mimicDetector identified a broad and biologically plausible set of mimicry candidates, including helminth proteins mimicking components of the human complement system and a Leishmania infantum mimic of Reticulon-4, a regulator of immune cell recruitment.
Availability and implementation: mimicDetector is freely available at https://github.com/kayleerich/mimicDetector/, implemented in Python and Snakemake, and compatible with Unix-based systems.
Title: mimicDetector: a pipeline for protein motif mimicry detection in host-pathogen interactions. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881831/pdf/
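The k-mer idea behind the mimicDetector abstract can be sketched as a search for short peptides shared by a pathogen protein and a host protein. The version below uses exact matching only, which is a simplifying assumption; the actual pipeline may score similarity differently, and the function names and sequences here are invented for illustration.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in a protein sequence (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(pathogen_seq, host_seq, k=5):
    """k-mers present in both sequences; exact matching is an assumption of this sketch."""
    return kmers(pathogen_seq, k) & kmers(host_seq, k)


# Made-up sequences that happen to share the peptide RGDSP.
pathogen = "MKTLLRGDSPAA"
host = "GGHRGDSPLV"
print(sorted(shared_kmers(pathogen, host, k=5)))  # ['RGDSP']
```

Because the comparison works on short windows rather than whole-protein alignments, it can flag a shared motif even when the two proteins are otherwise unrelated.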
Pub Date: 2026-01-03. DOI: 10.1093/bioinformatics/btag023
Siera Martinez, Tushar Sharma, Luke Johnson, Allen Kim, Vania Ballesteros Prieto, Hovhannes Arestakesyan, Sunisha Harish, Jewel Dias, Joseph Goldfrank, Nathan Edwards, Anelia Horvath
Motivation: Accurately characterizing expressed genetic variation at the single-cell level is essential for understanding transcriptional heterogeneity, allelic regulation, and mutational dynamics within complex tissues. However, few tools enable comprehensive visualization and quantitative analysis of expressed variants across individual cells.
Results: scSNViz is an R package for the exploration, quantification, and visualization of expressed single-nucleotide variants (SNVs) from cell-barcoded single-cell RNA sequencing (scRNA-seq) data. The software supports estimation of variant allele fractions, clustering of SNV expression profiles, and 2D and 3D visualization of individual SNVs or user-defined SNV groups. Beyond visualization, scSNViz facilitates investigation of cell-, cluster-, or lineage-specific variant expression patterns, as well as allelic dynamics including imprinting, random allele inactivation, and transcriptional bursting. It interoperates seamlessly with established single-cell frameworks (Seurat for clustering, Slingshot for trajectory inference, scType for cell-type annotation, and CopyKat for copy-number profiling), enabling integrative multi-omic analyses of expressed variation.
Availability and implementation: scSNViz is implemented in R and freely available at https://github.com/HorvathLab/scSNViz (DOI: 10.5281/zenodo.17307516). The package includes comprehensive documentation and example workflows designed for users with limited bioinformatics experience.
Title: scSNViz: visualization and analysis of cell-specific expressed SNVs. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866635/pdf/
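The variant allele fraction that the scSNViz abstract refers to is conventionally the share of reads at a site that carry the variant allele, computed per cell. A minimal sketch follows, written in Python for illustration although scSNViz itself is an R package; the function name, the `min_reads` cutoff, and the barcodes are assumptions of this example rather than part of the package.

```python
def variant_allele_fraction(n_var, n_ref, min_reads=1):
    """VAF = variant reads / (variant + reference reads); None when coverage is too low."""
    total = n_var + n_ref
    if total < min_reads:
        return None  # no usable coverage for this SNV in this cell
    return n_var / total


# Hypothetical (variant, reference) read counts at one SNV for three cell barcodes.
cells = {"AAACCTG": (3, 1), "AAACGGG": (0, 4), "AAAGATG": (0, 0)}
vaf = {bc: variant_allele_fraction(v, r) for bc, (v, r) in cells.items()}
print(vaf)  # {'AAACCTG': 0.75, 'AAACGGG': 0.0, 'AAAGATG': None}
```

Keeping uncovered cells as `None` rather than 0 separates "no variant expression" from "no data", which matters when per-cell values are plotted or clustered.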
Pub Date: 2026-01-02. DOI: 10.1093/bioinformatics/btaf658
Jennifer Fouquier, Maggie Stanislawski, John O'Connor, Ashley Scadden, Catherine Lozupone
Motivation: Longitudinal microbiome studies (LMS) are increasingly common but have analytic challenges including nonindependent data requiring mixed-effects models. Furthermore, large amounts of data motivate exploratory analysis to identify factors related to outcome variables. Although change analysis (i.e. calculating feature changes between timepoints) can be powerful, how to best conduct these analyses is often unclear. For example, observational LMS measurements show natural fluctuations, so baseline might not be a reference of primary interest, whereas for interventional LMS, baseline is typically a key reference point, often indicating the start of treatment.
Results: To address these challenges, a feature selection workflow, called EXPLANA (EXPLoratory ANAlysis), was developed for LMS that supports numerical and categorical data, and also accommodates cross-sectional studies. Machine learning methods were combined with different types of change calculations and downstream interpretation methods to identify statistically meaningful variables and explain their relationship to outcomes. EXPLANA generates an interactive report that textually and graphically summarizes methods and results. EXPLANA had good performance on simulated longitudinal data, with a balanced accuracy score of 0.91 (range: 0.79-1.00, SD = 0.05), outperformed an existing tool, QIIME 2 feature-volatility (balanced accuracy: 0.95 versus 0.56) and identified novel order-dependent categorical feature changes (e.g. different effect for A_B versus B_A). EXPLANA is broadly applicable and simplifies analytics for identifying features related to outcomes of interest.
Availability and implementation: Software is available at https://github.com/JTFouquier/explana and https://zenodo.org/records/17478745 (10.5281/zenodo.17478744). Documentation and demos are available at www.explana.io.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: EXPLANA: a user-friendly workflow for EXPLoratory ANAlysis and feature selection in cross-sectional and longitudinal microbiome studies. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12766912/pdf/
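The change calculations described in the EXPLANA abstract can be sketched for one subject measured at several timepoints: numeric features become differences against a chosen reference, and categorical features become ordered labels such as A_B. The function names and the `reference` options below are assumptions made for this illustration, not EXPLANA's interface.

```python
def numeric_changes(values, reference="previous"):
    """Differences between timepoints, against the previous or the first (baseline) value."""
    if reference == "previous":
        pairs = zip(values, values[1:])
    elif reference == "first":
        pairs = ((values[0], later) for later in values[1:])
    else:
        raise ValueError("reference must be 'previous' or 'first'")
    return [later - earlier for earlier, later in pairs]


def categorical_changes(labels):
    """Order-dependent change labels, so that A_B and B_A stay distinct."""
    return [f"{earlier}_{later}" for earlier, later in zip(labels, labels[1:])]


print(numeric_changes([2.0, 3.5, 3.0]))                     # [1.5, -0.5]
print(numeric_changes([2.0, 3.5, 3.0], reference="first"))  # [1.5, 1.0]
print(categorical_changes(["A", "B", "A"]))                 # ['A_B', 'B_A']
```

The two reference modes mirror the distinction drawn in the abstract: baseline is the natural reference for interventional studies, whereas consecutive-timepoint changes may suit observational data with natural fluctuations.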
Motivation: Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning for nanopores are hampered by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.
Results: We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pretraining and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pretraining each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification in fine-tuning. In this study, we retested a publicly available nanopore multiplexed protein sensing dataset for model iteration, and subsequently measured the Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated that NanoSSL achieved unprecedented performance across four metrics (accuracy, precision, recall, and F1 score) when classifying two Aβ1-42 mutants, E22G and G37R. The self-supervised learning and attention mechanism were verified as the source of performance gains.
Availability and implementation: The main program is available at https://doi.org/10.5281/zenodo.17172822.
Title: NanoSSL: attention mechanism-based self-supervised learning method for protein identification using nanopores.
Pub Date: 2026-01-02. DOI: 10.1093/bioinformatics/btaf657
Yong Xie, Jindong Li, Ziyan Zhang, Bin Meng, Shuaijian Dai, Yuchen Zhou, Eamonn Kennedy, Niandong Jiao, Haobin Chen, Zhuxin Dong
Journal: Bioinformatics (Oxford, England), 42(1). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777981/pdf/
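The pretraining step in the NanoSSL abstract splits each translocation event into equal, non-overlapping fragments and hides a random subset for reconstruction. The sketch below covers only that data-preparation step on a toy signal. The fragment size, mask ratio, and the choice to drop trailing samples that do not fill a fragment are assumptions of this example, and the masked autoencoder itself is not shown.

```python
import numpy as np


def fragment_and_mask(event, fragment_size, mask_ratio, rng):
    """Split a 1-D event into equal fragments and pick a random subset to mask."""
    event = np.asarray(event, dtype=float)
    n_frag = len(event) // fragment_size
    # Assumption: trailing samples that do not fill a whole fragment are dropped.
    fragments = event[: n_frag * fragment_size].reshape(n_frag, fragment_size)
    n_masked = int(round(mask_ratio * n_frag))
    masked_idx = rng.choice(n_frag, size=n_masked, replace=False)
    visible = np.ones(n_frag, dtype=bool)
    visible[masked_idx] = False  # False marks fragments hidden from the encoder
    return fragments, visible


rng = np.random.default_rng(0)
event = np.arange(12.0)  # stand-in for a current trace of one translocation event
fragments, visible = fragment_and_mask(event, fragment_size=3, mask_ratio=0.5, rng=rng)
print(fragments.shape, int(visible.sum()))  # (4, 3) 2
```

In a masked-autoencoder setup, the model would see only `fragments[visible]` and be trained to reconstruct `fragments[~visible]`, so no labels are needed at this stage.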
Pub Date: 2026-01-02. DOI: 10.1093/bioinformatics/btaf678
Genereux Akotenou, Asmaa H Hassan, Morad M Mokhtar, Achraf El Allali
Motivation: Understanding the role of transcription factors (TFs) in plants is essential for the study of gene regulation and various biological processes. However, both TF detection and classification remain challenging due to the great diversity and complexity of these proteins. Conventional approaches, such as BLAST, often suffer from high computational complexity and limited performance on less common TF families.
Results: We introduce MegaPlantTF, the first comprehensive machine learning and deep learning framework for the prediction (TF versus non-TF) and classification (family-level) of plant TFs. Our method employs k-mer-based protein representations and a two-stage architecture combining a deep feed-forward neural network with a stacking ensemble classifier. To ensure robust performance assessment, we report micro-, macro-, and weighted-average performance metrics, providing a holistic evaluation of both frequent and underrepresented TF families. Additionally, we employ threshold-based evaluation to calibrate confidence in TF detection. The results show that MegaPlantTF achieves strong accuracy and precision, particularly with a k-mer size of 3 and a classification threshold of 0.5, and maintains stable performance even under stringent thresholds. In addition to the standard cross-validation tests, a use case study on Sorghum bicolor confirms that our method performs strongly in the genome-wide analysis, making it highly suitable for large-scale TF identification and classification tasks. MegaPlantTF represents a novel contribution by integrating k-mer encoding, binary family-specific classifiers, and a two-stage stacking ensemble into a unified, reproducible framework for large-scale plant TF identification and classification.
Availability and implementation: MegaPlantTF is freely accessible through a public web server available at https://bioinformatics.um6p.ma/MegaPlantTF. The complete source code, including pretrained models and example datasets, is available at https://github.com/Bioinformatics-UM6P/MegaPlantTF.
Title: MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12803907/pdf/
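The k-mer-based protein representation mentioned in the MegaPlantTF abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 20-letter alphabet, and the choice to normalize counts to relative frequencies are assumptions made for illustration; only the k-mer size of 3 comes from the abstract.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids define the k-mer vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_VALID = set(AMINO_ACIDS)


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping any that contain non-standard letters."""
    seq = sequence.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= _VALID)


def kmer_vector(sequence: str, k: int = 3) -> list[float]:
    """Relative frequency of every possible k-mer, in fixed lexicographic order."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total for kmer in vocabulary]


if __name__ == "__main__":
    vec = kmer_vector("MKKLM", k=3)
    print(len(vec), sum(1 for x in vec if x > 0))
```

With k = 3 the vector has 20³ = 8000 entries regardless of sequence length, which is what makes such a representation usable as fixed-size input to a feed-forward network.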
Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf677
Sylvère Bastien, Pauline François, Sara Moussadeq, Jérôme Lemoine, Karen Moreau, François Vandenesch
Motivation: Sequence variability can be extremely high, particularly in bacteria due to the rapid accumulation of mutations linked to their high replication rate and environmental selection pressure, which often favors diversifying selection. For most species, there are no automated, computationally efficient tools available for constructing a nonredundant database covering the allelic variability of target proteins.
Results: We have thus developed Bacterial Peptide Sequence Selection, a Nextflow pipeline to define a minimal list of peptide sequences for detecting all variants of a protein of interest.
Availability and implementation: All code and containers are freely available on GitLab (https://gitbio.ens-lyon.fr/ciri/stapath/bpss) or Zenodo (10.5281/zenodo.16894981) under the GPLv3 open-source license, and on DockerHub at https://hub.docker.com/u/stapath.
Title: BPSS: a Nextflow pipeline for Bacterial Peptide Sequence Selection to detect protein diversity. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797209/pdf/
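Choosing a small list of peptides that together detect every variant of a protein can be framed as a set-cover problem. The sketch below uses the classic greedy heuristic. It is an assumption about how such a selection could be approached, not a description of the BPSS pipeline, and the function name, peptide strings, and variant labels are hypothetical.

```python
def greedy_peptide_cover(peptide_to_variants: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: repeatedly pick the peptide detecting the most uncovered variants.

    The result covers every variant that appears in the input, but the greedy
    heuristic is an approximation and is not guaranteed to be the smallest list.
    """
    uncovered = set().union(*peptide_to_variants.values())
    chosen: list[str] = []
    while uncovered:
        # Sorting makes tie-breaking deterministic (lexicographically first peptide wins).
        best = max(
            sorted(peptide_to_variants),
            key=lambda p: len(peptide_to_variants[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= peptide_to_variants[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical peptides mapped to the protein variants that contain them.
    example = {
        "ELVISK": {"v1", "v2", "v3"},
        "GTAR": {"v3", "v4"},
        "MNPK": {"v4"},
        "QQLR": {"v1"},
    }
    print(greedy_peptide_cover(example))
```

In this toy input, two peptides are enough to cover all four variants, whereas listing one peptide per variant would need up to four.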
Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf441
Aimin Li, Haotian Zhou, Rong Fei, Juntao Zou, Xiguo Yuan, Yajun Liu, Saurav Mallik, Xinhong Hei, Lei Wang
Motivation: Gene expression plays a crucial role in cell function, and enhancers can regulate gene expression precisely. Accurate prediction of enhancers is therefore particularly critical. However, existing prediction methods have low accuracy or rely on a fixed set of epigenetic signals, which may not always be available.
Results: We propose a two-stage framework that accurately predicts enhancers by flexibly combining multiple epigenetic signals. In the first stage, we designed a Blending-KAN model, which integrates the results of various base classifiers and employs Kolmogorov-Arnold Networks (KAN) as a meta-classifier to predict enhancers based on flexible combinations of multiple epigenetic signals. In the second stage, we developed a Stacking-Auto model, which extracts sequence features using DNABERT-2 and locates the enhancers based on the Stacking strategy and the AutoGluon framework. The accuracy of the Blending-KAN model reached 99.69 ± 0.11% when five epigenetic signals were used. In cross-cell line prediction, the accuracy was greater than or equal to 93.72%. With added Gaussian noise, the model still maintained an accuracy of 98.74 ± 0.03%. In the second stage, the accuracy of the Stacking-Auto model was 80.50%, which is better than 17 existing methods. The results show that our models can be flexibly used to predict and locate enhancers using a combination of multiple epigenetic signals.
Availability and implementation: The source code is available at https://github.com/emanlee/Hi-Enhancer and https://doi.org/10.6084/m9.figshare.29262158.v1.
Title: Hi-Enhancer: a two-stage framework for prediction and localization of enhancers based on Blending-KAN and Stacking-Auto models. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758598/pdf/