Pub Date: 2025-11-20; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf285
Yijun Li, Stefan Stanojevic, Bing He, Zheng Jing, Qianhui Huang, Jian Kang, Lana X Garmire
Motivation: Spatial transcriptomics allows researchers to analyze transcriptome data in the spatial context of the tissue sample. Various methods have been developed for detecting spatially variable genes (SV genes), whose expression over the tissue space shows strong spatial autocorrelation. Such genes are often used to define clusters of cells or spots in downstream analyses. However, highly variable (HV) genes, whose expression levels vary significantly from cell to cell, are conventionally used in clustering analyses.
Results: In this report, we investigate whether adding highly variable genes to spatially variable genes can improve the cell type clustering performance in spatial transcriptomics data. We tested the clustering performance of HV genes, SV genes, and the union of both gene sets (concatenation) on over 50 real spatial transcriptomics datasets across multiple platforms, using a variety of spatial and non-spatial metrics. Our results show that combining HV genes and SV genes can improve overall cell-type clustering performance.
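The "union of both gene sets" step can be illustrated with a minimal sketch. The gene names and the helper `combine_gene_sets` are hypothetical and are not taken from the study's repository; the sketch only shows how two selected gene lists might be merged without duplicates before clustering.

```python
def combine_gene_sets(hv_genes, sv_genes):
    """Return the union of two gene lists, keeping first-seen order."""
    seen = set()
    combined = []
    for gene in list(hv_genes) + list(sv_genes):
        if gene not in seen:
            seen.add(gene)
            combined.append(gene)
    return combined


# Hypothetical gene identifiers, for illustration only.
hv = ["GeneA", "GeneB", "GeneC"]   # highly variable genes
sv = ["GeneB", "GeneD"]            # spatially variable genes
combined = combine_gene_sets(hv, sv)
print(combined)
```

Keeping a stable order (rather than using a bare `set`) makes the resulting feature matrix reproducible from run to run, which matters when clustering results are compared across gene sets.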
Availability and implementation: All data and code used in this evaluation study can be found at the following link: https://github.com/lanagarmire/ST_benchmark.
{"title":"Adding highly variable genes to spatially variable genes can improve cell type clustering performance in spatial transcriptomics data.","authors":"Yijun Li, Stefan Stanojevic, Bing He, Zheng Jing, Qianhui Huang, Jian Kang, Lana X Garmire","doi":"10.1093/bioadv/vbaf285","journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbaf285"},"PeriodicalIF":2.8,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12809558/pdf/"}
Motivation: Cancer immunotherapy uses the immune system to recognize and eliminate tumor cells by presenting tumor antigens through Human Leukocyte Antigen (HLA) molecules. Accurate prediction of HLA-peptide interactions is essential for personalized immunotherapy development. Allele-specific models achieve high accuracy and handle variable peptide lengths but require separate training for each allele, limiting scalability to rare or unseen HLAs. Pan-specific models generalize across multiple alleles and match or surpass allele-specific methods. Ensemble methods improve prediction by combining outputs from multiple predictors, often via linear combinations, though nonlinear strategies may better capture HLA-peptide complexities. We propose SiaScoreNet, a three-step predictive pipeline enhancing HLA-peptide interaction prediction. First, ESM, a pretrained transformer-based protein language model, embeds HLA and peptide sequences into fixed-length representations, accommodating varying sequence lengths. Second, we integrate predicted scores from state-of-the-art models into a comprehensive feature vector. Third, a nonlinear ensemble strategy combines features, capturing complex dependencies and boosting performance.
Results: Benchmark evaluations show SiaScoreNet outperforms existing models in accuracy, comparable to TransPHLA, BigMHC, and CapHLA. Recent models prioritize recall over precision, valuable for identifying potential binders but resource-intensive. SiaScoreNet offers improved performance and runtime efficiency compared to these models, evaluated against HPV viruses for HLA-peptide prediction.
Availability and implementation: The data and source code for prediction and experiments presented in this study are publicly available in the SiaScoreNet repository hosted on GitHub: https://github.com/CBRC-lab/SiaScoreNet.
{"title":"SiaScoreNet: a siamese neural network-based model integrating prediction scores for HLA-peptide interaction prediction.","authors":"Mahsa Saadat, Fatemeh Zare-Mirakabad, Milad Besharatifard","doi":"10.1093/bioadv/vbaf248","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf248"},"PeriodicalIF":2.8,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12641608/pdf/"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf294
Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska
Summary: Recent advances in artificial intelligence have accelerated the adoption of machine learning (ML) in biology, enabling powerful predictive models across diverse applications. However, in scientific research, the need for interpretability and mechanistic insight remains crucial. To address this, we introduce MxlPy, a Python package that combines mechanistic modelling with ML to deliver explainable, data-informed solutions. MxlPy facilitates mechanistic learning, an emerging approach that integrates the transparency of mathematical models with the flexibility of data-driven methods. By streamlining tasks such as data integration, model formulation, output analysis, and surrogate modelling, MxlPy enhances the modelling experience without sacrificing interpretability. Designed for both computational biologists and interdisciplinary researchers, it supports the development of accurate, efficient, and explainable models, making it a valuable tool for advancing bioinformatics, systems biology, and biomedical research.
Availability and implementation: MxlPy source code is freely available at https://github.com/Computational-Biology-Aachen/MxlPy. The full documentation, with features and examples, can be found at https://computational-biology-aachen.github.io/MxlPy.
{"title":"MxlPy-Python package for mechanistic learning and hybrid modelling in life science.","authors":"Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska","doi":"10.1093/bioadv/vbaf294","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf294"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12668773/pdf/"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf291
Annice Najafi
Summary: SurprisalAnalysis is an open-source R package with an accompanying web-based application that uses surprisal analysis to extract patterns of genes that tend to be up- or down-regulated as a result of a biological process. Surprisal analysis frames gene expression values in thermodynamic terms and identifies entropy-driven constraints and relevant gene weights that allow the decomposition of each gene's expression into a baseline (maximal entropy) component and one or more constraint-driven components. These components correspond to distinct biological modules or processes whose coordinated up- or down-regulation underlies the observed system dynamics.
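The decomposition described above is often written in the following form in the general surprisal analysis literature. The notation is an assumption drawn from that convention and is not stated in this abstract: X_i(t) is the expression of gene i in state or time point t, X_i^0 is its baseline (maximal entropy) level, G_{iα} is the weight of gene i in constraint α, and λ_α(t) is the amplitude of constraint α.

```latex
\ln X_i(t) = \ln X_i^{0} - \sum_{\alpha \ge 1} G_{i\alpha}\, \lambda_{\alpha}(t)
```

Under this convention, genes with large positive or negative G_{iα} are the ones most strongly down- or up-regulated by process α when λ_α(t) is positive. Sign conventions vary between presentations, so the package documentation should be consulted for the exact form it uses.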
Availability and implementation: SurprisalAnalysis is written in R and is freely available on GitHub (https://github.com/AnniceNajafi/SurprisalAnalysis). The package is distributed under a permissive license to promote scientific collaboration and reproducibility. A web-based application with a Graphical User Interface (GUI) is hosted on https://najafiannice.shinyapps.io/surprisal_analysis_app/.
{"title":"SurprisalAnalysis: an open-source software for information-theoretic analysis of gene expression.","authors":"Annice Najafi","doi":"10.1093/bioadv/vbaf291","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf291"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12707977/pdf/"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf295
Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L Grossman
Motivation: The Genomic Data Commons (GDC) provides access to high-quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language.
Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts.
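To make the idea of a "cohort filter" concrete, here is a minimal sketch of the kind of structured filter that a description such as "female patients in TCGA-LUAD" might map to. The nested `op`/`content` shape follows the filter format accepted by the public GDC API; whether GDC Cohort Copilot emits exactly this shape is an assumption, and the helper functions `in_filter` and `and_filter` are hypothetical.

```python
import json


def in_filter(field, values):
    """Build a single 'field is one of values' clause."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def and_filter(*clauses):
    """Combine clauses so that all of them must hold."""
    return {"op": "and", "content": list(clauses)}


# Illustrative target for: "female patients in the TCGA-LUAD project".
cohort_filter = and_filter(
    in_filter("cases.project.project_id", ["TCGA-LUAD"]),
    in_filter("cases.demographic.gender", ["female"]),
)
print(json.dumps(cohort_filter, indent=2))
```

Generating a structured object like this, rather than free text, is what allows the result to be validated and handed back to the GDC for further analysis.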
Availability and implementation: We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
{"title":"GDC Cohort Copilot: an AI copilot for curating cohorts from the genomic data commons.","authors":"Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L Grossman","doi":"10.1093/bioadv/vbaf295","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf295"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12677940/pdf/"}
Pub Date: 2025-11-13; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf282
Margot Celerier, Andrew J Oldfield, William Ritchie
Motivation: Modern genomics laboratories generate massive volumes of sequencing data, often resulting in significant storage costs. Genomics storage often contains duplicate files, temporary processing files, and redundant intermediate data.
Results: We developed SeqManager, a web-based application that provides automated identification, classification, and management of sequencing data files with intelligent duplicate detection. It also detects intermediate sequencing files that can safely be removed. Evaluation across four genomics laboratory settings demonstrates that our tool is fast and has a very low memory footprint.
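The abstract does not say how SeqManager detects duplicates. As a hedged illustration of one common approach, the sketch below groups files by size first and then by SHA-256 content hash, reading in chunks so memory use stays low. The function names `file_digest` and `find_duplicates` are hypothetical and are not part of SeqManager.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of files under root that have identical content."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for path in paths:
            by_hash[file_digest(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates("/path/to/sequencing/data")` returns a list of groups, each a sorted list of paths with identical bytes. The size pre-filter avoids hashing most files, which is one plausible way to keep such a scan fast on large sequencing directories.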
Availability and implementation: SeqManager is freely available under the MIT license at https://github.com/AIGeneRegulation/Sequencing-Data-Manager.
{"title":"SeqManager: a web-based tool for efficient sequencing data storage management and duplicate detection.","authors":"Margot Celerier, Andrew J Oldfield, William Ritchie","doi":"10.1093/bioadv/vbaf282","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf282"},"PeriodicalIF":2.8,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701791/pdf/"}
Motivation: Drug repurposing offers a cost-effective and time-efficient strategy for identifying new therapeutic uses for existing medications, capitalizing on their known safety profiles and pharmacokinetics. We present an automated virtual screening pipeline using AutoDock Vina, a molecular docking software that predicts how small molecules bind to protein targets. This pipeline enhances the speed and accuracy of drug candidate identification by automating and parallelizing the docking process.
Results: We developed and validated a fully automated virtual screening pipeline based on AutoDock Vina, enabling computational parallelization and random ligand positioning without relying on prior knowledge of biologically active protein domains. As a proof of concept, the pipeline was applied to the "serotonin and anxiety" pathway. Docking results were compared with known drug-target interactions, demonstrating the ability of the pipeline to reliably identify compounds interacting with serotonin receptors. This case study confirms the pipeline's effectiveness in supporting drug repurposing by identifying promising candidates for further experimental validation.
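As a rough illustration of automated, parallel docking with randomly placed search boxes, the sketch below builds AutoDock Vina command lines and defines a helper that could dispatch them concurrently. It is not the authors' pipeline: interpreting "random ligand positioning" as random search-box centers is an assumption, and the file names, bounds, and function names are hypothetical. Only standard Vina command-line options (`--receptor`, `--ligand`, `--out`, `--center_*`, `--size_*`) are used.

```python
import random
import subprocess
from concurrent.futures import ThreadPoolExecutor


def random_box_centers(bounds, n, rng):
    """Draw n search-box centers uniformly inside per-axis (low, high) bounds."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]


def vina_command(receptor, ligand, center, size, out_path):
    """Assemble one Vina invocation as an argument list."""
    cmd = ["vina", "--receptor", receptor, "--ligand", ligand, "--out", out_path]
    for axis, c, s in zip("xyz", center, size):
        cmd += [f"--center_{axis}", f"{c:.3f}", f"--size_{axis}", f"{s:.3f}"]
    return cmd


def run_all(commands, workers=4):
    """Run the commands concurrently; requires the vina binary on PATH."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c, check=True), commands))


rng = random.Random(42)  # fixed seed so the box placements are reproducible
bounds = [(-10.0, 10.0), (-10.0, 10.0), (-10.0, 10.0)]  # hypothetical, in angstroms
centers = random_box_centers(bounds, 3, rng)
commands = [
    vina_command("receptor.pdbqt", "ligand.pdbqt", c, (20.0, 20.0, 20.0), f"pose_{i}.pdbqt")
    for i, c in enumerate(centers)
]
for cmd in commands:
    print(" ".join(cmd))
```

The example only prints the commands; calling `run_all(commands)` would execute them. Threads are sufficient here because each worker just waits on an external process.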
Availability and implementation: The AutoDock Vina automation pipeline is freely available for noncommercial use at https://gitlab.com/la_sveva/pip2.0. It is compatible with Linux systems, and a Docker image is provided for ease of deployment and reproducibility. Researchers can easily integrate the pipeline into existing workflows, supporting broader adoption in virtual screening and drug repurposing projects.
{"title":"Unveiling novel drug-target couples: an empowered automated pipeline for enhanced virtual screening using AutoDock Vina.","authors":"Sveva Bonomi, Stefano Carsi, Emily Samuela Turilli-Ghisolfi, Elisa Oltra, Tiziana Alberio, Mauro Fasano","doi":"10.1093/bioadv/vbaf267","journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf267"},"PeriodicalIF":2.8,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699991/pdf/"}
Pub Date: 2025-11-12; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf290
Ondřej Sladký, Pavel Veselý, Karel Břinda
Motivation: The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as SBWT and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings.
Results: We introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform, a novel extension of the classical Burrows-Wheeler Transform that incorporates position masking. Across a range of k values and dataset types (genomic, pangenomic, and metagenomic), FMSI consistently achieves superior query space efficiency, using up to 2-3× less memory than state-of-the-art methods, while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI's footprint in some cases, but then FMSI is 2-3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications.
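The masked-superstring representation underlying this kind of index can be sketched in a few lines. In the sketch, a k-mer set is stored as a superstring plus a binary mask, and a k-mer belongs to the set only if it occurs at a position whose mask bit is 1. This is a naive linear scan for illustration only: FMSI itself answers queries through the Masked Burrows-Wheeler Transform, and this sketch also ignores reverse complements. The function names and the example data are hypothetical.

```python
def represented_kmers(superstring, mask, k):
    """Enumerate the k-mers whose start position is switched on in the mask."""
    return {
        superstring[i:i + k]
        for i, bit in enumerate(mask)
        if bit == "1" and i + k <= len(superstring)
    }


def is_member(superstring, mask, kmer):
    """Membership query: the k-mer must occur at some mask-enabled position."""
    start = superstring.find(kmer)
    while start != -1:
        if mask[start] == "1":
            return True
        start = superstring.find(kmer, start + 1)
    return False


# The set {ACG, GTA} for k = 3, stored in the superstring ACGTA.
# CGT also occurs in the superstring, but its mask bit is 0, so it is a
# "ghost" k-mer that is not part of the represented set.
superstring = "ACGTA"
mask = "10100"
print(sorted(represented_kmers(superstring, mask, 3)))
print(is_member(superstring, mask, "CGT"))
```

The mask is what lets the superstring be chosen freely for compactness, without requiring long non-branching paths in the de Bruijn graph of the k-mer set.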
Availability and implementation: FMSI is developed in C++ and released under the MIT license, with source code provided at https://github.com/OndrejSladky/fmsi and an installable package available through Bioconda. The datasets used in the experiments are deposited at Zenodo (https://doi.org/10.5281/zenodo.14722244).
{"title":"FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT).","authors":"Ondřej Sladký, Pavel Veselý, Karel Břinda","doi":"10.1093/bioadv/vbaf290","journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbaf290"},"PeriodicalIF":2.8,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12800775/pdf/"}
Pub Date: 2025-11-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf288
Valentina Di Salvatore, Avisa Maleki, Babak Mohajer, Alvaro Ras-Carmona, Giulia Russo, Pedro Antonio Reche, Francesco Pappalardo
Motivation: The rapid evolution of SARS-CoV-2 highlights the importance of computational approaches to explore mutational effects on the viral spike protein. In this work, we present a genetic algorithm (GA) framework applied to the structural optimization of spike protein variants, with a focus on energetic and binding properties rather than direct evolutionary prediction.
Results: Our GA-driven pipeline generated spike variants with progressively improved structural stability as indicated by lower discrete optimized protein energy scores across generations. The approach also enabled evaluation of Gibbs free energy and binding affinity for spike-Angiotensin-converting enzyme 2 receptor interactions, revealing candidate conformations with favorable thermodynamic properties. These results demonstrate the algorithm's capacity to refine protein models and explore mutational landscapes in silico, although no validation against naturally emerging variants was performed. This study presents a methodological framework for GA-based structural modeling of SARS-CoV-2 spike mutations. Rather than forecasting specific variants of concern, it demonstrates the feasibility of a computational approach that can be extended and integrated with evolutionary and experimental evidence to strengthen future efforts in variant monitoring and vaccine development.
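The general shape of a genetic algorithm that drives a score downward across generations can be sketched as follows. This is a toy under stated assumptions, not the authors' pipeline: `toy_energy` is a hypothetical stand-in that counts mismatches to an arbitrary reference string, whereas the study scores structural models with discrete optimized protein energy. The sequence, rates, and function names are all illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq, reference):
    """Stand-in score (lower is better): mismatches to an arbitrary reference."""
    return sum(a != b for a, b in zip(seq, reference))


def mutate(seq, rate, rng):
    """Replace each residue with a random one with probability rate."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(population, score, generations, rng, mutation_rate=0.05, elite=2):
    """Return the best sequence and the best score seen at each generation."""
    history = []
    for _ in range(generations):
        population = sorted(population, key=score)
        history.append(score(population[0]))
        parents = population[: max(elite, len(population) // 2)]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), mutation_rate, rng)
            for _ in range(len(population) - elite)
        ]
        # Elites are copied unchanged, so the best score can never get worse.
        population = population[:elite] + children
    best = min(population, key=score)
    history.append(score(best))
    return best, history


rng = random.Random(0)
reference = "MKTAYIAKQR"  # arbitrary string, not a real spike fragment
population = ["".join(rng.choice(AMINO_ACIDS) for _ in reference) for _ in range(20)]
best, history = evolve(population, lambda s: toy_energy(s, reference), 30, rng)
print(best, history[0], history[-1])
```

Because the elite individuals survive each generation unchanged, the recorded best score is non-increasing, which mirrors the "progressively lower scores across generations" behavior described above.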
Availability and implementation: All Python and R scripts are available from the authors upon request.
Title: Exploring SARS-CoV-2 spike protein mutations through genetic algorithm-driven structural modeling.
Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf288. Publication date: 2025-11-11. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12627402/pdf/
Pub Date: 2025-11-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf289
Manju Anandakrishnan, Karen E Ross, Chuming Chen, K Vijay-Shanker, Cathy H Wu
Motivation: Protein kinases regulate cellular signaling pathways through a cascade of phosphorylation activity, selectively targeting specific residues on substrate proteins (phosphosites). The characteristics of kinases that phosphorylate specific substrates have been extensively studied. Most tools use amino acid sequence motifs around phosphosites but do not consider the substrate proteins' biological characteristics.
Results: We present KSMoFinder, a kinase-substrate-motif prediction model that learns factors beyond motif similarities by integrating proteins' biological contexts. We learn the semantics of a knowledge graph containing proteins' contextual relationships, kinase-specific motifs, and motif composition, and represent the proteins and motifs as vectors. Using these representations as features, we train a supervised deep-learning classifier to identify kinase-phosphosite relationships. We evaluate KSMoFinder's prediction performance on a ground-truth kinase-substrate-motif dataset from iPTMnet and PhosphoSitePlus. Pairwise comparative assessments with prior kinase-substrate prediction tools demonstrate KSMoFinder's superior performance. Trained on our knowledge graph embeddings, KSMoFinder achieves a ROC-AUC of 0.851 and a PR-AUC of 0.839 on a test dataset with an equal number of positives and negatives, surpassing the performance obtained with embeddings from popular protein language models such as ProtT5, ESM2, and ESM3. Unlike most existing tools, KSMoFinder can predict at both the motif level and the substrate protein level.
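To make the two reported metrics concrete, the sketch below computes ROC-AUC and a PR-AUC estimate for a small set of labeled scores. It is an illustration only and does not reproduce KSMoFinder's evaluation code. The function names are hypothetical, the example labels and scores are made up, and average precision is used as one common estimator of PR-AUC, which may differ from the estimator the authors used.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@k taken at each positive's rank.

    Tied scores are ordered arbitrarily here, so break ties before calling
    this function if exact reproducibility matters.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Made-up balanced example: two positives, two negatives.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]
```

For this example, `roc_auc` returns 0.75 (three of the four positive-negative pairs are ranked correctly) and `average_precision` returns 5/6. On a balanced test set such as the one described above, a random classifier is expected to score about 0.5 on both metrics, which is the baseline against which values like 0.851 and 0.839 should be read.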
Availability and implementation: Source code is available at https://github.com/manju-anandakrishnan/KSMoFinder.
Title: KSMoFinder-knowledge graph embedding of proteins and motifs for predicting kinases of human phosphosites.
Journal: Bioinformatics Advances, volume 5, issue 1, article vbaf289. Publication date: 2025-11-11. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664573/pdf/