Pub Date: 2025-11-20; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf293
Siddharth Sethi, Emil K Gustavsson, Harpreet Saini, Mina Ryten
Motivation: Long-read RNA sequencing has the potential to accurately quantify transcriptomes and reveal the isoform diversity of disease-causing genes. However, despite the recent advances in analysis tools for transcript discovery, long-read RNA sequencing data is still challenging to analyse, due to the detection of hundreds or even thousands of novel transcripts per gene.
Results: Here, we introduce PSQAN, a workflow to help researchers prioritize high-confidence and potentially biologically relevant transcripts associated with candidate genes and make transcript characterization results more interpretable. PSQAN performs a gene-based analysis on characterized transcripts generated by SQANTI3 and TALON. PSQAN re-groups transcripts into easily interpretable categories to facilitate their prioritization, allows transcript-level expression thresholds, and generates visualizations to determine optimal expression thresholds. Overall, we demonstrate that PSQAN is a useful tool which enables users to identify known and novel transcripts of potential biological importance.
Availability and implementation: PSQAN is an analysis workflow implemented in Snakemake and R and is licensed under the GNU General Public License version 3. The source code and documentation of this tool are available at https://github.com/sid-sethi/PSQAN.
{"title":"PSQAN: a pipeline to prioritize novel and biologically relevant transcripts from long-read RNA sequencing.","authors":"Siddharth Sethi, Emil K Gustavsson, Harpreet Saini, Mina Ryten","doi":"10.1093/bioadv/vbaf293","DOIUrl":"10.1093/bioadv/vbaf293","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf293"},"PeriodicalIF":2.8,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701792/pdf/","platform":"Semanticscholar","paperid":"145758443","PMCID":"OA"}
Pub Date: 2025-11-20; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf285
Yijun Li, Stefan Stanojevic, Bing He, Zheng Jing, Qianhui Huang, Jian Kang, Lana X Garmire
Motivation: Spatial transcriptomics has allowed researchers to analyze transcriptome data in its tissue sample's spatial context. Various methods have been developed for detecting spatially variable genes (SV genes), whose gene expression over the tissue space shows strong spatial autocorrelation. Such genes are often used to define clusters in cells or spots downstream. However, highly variable (HV) genes, whose quantitative gene expressions show significant variation from cell to cell, are conventionally used in clustering analyses.
Results: In this report, we investigate whether adding highly variable genes to spatially variable genes can improve the cell type clustering performance in spatial transcriptomics data. We tested the clustering performance of HV genes, SV genes, and the union of both gene sets (concatenation) on over 50 real spatial transcriptomics datasets across multiple platforms, using a variety of spatial and non-spatial metrics. Our results show that combining HV genes and SV genes can improve overall cell-type clustering performance.
Availability and implementation: All data and code used in this evaluation study can be found at the following link: https://github.com/lanagarmire/ST_benchmark.
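The "union of both gene sets" step described in the Results can be illustrated with a minimal sketch. The gene names and the helper `union_gene_sets` are hypothetical and are not taken from the ST_benchmark repository; the sketch only shows how a combined feature list might be formed before clustering.

```python
def union_gene_sets(hv_genes, sv_genes):
    """Return the order-preserving union of two gene lists.

    HV genes come first, followed by SV genes not already present.
    """
    seen = set()
    combined = []
    for gene in list(hv_genes) + list(sv_genes):
        if gene not in seen:
            seen.add(gene)
            combined.append(gene)
    return combined


# Hypothetical gene lists, for illustration only.
hv = ["GeneA", "GeneB", "GeneC"]  # highly variable genes
sv = ["GeneB", "GeneD"]           # spatially variable genes

features = union_gene_sets(hv, sv)
print(features)
```

Keeping the order stable makes the feature list reproducible across runs, which matters when the same list is used to subset an expression matrix.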
{"title":"Adding highly variable genes to spatially variable genes can improve cell type clustering performance in spatial transcriptomics data.","authors":"Yijun Li, Stefan Stanojevic, Bing He, Zheng Jing, Qianhui Huang, Jian Kang, Lana X Garmire","doi":"10.1093/bioadv/vbaf285","DOIUrl":"10.1093/bioadv/vbaf285","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"6 1","pages":"vbaf285"},"PeriodicalIF":2.8,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12809558/pdf/","platform":"Semanticscholar","paperid":"145999833","PMCID":"OA"}
Motivation: Cancer immunotherapy uses the immune system to recognize and eliminate tumor cells by presenting tumor antigens through Human Leukocyte Antigen (HLA) molecules. Accurate prediction of HLA-peptide interactions is essential for personalized immunotherapy development. Allele-specific models achieve high accuracy and handle variable peptide lengths but require separate training for each allele, limiting scalability to rare or unseen HLAs. Pan-specific models generalize across multiple alleles and match or surpass allele-specific methods. Ensemble methods improve prediction by combining outputs from multiple predictors, often via linear combinations, though nonlinear strategies may better capture HLA-peptide complexities. We propose SiaScoreNet, a three-step predictive pipeline enhancing HLA-peptide interaction prediction. First, ESM, a pretrained transformer-based protein language model, embeds HLA and peptide sequences into fixed-length representations, accommodating varying sequence lengths. Second, we integrate predicted scores from state-of-the-art models into a comprehensive feature vector. Third, a nonlinear ensemble strategy combines features, capturing complex dependencies and boosting performance.
Results: Benchmark evaluations show SiaScoreNet outperforms existing models in accuracy, comparable to TransPHLA, BigMHC, and CapHLA. Recent models prioritize recall over precision, valuable for identifying potential binders but resource-intensive. SiaScoreNet offers improved performance and runtime efficiency compared to these models, evaluated against HPV viruses for HLA-peptide prediction.
Availability and implementation: The data and source code for prediction and experiments presented in this study are publicly available in the SiaScoreNet repository hosted on GitHub: https://github.com/CBRC-lab/SiaScoreNet.
{"title":"SiaScoreNet: a siamese neural network-based model integrating prediction scores for HLA-peptide interaction prediction.","authors":"Mahsa Saadat, Fatemeh Zare-Mirakabad, Milad Besharatifard","doi":"10.1093/bioadv/vbaf248","DOIUrl":"10.1093/bioadv/vbaf248","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf248"},"PeriodicalIF":2.8,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12641608/pdf/","platform":"Semanticscholar","paperid":"145607747","PMCID":"OA"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf294
Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska
Summary: Recent advances in artificial intelligence have accelerated the adoption of machine learning (ML) in biology, enabling powerful predictive models across diverse applications. However, in scientific research, the need for interpretability and mechanistic insight remains crucial. To address this, we introduce MxlPy, a Python package that combines mechanistic modelling with ML to deliver explainable, data-informed solutions. MxlPy facilitates mechanistic learning, an emerging approach that integrates the transparency of mathematical models with the flexibility of data-driven methods. By streamlining tasks such as data integration, model formulation, output analysis, and surrogate modelling, MxlPy enhances the modelling experience without sacrificing interpretability. Designed for both computational biologists and interdisciplinary researchers, it supports the development of accurate, efficient, and explainable models, making it a valuable tool for advancing bioinformatics, systems biology, and biomedical research.
Availability and implementation: MxlPy source code is freely available at https://github.com/Computational-Biology-Aachen/MxlPy. The full documentation, with features and examples, can be found at https://computational-biology-aachen.github.io/MxlPy.
{"title":"MxlPy-Python package for mechanistic learning and hybrid modelling in life science.","authors":"Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska","doi":"10.1093/bioadv/vbaf294","DOIUrl":"10.1093/bioadv/vbaf294","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf294"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12668773/pdf/","platform":"Semanticscholar","paperid":"145662876","PMCID":"OA"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf291
Annice Najafi
Summary: SurprisalAnalysis is an open-source R package with an accompanying web-based application that uses surprisal analysis to extract patterns of genes that tend to be up- or down-regulated as a result of a biological process. Surprisal analysis frames gene expression values in thermodynamic terms and identifies entropy-driven constraints and relevant gene weights that allow the decomposition of each gene's expression into a baseline (maximal entropy) component and one or more constraint-driven components. These components correspond to distinct biological modules or processes whose coordinated up- or down-regulation underlies the observed system dynamics.
Availability and implementation: SurprisalAnalysis is written in R and is freely available on GitHub (https://github.com/AnniceNajafi/SurprisalAnalysis). The package is distributed under a permissive license to promote scientific collaboration and reproducibility. A web-based application with a Graphical User Interface (GUI) is hosted on https://najafiannice.shinyapps.io/surprisal_analysis_app/.
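The decomposition described in the Summary is commonly written in the surprisal analysis literature in the form below. The notation is an assumption for illustration and is not taken from the abstract or the package: X_i(t) is the expression of gene i in sample or time point t, X_i^0 is its baseline (maximal entropy) value, G_{i\alpha} is the weight of gene i in constraint \alpha, and \lambda_\alpha(t) is the amplitude (Lagrange multiplier) of that constraint.

```latex
\ln X_i(t) \;=\; \ln X_i^{0} \;-\; \sum_{\alpha \ge 1} G_{i\alpha}\,\lambda_{\alpha}(t)
```

Under this reading, genes with large positive or negative G_{i\alpha} are the ones most strongly up- or down-regulated by constraint \alpha, and the sign and size of \lambda_\alpha(t) indicate how active that constraint is in a given sample.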
{"title":"SurprisalAnalysis: an open-source software for information-theoretic analysis of gene expression.","authors":"Annice Najafi","doi":"10.1093/bioadv/vbaf291","DOIUrl":"10.1093/bioadv/vbaf291","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf291"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12707977/pdf/","platform":"Semanticscholar","paperid":"145776403","PMCID":"OA"}
Pub Date: 2025-11-18; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf295
Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L Grossman
Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language.
Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts.
Availability and implementation: We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
{"title":"GDC Cohort Copilot: an AI copilot for curating cohorts from the genomic data commons.","authors":"Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L Grossman","doi":"10.1093/bioadv/vbaf295","DOIUrl":"10.1093/bioadv/vbaf295","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf295"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12677940/pdf/","platform":"Semanticscholar","paperid":"145703191","PMCID":"OA"}
Pub Date: 2025-11-17; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf292
Seth Stadick
Motivation: To date, there has been no command line utility for performing index-free alignment-based filtering of records. Since filtering with command line tools is a staple of bioinformatics, this leaves a gap in command line workflows.
Results: Ish is the first composable unix-style command line tool for filtering the input target records to only those that match the input query with a threshold alignment score, using a selectable alignment algorithm and selectable record type. The core alignment algorithms for ish meet or exceed the performance of their reference implementations in Parasail for both SIMD and GPU alignment, as measured by gigacell updates per second (GCUPs).
Availability and implementation: The source code and documentation are available at https://github.com/BioRadOpenSource/ish under the Apache-2.0 License and open for community contributions. Ish is installable with Conda and supports Linux and macOS.
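The filtering idea in the Results (keep only target records whose alignment score against the query reaches a threshold) can be sketched in plain Python. This is not ish's implementation, which is SIMD and GPU accelerated; the scoring parameters, the linear gap penalty, and the function names are assumptions chosen for illustration.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q_char in query:
        curr = [0]
        for j, t_char in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q_char == t_char else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def filter_records(query, records, min_score):
    """Keep records whose best local alignment score reaches min_score."""
    return [r for r in records if local_alignment_score(query, r) >= min_score]


# Hypothetical records, for illustration only.
records = ["xxACGTACGTxx", "TTTTTTTT", "ACGAACGT"]
print(filter_records("ACGTACGT", records, min_score=12))
```

Each query/target pair fills len(query) × len(target) matrix cells, which is the quantity that a cell-updates-per-second throughput figure counts.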
{"title":"Ish: SIMD and GPU accelerated local and semi-global alignment as a CLI filtering tool.","authors":"Seth Stadick","doi":"10.1093/bioadv/vbaf292","DOIUrl":"https://doi.org/10.1093/bioadv/vbaf292","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf292"},"PeriodicalIF":2.8,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12930471/pdf/","platform":"Semanticscholar","paperid":"147291984","PMCID":"OA"}
Pub Date: 2025-11-13; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf282
Margot Celerier, Andrew J Oldfield, William Ritchie
Motivation: Modern genomics laboratories generate massive volumes of sequencing data, often resulting in significant storage costs. Genomics storage consists of duplicate files, temporary processing files, and redundant intermediate data.
Results: We developed SeqManager, a web-based application that provides automated identification, classification, and management of sequencing data files with intelligent duplicate detection. It also detects intermediate sequencing files that can safely be removed. Evaluation across four genomics laboratory settings demonstrates that our tool is fast and has a very low memory footprint.
Availability and implementation: SeqManager is freely available under the MIT license at https://github.com/AIGeneRegulation/Sequencing-Data-Manager.
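The abstract does not state how SeqManager detects duplicates. One common approach, shown here purely as an assumed illustration, is to group files by size first and then compare content hashes only within groups of equal size. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of files under root that have identical content."""
    # Cheap pre-filter: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Hash only the files that share a size with at least one other file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

The size pre-filter keeps most files from being read at all, and chunked hashing keeps memory use independent of file size.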
{"title":"SeqManager: a web-based tool for efficient sequencing data storage management and duplicate detection.","authors":"Margot Celerier, Andrew J Oldfield, William Ritchie","doi":"10.1093/bioadv/vbaf282","DOIUrl":"10.1093/bioadv/vbaf282","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf282"},"PeriodicalIF":2.8,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701791/pdf/","platform":"Semanticscholar","paperid":"145758440","PMCID":"OA"}
Motivation: Drug repurposing offers a cost-effective and time-efficient strategy for identifying new therapeutic uses for existing medications, capitalizing on their known safety profiles and pharmacokinetics. We present an automated virtual screening pipeline using AutoDock Vina, a molecular docking software that predicts how small molecules bind to protein targets. This pipeline enhances the speed and accuracy of drug candidate identification by automating and parallelizing the docking process.
Results: We developed and validated a fully automated virtual screening pipeline based on AutoDock Vina, enabling computational parallelization and random ligand positioning without relying on prior knowledge of biologically active protein domains. As a proof of concept, the pipeline was applied to the "serotonin and anxiety" pathway. Docking results were compared with known drug-target interactions, demonstrating the ability of the pipeline to reliably identify compounds interacting with serotonin receptors. This case study confirms the pipeline's effectiveness in supporting drug repurposing by identifying promising candidates for further experimental validation.
Availability and implementation: The AutoDock Vina automation pipeline is freely available for noncommercial use at https://gitlab.com/la_sveva/pip2.0. It is compatible with Linux systems, and a Docker image is provided for ease of deployment and reproducibility. Researchers can easily integrate the pipeline into existing workflows, supporting broader adoption in virtual screening and drug repurposing projects.
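A rough sketch of how parallel docking runs with randomly placed search boxes might be set up is given below. It is not the published pipeline: interpreting "random ligand positioning" as random search-box centers is an assumption, as are the file names, the bounds, and the helper functions. Only the standard AutoDock Vina command line options for receptor, ligand, box center, box size, and output are used.

```python
import random
import subprocess
from concurrent.futures import ThreadPoolExecutor


def random_center(bounds, rng):
    """Draw a random box center inside per-axis (low, high) bounds, in angstroms."""
    return tuple(round(rng.uniform(low, high), 3) for low, high in bounds)


def vina_command(receptor, ligand, center, out, size=(20.0, 20.0, 20.0)):
    """Build the argument list for one AutoDock Vina run."""
    cmd = ["vina", "--receptor", receptor, "--ligand", ligand, "--out", out]
    for axis, c, s in zip("xyz", center, size):
        cmd += [f"--center_{axis}", str(c), f"--size_{axis}", str(s)]
    return cmd


def run_all(commands, workers=4):
    """Run docking jobs concurrently and return one exit code per job."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


rng = random.Random(0)
# Hypothetical receptor extent along x, y, z.
bounds = [(-10.0, 10.0), (-10.0, 10.0), (-10.0, 10.0)]
commands = [
    vina_command(
        "receptor.pdbqt",
        f"ligand_{i}.pdbqt",
        random_center(bounds, rng),
        f"pose_{i}.pdbqt",
    )
    for i in range(3)
]
# exit_codes = run_all(commands)  # requires the `vina` executable on PATH
```

Threads are sufficient here because the heavy computation happens in the child `vina` processes; the Python side only waits for them to finish.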
{"title":"Unveiling novel drug-target couples: an empowered automated pipeline for enhanced virtual screening using AutoDock Vina.","authors":"Sveva Bonomi, Stefano Carsi, Emily Samuela Turilli-Ghisolfi, Elisa Oltra, Tiziana Alberio, Mauro Fasano","doi":"10.1093/bioadv/vbaf267","DOIUrl":"10.1093/bioadv/vbaf267","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf267"},"PeriodicalIF":2.8,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699991/pdf/","platform":"Semanticscholar","paperid":"145758515","PMCID":"OA"}
Pub Date: 2025-11-12; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf290
Ondřej Sladký, Pavel Veselý, Karel Břinda
Motivation: The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as SBWT and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings.
Results: We introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform, a novel extension of the classical Burrows-Wheeler Transform that incorporates position masking. Across a range of k values and dataset types (genomic, pangenomic, and metagenomic), FMSI consistently achieves superior space efficiency for queries, using up to 2-3× less memory than state-of-the-art methods while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI's footprint in some cases, but FMSI is then 2-3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications.
Availability and implementation: FMSI is developed in C++ and released under the MIT license, with source code provided at https://github.com/OndrejSladky/fmsi and an installable package available through Bioconda. The datasets used in the experiments are deposited at Zenodo (https://doi.org/10.5281/zenodo.14722244).
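The masked-superstring idea that the Results paragraph builds on can be illustrated with a small Python sketch. It is not FMSI's implementation: it assumes a naive greedy overlap merge and a linear text scan in place of the Masked Burrows-Wheeler Transform, it ignores reverse complements, and the function names `build_masked_superstring` and `contains` are hypothetical. The point is only to show how a mask over positions lets a single string represent an arbitrary k-mer set, including "ghost" k-mers that occur in the string but are switched off.

```python
from typing import Iterable, List, Tuple


def build_masked_superstring(kmers: Iterable[str], k: int) -> Tuple[str, List[bool]]:
    """Merge k-mers greedily into one string and mask the represented positions.

    Hypothetical helper for illustration; real tools use better superstring
    heuristics and a compressed index instead of a plain Python string.
    """
    wanted = set(kmers)
    if any(len(kmer) != k for kmer in wanted):
        raise ValueError("all k-mers must have length k")

    superstring = ""
    for kmer in sorted(wanted):
        # Find the longest suffix of the superstring (at most k-1 characters)
        # that is also a prefix of the next k-mer, and append only the rest.
        overlap = 0
        for n in range(min(k - 1, len(superstring)), 0, -1):
            if superstring.endswith(kmer[:n]):
                overlap = n
                break
        superstring += kmer[overlap:]

    # mask[i] is True only if the k-mer starting at position i is in the set.
    mask = [superstring[i:i + k] in wanted for i in range(len(superstring))]
    return superstring, mask


def contains(superstring: str, mask: List[bool], kmer: str) -> bool:
    """Membership query: the k-mer must occur at a position whose mask is on."""
    start = superstring.find(kmer)
    while start != -1:
        if mask[start]:
            return True
        start = superstring.find(kmer, start + 1)
    return False


superstring, mask = build_masked_superstring(["ACG", "GTA"], 3)
print(superstring, mask)
print(contains(superstring, mask, "ACG"))  # in the set
print(contains(superstring, mask, "CGT"))  # ghost k-mer, masked off
```

Here the set {ACG, GTA} is stored as the superstring `ACGTA`. The k-mer `CGT` appears in that string only as a side effect of the merge, so its mask bit is off and the query rejects it.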
Title: FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). Authors: Ondřej Sladký, Pavel Veselý, Karel Břinda. Journal: Bioinformatics Advances, volume 6, issue 1, vbaf290. Publication date: 2025-11-12. DOI: 10.1093/bioadv/vbaf290. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12800775/pdf/