Motivation: Drug repurposing offers a cost-effective and time-efficient strategy for identifying new therapeutic uses for existing medications, capitalizing on their known safety profiles and pharmacokinetics. We present an automated virtual screening pipeline using AutoDock Vina, a molecular docking software that predicts how small molecules bind to protein targets. This pipeline enhances the speed and accuracy of drug candidate identification by automating and parallelizing the docking process.
Results: We developed and validated a fully automated virtual screening pipeline based on AutoDock Vina, enabling computational parallelization and random ligand positioning without relying on prior knowledge of biologically active protein domains. As a proof of concept, the pipeline was applied to the "serotonin and anxiety" pathway. Docking results were compared with known drug-target interactions, demonstrating the ability of the pipeline to reliably identify compounds interacting with serotonin receptors. This case study confirms the pipeline's effectiveness in supporting drug repurposing by identifying promising candidates for further experimental validation.
Availability and implementation: The AutoDock Vina automation pipeline is freely available for noncommercial use at https://gitlab.com/la_sveva/pip2.0. It is compatible with Linux systems, and a Docker image is provided for ease of deployment and reproducibility. Researchers can easily integrate the pipeline into existing workflows, supporting broader adoption in virtual screening and drug repurposing projects.
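The random ligand positioning and parallel docking described in the Results can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names (`random_box_center`, `build_vina_command`, `run_jobs`), the bounding-box input, and the file names are assumptions made for illustration. Only the command-line flags shown are standard AutoDock Vina options.

```python
import random
import subprocess
from concurrent.futures import ThreadPoolExecutor


def random_box_center(bounds, rng):
    """Sample a search-box center uniformly inside a receptor bounding box.

    bounds is ((xmin, xmax), (ymin, ymax), (zmin, zmax)) in angstroms
    (hypothetical input; the real pipeline may derive it differently).
    """
    return tuple(rng.uniform(lo, hi) for lo, hi in bounds)


def build_vina_command(receptor, ligand, center, size, out, cpu=1):
    """Assemble the argument list for one AutoDock Vina docking job."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        "vina",
        "--receptor", receptor,
        "--ligand", ligand,
        "--center_x", f"{cx:.3f}",
        "--center_y", f"{cy:.3f}",
        "--center_z", f"{cz:.3f}",
        "--size_x", str(sx),
        "--size_y", str(sy),
        "--size_z", str(sz),
        "--out", out,
        "--cpu", str(cpu),
    ]


def run_jobs(commands, workers=4):
    """Run docking jobs concurrently; each job is a separate vina process."""
    def run_one(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


# Build (but do not run) three jobs with randomly placed search boxes.
rng = random.Random(42)
example_bounds = ((-20.0, 20.0), (-15.0, 15.0), (-10.0, 10.0))
jobs = [
    build_vina_command(
        "receptor.pdbqt", "ligand.pdbqt",
        random_box_center(example_bounds, rng),
        (20, 20, 20), f"pose_{i}.pdbqt",
    )
    for i in range(3)
]
```

Calling `run_jobs(jobs)` would launch the processes and requires a `vina` executable on the PATH. Threads suffice here because the work happens in the child processes.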
Citation: Sveva Bonomi, Stefano Carsi, Emily Samuela Turilli-Ghisolfi, Elisa Oltra, Tiziana Alberio, Mauro Fasano. Unveiling novel drug-target couples: an empowered automated pipeline for enhanced virtual screening using AutoDock Vina. Bioinformatics Advances 5(1), vbaf267. Published 2025-11-12. DOI: 10.1093/bioadv/vbaf267. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699991/pdf/
Pub Date: 2025-11-12; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf290
Ondřej Sladký, Pavel Veselý, Karel Břinda
Motivation: The growing volumes and heterogeneity of genomic data call for scalable and versatile k-mer-set indexes. However, state-of-the-art indexes such as SBWT and SSHash depend on long non-branching paths in de Bruijn graphs, which limits their efficiency for small k, sampled data, or high-diversity settings.
Results: We introduce FMSI, a superstring-based index for arbitrary k-mer sets that supports efficient membership and compressed dictionary queries with strong theoretical guarantees. FMSI builds on recent advances in k-mer superstrings and uses the Masked Burrows-Wheeler Transform (MBWT), a novel extension of the classical Burrows-Wheeler Transform that incorporates position masking. Across a range of k values and dataset types (genomic, pangenomic, and metagenomic), FMSI consistently achieves superior query space efficiency, using up to 2-3× less memory than state-of-the-art methods while maintaining competitive query times. Only a space-optimized version of SBWT can match FMSI's footprint in some cases, but FMSI is then 2-3× faster. Our results establish superstring-based indexing as a robust, scalable, and versatile framework for arbitrary k-mer sets across diverse bioinformatics applications.
Availability and implementation: FMSI is developed in C++ and released under the MIT license, with source code provided at https://github.com/OndrejSladky/fmsi and an installable package available through Bioconda. The datasets used in the experiments are deposited at Zenodo (https://doi.org/10.5281/zenodo.14722244).
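To make the idea of a masked superstring concrete, the following sketch decodes a k-mer set from a superstring and a binary mask, and answers membership queries by a naive scan. It assumes the convention that a k-mer is represented when at least one of its occurrences starts at a masked-on position, and it ignores reverse complements. FMSI answers the same query through the MBWT, which this sketch does not implement; the function names are hypothetical.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Return the k-mers that start at positions where the mask is '1'."""
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    return {
        superstring[i:i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == "1"
    }


def contains(superstring, mask, query):
    """Naive membership test: is any occurrence of the query masked on?"""
    start = superstring.find(query)
    while start != -1:
        if mask[start] == "1":
            return True
        start = superstring.find(query, start + 1)
    return False


# The superstring ACGTA with mask 11000 represents the 3-mers ACG and CGT.
# GTA occurs in the superstring but is masked off, so it is not in the set.
example_set = kmers_from_masked_superstring("ACGTA", "11000", 3)
```

The scan costs time linear in the superstring length per query; an index replaces it with a search over a compressed representation.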
Citation: Ondřej Sladký, Pavel Veselý, Karel Břinda. FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). Bioinformatics Advances 6(1), vbaf290. Published 2025-11-12. DOI: 10.1093/bioadv/vbaf290. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12800775/pdf/
Pub Date: 2025-11-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf288
Valentina Di Salvatore, Avisa Maleki, Babak Mohajer, Alvaro Ras-Carmona, Giulia Russo, Pedro Antonio Reche, Francesco Pappalardo
Motivation: The rapid evolution of SARS-CoV-2 highlights the importance of computational approaches to explore mutational effects on the viral spike protein. In this work, we present a genetic algorithm (GA) framework applied to the structural optimization of spike protein variants, with a focus on energetic and binding properties rather than direct evolutionary prediction.
Results: Our GA-driven pipeline generated spike variants with progressively improved structural stability, as indicated by lower Discrete Optimized Protein Energy (DOPE) scores across generations. The approach also enabled evaluation of Gibbs free energy and binding affinity for interactions between spike and the angiotensin-converting enzyme 2 (ACE2) receptor, revealing candidate conformations with favorable thermodynamic properties. These results demonstrate the algorithm's capacity to refine protein models and explore mutational landscapes in silico, although no validation against naturally emerging variants was performed. This study presents a methodological framework for GA-based structural modeling of SARS-CoV-2 spike mutations. Rather than forecasting specific variants of concern, it demonstrates the feasibility of a computational approach that can be extended and integrated with evolutionary and experimental evidence to strengthen future efforts in variant monitoring and vaccine development.
Availability and implementation: All the Python and R scripts are available upon request to the authors.
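The generation-by-generation improvement described in the Results follows the usual genetic algorithm loop of mutation, scoring, and selection. The sketch below is a toy illustration only: `toy_energy` is a hypothetical stand-in (mismatches to an arbitrary reference string), not a structural energy function, and none of the parameters are taken from the paper.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq, reference):
    """Hypothetical score: number of mismatches to a reference (lower is better)."""
    return sum(a != b for a, b in zip(seq, reference))


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def evolve(start, reference, generations=20, pop_size=30, seed=0):
    """Run an elitist GA and return the best sequence and per-generation best scores."""
    rng = random.Random(seed)
    population = [mutate(start, rng) for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=lambda s: toy_energy(s, reference))
        best_per_generation.append(toy_energy(population[0], reference))
        survivors = population[: pop_size // 2]  # keep the better half unchanged
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    best = min(population, key=lambda s: toy_energy(s, reference))
    return best, best_per_generation
```

Because the better half of each generation is carried over unchanged, the best score recorded per generation can never get worse, which mirrors the monotone trend the abstract reports.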
Citation: Valentina Di Salvatore, Avisa Maleki, Babak Mohajer, Alvaro Ras-Carmona, Giulia Russo, Pedro Antonio Reche, Francesco Pappalardo. Exploring SARS-CoV-2 spike protein mutations through genetic algorithm-driven structural modeling. Bioinformatics Advances 5(1), vbaf288. Published 2025-11-11. DOI: 10.1093/bioadv/vbaf288. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12627402/pdf/
Pub Date: 2025-11-11; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf289
Manju Anandakrishnan, Karen E Ross, Chuming Chen, K Vijay-Shanker, Cathy H Wu
Motivation: Protein kinases regulate cellular signaling pathways through a cascade of phosphorylation activity, selectively targeting specific residues on substrate proteins (phosphosites). The characteristics of kinases that phosphorylate specific substrates have been extensively studied. Most tools utilize amino acid sequence motifs around phosphosites but do not consider the substrate protein's biological characteristics.
Results: We present KSMoFinder, a kinase-substrate-motif prediction model that learns factors beyond motif similarities by integrating proteins' biological contexts. We learn the semantics in a knowledge graph containing proteins' contextual relationships, kinase-specific motifs, and motif composition, and represent the proteins and motifs as vectors. Using these representations as features, we train a supervised deep-learning classifier to identify kinase-phosphosite relationships. We use a ground-truth kinase-substrate-motif dataset from iPTMnet and PhosphositePlus to evaluate KSMoFinder's prediction performance. Pairwise comparative assessments with prior kinase-substrate prediction tools demonstrate KSMoFinder's superior performance. KSMoFinder trained using our knowledge graph embeddings achieves a ROC-AUC of 0.851 and a PR-AUC of 0.839 on a testing dataset with an equal number of positives and negatives, surpassing the prediction performance obtained with embeddings from popular protein language models such as ProtT5, ESM2, and ESM3. Unlike most existing tools, KSMoFinder can make predictions at both the motif level and the substrate-protein level.
Availability and implementation: Source code is available at https://github.com/manju-anandakrishnan/KSMoFinder.
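For readers less familiar with the reported metric, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a generic illustration, not code from KSMoFinder, and the quadratic pairwise loop is only suitable for small examples.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.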
Citation: Manju Anandakrishnan, Karen E Ross, Chuming Chen, K Vijay-Shanker, Cathy H Wu. KSMoFinder-knowledge graph embedding of proteins and motifs for predicting kinases of human phosphosites. Bioinformatics Advances 5(1), vbaf289. Published 2025-11-11. DOI: 10.1093/bioadv/vbaf289. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664573/pdf/
Pub Date: 2025-11-10; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf283
Stefan Vocht, Yanren Linda Hu, Andreas Lösch, Kevin Rupp, Tilo Wettig, Lars Grasedyck, Niko Beerenwinkel, Rainer Spang, Rudolf Schill
Summary: Mutual Hazard Networks (MHNs) are statistical models for analyzing (genetic) cancer progression. Many cancers develop silently and are only noticeable when they have significantly progressed, creating an observational gap until diagnosis. MHNs bridge this gap by reconstructing the underlying dynamics of disease progression. We present mhn, a Python package for dynamic cancer progression analysis using MHNs. It trains an MHN model from tumor genotypes. mhn overcomes challenges of numerical efficiency in model training by making use of state space restriction, which allows MHNs to be trained with >100 mutational events, five times more than was previously possible. The package offers (i) reconstruction of the most likely evolutionary history of tumors, (ii) sampling of artificial tumor histories, and (iii) visualization of genomic interactions and likely progression trajectories. These features substantially extend earlier implementations, providing a fast and user-friendly framework for researchers and clinicians to study cancer dynamics.
Availability and implementation: mhn can be installed from PyPI using pip and is available under the MIT License on GitHub (https://github.com/spang-lab/LearnMHN). Installation instructions and package functionalities are detailed on GitHub and PyPI, with a comprehensive guide on Read the Docs (https://learnmhn.readthedocs.io/en/latest/index.html) and a Jupyter notebook on GitHub to help users explore the package.
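The core of an MHN is a matrix of parameters in which the diagonal holds each event's baseline rate and the off-diagonal entries hold multiplicative effects of events that have already occurred. The sketch below computes the resulting event rate for a given tumor genotype. It deliberately does not use the mhn package API; the function name and the list-of-lists representation of the parameter matrix are assumptions made for illustration.

```python
def event_rate(theta, state, i):
    """Rate at which event i occurs next, given the current genotype.

    theta[i][i] is the baseline rate of event i; theta[i][j] (j != i) is the
    multiplicative effect on event i of event j having already occurred.
    state is a sequence of 0/1 flags marking which events are present.
    """
    if state[i]:
        return 0.0  # an event that has already occurred cannot occur again
    rate = theta[i][i]
    for j, present in enumerate(state):
        if present and j != i:
            rate *= theta[i][j]
    return rate


# Two events: event 1 doubles the rate of event 0; event 0 halves the rate of event 1.
example_theta = [
    [1.0, 2.0],
    [0.5, 3.0],
]
```

With `example_theta`, event 0 has rate 1.0 in the empty genotype and rate 2.0 once event 1 is present, so off-diagonal values above 1 promote an event and values below 1 inhibit it.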
Citation: Stefan Vocht, Yanren Linda Hu, Andreas Lösch, Kevin Rupp, Tilo Wettig, Lars Grasedyck, Niko Beerenwinkel, Rainer Spang, Rudolf Schill. mhn: a Python package for analyzing cancer progression with Mutual Hazard Networks. Bioinformatics Advances 6(1), vbaf283. Published 2025-11-10. DOI: 10.1093/bioadv/vbaf283. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12776348/pdf/
Pub Date: 2025-11-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf286
Finn Archinuk, Haley Greenyer, Ulrike Stege, Steffany A L Bennett, Miroslava Cuperlovic-Culf, Hosna Jabbari
Motivation: Various methods have been proposed to construct metabolic networks from metabolomic data; however, small sample sizes, multiple confounding factors, the presence of indirect interactions as well as randomness in metabolic processes are of major concern.
Results: In this study, we benchmark existing algorithms for creating correlation- and regression-based networks of changes in metabolite abundance and evaluate their performance across different sample sizes of a generative model. Using standard interaction-level tests and network-scale analyses based on centrality scores, we assess how well these methods recover the represented metabolomic networks. Our findings reveal significant challenges in network inference and result interpretation, even when sample sizes are large and the data are the result of computer modeling of metabolic pathways. Despite these limitations, we demonstrate that correlation-based network inference can, to some extent, discriminate between two different metabolic states in the computational model. This suggests potential utility in distinguishing overarching changes in metabolic processes, but not direct pathways, in different conditions.
Availability and implementation: All relevant data is provided at https://github.com/TheCOBRALab/metabolicRelationships.
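A correlation-based network of the kind benchmarked here can be sketched in a few lines: compute a pairwise correlation between metabolite abundance profiles and keep the pairs whose absolute correlation exceeds a threshold. The code below is a generic illustration, not one of the benchmarked algorithms; the function names, the 0.8 threshold, and the input layout are assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_network(abundances, threshold=0.8):
    """Return {(a, b): r} for metabolite pairs with |r| >= threshold.

    abundances maps a metabolite name to its list of per-sample values.
    """
    edges = {}
    for a, b in combinations(sorted(abundances), 2):
        r = pearson(abundances[a], abundances[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges
```

An edge in such a network only says that two metabolites co-vary. It can arise from an indirect interaction or a shared upstream driver, which is one reason the abstract cautions against reading these networks as direct pathways.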
Citation: Finn Archinuk, Haley Greenyer, Ulrike Stege, Steffany A L Bennett, Miroslava Cuperlovic-Culf, Hosna Jabbari. Are the tools fit for purpose? Network inference algorithms evaluated on a simulated lipidomics network. Bioinformatics Advances 5(1), vbaf286. Published 2025-11-09. DOI: 10.1093/bioadv/vbaf286. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12640239/pdf/
Pub Date: 2025-11-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf279
Alexandre Lanau, Joshua J Waterfall
Motivation: Some recently published methods for single-cell RNA-seq preprocessing and correction are not necessarily available in both Python and R, which limits the accessibility of these tools to the wider community.
Results: We present pyALRA, an efficient Python implementation of the (r-)ALRA R package, which imputes dropout values using a low-rank zero-preserving approximation for single-cell RNA-seq. This re-implementation achieves similar prediction performance using corresponding Python methods and improves both speed and RAM consumption.
Availability and implementation: pyALRA is released as open-source software under the MIT license. The source code is available on GitHub at https://github.com/alexandrelanau/pyALRA and on Zenodo at https://doi.org/10.5281/zenodo.15730914.
Citation: Alexandre Lanau, Joshua J Waterfall. pyALRA: python implementation of low-rank zero-preserving approximation of single cell RNA-seq. Bioinformatics Advances 5(1), vbaf279. Published 2025-11-09. DOI: 10.1093/bioadv/vbaf279. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664701/pdf/
Pub Date: 2025-11-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf275
Jordan C Rozum, Hunter Ufford, Alexandria K Im, Tong Zhang, David D Pollock, Doo Nam Kim, Song Feng
Summary: Understanding protein function at the molecular level requires connecting residue-level annotations with physical and structural properties. This can be cumbersome and error-prone when functional annotation, computation of physicochemical properties, and structure visualization are separated. To address this, we introduce ProCaliper, an open-source Python library for computing and visualizing physicochemical properties of proteins. It can retrieve annotation and structure data from UniProt and AlphaFold databases, compute residue-level properties such as charge, solvent accessibility, and protonation state, and interactively visualize the results of these computations along with user-supplied residue-level data. Additionally, ProCaliper incorporates functional and structural information to construct and optionally sparsify networks that encode the distance between residues and/or annotated functional sites or regions.
Availability and implementation: The package ProCaliper and its source code, along with the code used to generate the figures in this manuscript, are freely available at https://github.com/PNNL-Predictive-Phenomics/ProCaliper.
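The residue distance networks mentioned in the Summary can be pictured as a graph whose edges carry the distance between residue pairs, sparsified by dropping distant pairs. The sketch below shows that idea with a simple distance cutoff. It does not use the ProCaliper API; the function name, the 8 Å cutoff, and the coordinate dictionary are assumptions, and the library may sparsify differently.

```python
import math
from itertools import combinations


def distance_network(coords, cutoff=8.0):
    """Return {(a, b): distance} for residue pairs within `cutoff` angstroms.

    coords maps a residue identifier to an (x, y, z) coordinate, for example
    the position of its alpha carbon.
    """
    edges = {}
    for a, b in combinations(sorted(coords), 2):
        d = math.dist(coords[a], coords[b])
        if d <= cutoff:
            edges[(a, b)] = d
    return edges


example_coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (30.0, 0.0, 0.0)}
example_edges = distance_network(example_coords)
```

Here residues 1 and 2 are 5 Å apart and stay connected, while residue 3 is farther than the cutoff from both and becomes isolated.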
Citation: Jordan C Rozum, Hunter Ufford, Alexandria K Im, Tong Zhang, David D Pollock, Doo Nam Kim, Song Feng. ProCaliper: functional and structural analysis, visualization, and annotation of proteins. Bioinformatics Advances 5(1), vbaf275. Published 2025-11-09. DOI: 10.1093/bioadv/vbaf275. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12607263/pdf/
Pub Date: 2025-11-09; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf287
René L Warren, Lauren Coombe, Johnathan Wong, Parham Kazemi, Inanc Birol
Motivation: Ancestry information is essential to large cohort studies but is often unavailable or inconsistently measured. For studies involving genome sequencing, existing ancestry prediction methods are constrained by computational demands and complex input requirements. Efficient, scalable approaches are needed to infer ancestry directly from sequencing data while maintaining accuracy and reproducibility.
Results: We present ntRoot, a computationally lightweight method for inferring human super-population-level ancestry from whole-genome assemblies or from short- or long-read sequencing data. Utilizing a reference-guided, alignment-free single-nucleotide variant detection framework, ntRoot employs a succinct Bloom filter to efficiently query diverse genomic inputs against a variant reference panel with known genotypes and ancestry. Demonstrated on over 600 human genome samples, including complete genomes, draft assemblies, and 280 independently generated samples, ntRoot accurately predicts geographic labels and shows high concordance with traditional methods such as ADMIXTURE (R² = 0.9567) when estimating ancestry fractions. Analyses complete within 30 min for assemblies and 75 min for 30-fold sequencing data, using 13-68 GB of memory. ntRoot provides global and local ancestry inference, delivering high-resolution predictions across genomic loci. This approach fills a critical gap in cohort studies by enabling rapid, resource-efficient, and accurate ancestry inference at scale, advancing ancestry characterization in genomic research.
Availability: ntRoot is freely available on GitHub (https://github.com/bcgsc/ntroot).
ntRoot: computational inference of human ancestry at scale from genomic data. René L Warren, Lauren Coombe, Johnathan Wong, Parham Kazemi, Inanc Birol. Bioinformatics Advances 5(1), vbaf287. Published 2025-11-09. DOI: 10.1093/bioadv/vbaf287. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12695050/pdf/
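The alignment-free, Bloom-filter-based variant query described in the ntRoot abstract can be illustrated with a toy sketch. This is not ntRoot's implementation: the class `BloomFilter`, the function `supported_variants`, the panel layout, and the choice to load sample k-mers into the filter and then probe it with k-mers carrying each panel allele are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def supported_variants(sample_seq, reference, panel, k):
    """Return panel variants whose allele-spanning k-mers all occur in the sample."""
    bf = BloomFilter()
    for km in kmers(sample_seq, k):
        bf.add(km)
    hits = []
    for pos, alt, label in panel:
        # Substitute the alternate base into the reference, then take the
        # window whose k-mers all overlap the variant position.
        alt_ref = reference[:pos] + alt + reference[pos + 1:]
        window = alt_ref[max(0, pos - k + 1):min(len(alt_ref), pos + k)]
        spanning = list(kmers(window, k))
        if spanning and all(km in bf for km in spanning):
            hits.append((pos, alt, label))
    return hits


K = 5
reference = "ACGTACGGTCATTGCA"
# Hypothetical panel: (0-based position, alternate base, population label).
panel = [(7, "A", "POP1"), (11, "C", "POP2")]
sample = "ACGTACGATCATTGCA"  # differs from the reference only at position 7
hits = supported_variants(sample, reference, panel, K)
print(hits)
```

Because a Bloom filter can report false positives, the sketch requires every k-mer spanning a variant to be present before counting it; a real tool would also need to handle sequencing errors, reverse complements, and genotype weighting, none of which are modeled here.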
Motivation: High-throughput sequencing (HTS) has become an integral part of routine analysis for microbiologists. The process of sequencing dozens of samples generates vast amounts of data that cannot be annotated manually. To address this challenge, numerous tools for bacterial genome analysis have been developed over the years. Using freely available databases, these tools enable users to significantly accelerate their analyses. However, many of these tools require advanced computer science expertise to operate effectively.
Results: To overcome this limitation, we developed BacExplorer. Featuring a user-friendly interface, a locally installable application, and an interactive HTML report, BacExplorer empowers users of all skill levels to perform their own analyses with ease and efficiency.
Availability and implementation: BacExplorer is available at: https://github.com/knowmics-lab/BacExplorer.
BacExplorer: an integrated platform for de novo bacterial genome annotation. Grete Francesca Privitera, Adriana Antonella Cannata, Floriana Campanile, Salvatore Alaimo, Dafne Bongiorno, Alfredo Pulvirenti. Bioinformatics Advances 5(1), vbaf281. Published 2025-11-09. DOI: 10.1093/bioadv/vbaf281. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12640510/pdf/