Pub Date: 2025-11-27; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf308
Pengchong Ma, Haoze Zheng, Weijun Yi, Li Ma, Brandi Sigmon, Karrie A Weber, Gangqing Hu, Qiuming Yao
Motivation: The rapid evolution of bioinformatics tools and multi-step analytic procedures presents a challenge for building effective pipelines, particularly for researchers without extensive programming expertise. This study demonstrates that large language models (LLMs) hold strong potential for generating end-to-end bioinformatic pipelines through carefully crafted prompts, using a multi-step metaviral workflow as a representative example. Multiple LLMs were tested for their effectiveness, including the OpenAI ChatGPT series, the Anthropic Claude series, Google Gemini, Meta Llama, and DeepSeek.
Results: Our results show that ChatGPT-4, ChatGPT-5, Claude 4.5, and Gemini 2.5 consistently outperform other LLMs in generating complete bioinformatic pipelines, with statistically significant success rates. These models also handle tool substitutions effectively. Simple prompt engineering and the inclusion of official documentation further enhance performance, especially for newer bioinformatic tools. While capabilities vary, all LLMs tested show potential for both pipeline generation and updates with our designed prompts and strategies.
Availability and implementation: All prompts are available in the paper. The examples are available on GitHub at https://github.com/mpckkk/pBio.
Title: Prompt-based bioinformatic pipeline generation for a multi-step metaviral workflow. Journal: Bioinformatics advances 6(1), vbaf308. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12782108/pdf/
Pub Date: 2025-11-26; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf304
Sebastian Eschner, Mohammad Alabdullah, Martin Dugas
Summary: Bioinformatics pipelines should meet the FAIR criteria to enable reproducible analysis. FAIR describes four key requirements for reproducible research: findability, accessibility, interoperability and reusability. Software containers such as Singularity are widely used tools that facilitate the reuse of software across different computing environments. However, many biologists and other researchers find command line tools such as Singularity unfamiliar and do not feel productive when using software via the command line. We present a graphical user interface that allows biologists without programming experience to interact with containerized software. We evaluate the feasibility of our approach with software used at the TRR156.
Availability and implementation: Colony can be freely downloaded on its project page: https://clipc-jpg.github.io/ColonyWebsite/. The Colony launcher's code is MIT-licensed and freely available at: https://github.com/clipc-jpg/Colony. All related assets can be found at: https://doi.org/10.7910/DVN/Z3OTWY.
Title: Colony: a framework for reproducible and easy-to-use data analysis pipelines for biomedical research with singularity containers. Journal: Bioinformatics advances 6(1), vbaf304. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12776350/pdf/
Pub Date: 2025-11-24; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf302
James Kitchens, Yan Wong
Motivation: Ancestral recombination graphs (ARGs) are a complete representation of the genetic relationships between recombining lineages and are of central importance in population genetics. Recent breakthroughs in simulation and inference methods have led to a surge of interest in ARGs. However, understanding how best to take advantage of the graphical structure of ARGs remains an open question for researchers. Here, we introduce tskit_arg_visualizer, a Python package for programmatically drawing ARGs using the interactive D3.js visualization library.
Results: We highlight the usefulness of this visualization tool for both teaching ARG concepts and exploring ARGs inferred from empirical datasets.
Availability and implementation: The latest stable version of tskit_arg_visualizer is available through the Python Package Index (https://pypi.org/project/tskit-arg-visualizer, currently v0.1.1). Documentation and the development version of the package are found on GitHub (https://github.com/kitchensjn/tskit_arg_visualizer).
Title: tskit_arg_visualizer: interactive plotting of ancestral recombination graphs. Journal: Bioinformatics advances 5(1), vbaf302. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701794/pdf/
Motivation: Klebsiella pneumoniae is a highly virulent superbug with rising antibiotic resistance worldwide. While matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has transformed microbial identification, its application to antimicrobial resistance prediction remains underexplored, particularly for large clinical cohorts. In this study, we developed machine-learning models with feature-level interpretability using MALDI-TOF MS data to rapidly predict resistance to ciprofloxacin (CIP), cefuroxime (CXM), and ceftriaxone (CRO) in K. pneumoniae.
Results: Using more than 28 000 isolates from two hospitals, the best-performing models reached an independent test accuracy of 0.7858, with sensitivity of 0.7289 and specificity of 0.8127. Several resistance-associated m/z signals (including 3657, 4341, 4519, 4709, 5070, 5409, 5921, 5939, and 6516) were consistently enriched in resistant isolates, offering interpretable spectral markers linked to resistance. Performance remained stable in time-based validation but declined across hospitals, suggesting sensitivity to geographic variability in resistance profiles. Overall, this study demonstrates that combining MALDI-TOF MS with machine learning enables rapid and interpretable prediction of resistance to a commonly used fluoroquinolone and to cephalosporins in K. pneumoniae. These findings highlight the clinical potential of such models for supporting empiric therapy and emphasize the importance of incorporating local data or adaptive strategies to improve generalizability across healthcare settings.
Availability and implementation: Data available on request from the authors.
Title: AI-powered rapid detection of multidrug-resistant Klebsiella pneumoniae with informative peaks of MALDI-TOF MS. Authors: Jang-Jih Lu, Chia-Ru Chung, Hsin-Yao Wang, Yun Tang, Ming-Chien Chiang, Li-Ching Wu, Justin Bo-Kai Hsu, Tzong-Yi Lee, Jorng-Tzong Horng. Pub Date: 2025-11-24; DOI: 10.1093/bioadv/vbaf303. Journal: Bioinformatics advances 6(1), vbaf303. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12776345/pdf/
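The accuracy, sensitivity, and specificity quoted in the Klebsiella pneumoniae entry above are standard confusion-matrix quantities. The following is a minimal Python sketch of how such figures are computed, assuming "resistant" is treated as the positive class. The helper name `binary_metrics` and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp/fn: resistant isolates predicted resistant/susceptible (assumed positive class).
    tn/fp: susceptible isolates predicted susceptible/resistant.
    """
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("each class needs at least one isolate")
    return {
        "accuracy": (tp + tn) / total,    # fraction of all calls that are correct
        "sensitivity": tp / (tp + fn),    # fraction of resistant isolates detected
        "specificity": tn / (tn + fp),    # fraction of susceptible isolates cleared
    }


# Illustrative counts only (hypothetical, not the paper's data).
metrics = binary_metrics(tp=80, fp=50, tn=150, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity next to accuracy matters here because resistant and susceptible isolates are rarely balanced, so accuracy alone can hide a poor detection rate for the resistant class.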
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf298
Roman Schefzik, Han Cao, Sivanesan Rajan, Xavier Escribà-Montagut, Juan R González, Emanuel Schwarz
Motivation: Multi-task learning (MTL) enables simultaneous learning of related regression or classification tasks by exploiting shared information. The R package dsMTL provides a computational framework for federated MTL approaches, supporting the analysis of sensitive, individual-level data from geographically distributed data sources using the DataSHIELD platform. While the current architecture provides comprehensive data security mechanisms, these are not specifically tailored to MTL models. In particular, these models may still be vulnerable to membership inference attacks, attempting to determine whether a specific individual was included in a given training set using the model.
Results: To further enhance the privacy-preserving capabilities of dsMTL and protect against such attacks, differential privacy using the Laplace mechanism is integrated into dsMTL as a novel optional feature. This approach aims to obscure individual-level characteristics from the model while retaining group-level differences. The differential privacy implementation is validated in both simulation studies and a case study identifying schizophrenia patients from gene expression data. For practical utility, it is crucial to find an adequate balance between the degree of privacy protection and the conservation of model performance by choosing a reasonable privacy parameter within the differential privacy mechanism.
Availability and implementation: dsMTL is open-source and available at https://github.com/transbioZI/dsMTLBase (server-side) and https://github.com/transbioZI/dsMTLClient (client-side).
Title: Integrating differential privacy into federated multi-task learning algorithms in dsMTL. Journal: Bioinformatics advances 5(1), vbaf298. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701803/pdf/
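The dsMTL entry above names the Laplace mechanism and a privacy parameter that trades protection against model performance. The following is a generic Python sketch of that mechanism, not dsMTL's R implementation. The abstract does not say which quantities dsMTL perturbs, so the function names, the `sensitivity` argument, and the example values are all assumptions. Noise is drawn with scale sensitivity/epsilon, so a smaller epsilon gives stronger privacy and more noise.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize(values, sensitivity, epsilon, seed=None):
    """Add Laplace noise with scale = sensitivity / epsilon to each value.

    sensitivity: assumed maximum change in a value caused by one individual.
    epsilon: privacy parameter; smaller means stronger privacy, larger noise.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical model coefficients, perturbed at two privacy levels.
coefficients = [0.42, -1.30, 0.07]
print(privatize(coefficients, sensitivity=1.0, epsilon=0.5, seed=1))   # noisier
print(privatize(coefficients, sensitivity=1.0, epsilon=10.0, seed=1))  # closer to the input
```

Running the sketch at several epsilon values and comparing downstream accuracy is one simple way to explore the privacy/performance balance the abstract describes.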
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf299
Jianhong Ou, Kenneth D Poss
Motivation: The three-dimensional organization of the genome plays a critical role in regulating gene expression by shaping the spatial and temporal interactions between regulatory elements. High-throughput chromosome conformation capture (Hi-C) technologies, along with immunoprecipitation- or chromatin accessibility-based chromatin architecture mapping methods, enable the measurement of chromatin dynamics at both bulk and single-cell levels. However, effectively exploring and comparing chromatin structures remains challenging, particularly when integrating multiple layers of genomic annotation or comparing structural dynamics across conditions. While several tools support interactive 3D genome visualization, few provide a flexible, R-integrated framework that supports custom annotations, side-by-side comparison of multiple stages or conditions, and deployment in Shiny applications.
Results: To address this need, we have developed geomeTriD, an R/Bioconductor package that enables interactive visualization of chromatin structures using three.js, supports multi-layer annotation, allows parallel comparison of two chromatin states, and is compatible with Shiny-based analysis workflows. As multi-omic and spatial genomic datasets grow in complexity, geomeTriD will facilitate the reconstruction and comparison of 3D genome structures across conditions, linking chromatin architecture to gene regulation, epigenetic states, and cell-state transitions.
Availability and implementation: geomeTriD is freely available at https://bioconductor.org/packages/geomeTriD.
Title: geomeTriD: a Bioconductor package for interactive and integrative visualization of 3D structural model with multi-omics data. Journal: Bioinformatics advances 5(1), vbaf299. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702139/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf300
Dimitri Höhler, Julia Haag, Alexey M Kozlov, Benoit Morel, Alexandros Stamatakis
Motivation: The performance of phylogenetic inference tools is commonly evaluated using simulated as well as empirical sequence data alignments. An open question is how representative these alignments are of those commonly analyzed by users. Using the RAxMLGrove database, it is now possible to simulate DNA and amino acid sequences based on more than 70 000 representative RAxML and RAxML-NG tree inferences on empirical datasets conducted on the RAxML web servers. This makes it possible to assess the phylogenetic tree inference accuracy of various inference tools based on more realistic and representative simulated alignments.
Results: To automate this process, we implement PhyloSmew, a tool for benchmarking phylogenetic inference tools. We use it to simulate ∼20 000 multiple sequence alignments (MSAs) based on representative empirical trees (in terms of signal strength) from RAxMLGrove. We subsequently analyze 5000 empirical MSAs from the TreeBASE database to assess the inference accuracy of FastTree2, IQ-TREE2, and RAxML-NG. We find that on quantifiably difficult-to-analyze MSAs, all three tree inference tools perform poorly. Hence, the faster FastTree2 tool constitutes a viable alternative for inferring trees on difficult MSAs. We also find that there are substantial differences between accuracy results on simulated versus empirical data.
Availability and implementation: The data underlying this article are available at https://github.com/angtft/PhyloSmew, https://cme.h-its.org/exelixis/material/accuracy-study/data.tar.gz.
Title: Performance assessment of phylogenetic inference tools using PhyloSmew. Journal: Bioinformatics advances 5(1), vbaf300. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701799/pdf/
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf297
Nadeem Khan, Muhammad Muneeb Nasir, Ammar Mushtaq, Masood Ur Rehman Kayani
Motivation: Analysis of genomic variation in microbial genomes is crucial for understanding how microbes adapt, interact with their hosts, and influence health and disease. In metagenomic studies, where genetic material from entire microbial communities is sequenced, thousands of single-nucleotide polymorphisms (SNPs) can be detected across species and samples. However, identifying which of these variations have biologically or functionally relevant impacts remains a significant challenge.
Results: To address this, we present SNPraefentia, a Python-based toolkit for prioritizing microbial SNPs based on their predicted functional relevance. The tool integrates multiple biologically meaningful parameters, including sequencing depth, physicochemical impact of amino acid substitutions, and the structural and functional context of mutations within annotated protein domains. SNPraefentia extracts variation depth and amino acid changes, annotates protein domains using UniProt, and computes individual impact scores. These are then integrated into a composite prioritization score that reflects the potential biological importance of each variant. Overall, SNPraefentia provides researchers with a systematic and reproducible approach to filter and rank microbial variants for downstream functional analysis or experimental validation.
Availability and implementation: The toolkit and test data are freely available at https://github.com/muneebdev7/SNPraefentia.
Title: SNPraefentia: a toolkit to prioritize microbial genome variants linked to health and disease. Journal: Bioinformatics advances 5(1), vbaf297. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12671963/pdf/
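The SNPraefentia entry above describes individual impact scores (sequencing depth, amino acid substitution, protein domain context) that are integrated into one composite prioritization score. The abstract gives no formula, so the following Python sketch is only an illustration of the general idea, using a weighted average of scores assumed to lie in [0, 1]. The class, function names, weights, and variant values are hypothetical and do not reflect SNPraefentia's actual code or scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class VariantScores:
    """Per-variant component scores, each assumed to be scaled to [0, 1]."""
    variant_id: str
    depth_score: float          # support from sequencing depth
    substitution_score: float   # physicochemical impact of the amino acid change
    domain_score: float         # relevance of the affected protein domain


def composite_score(v, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three component scores (illustrative formula)."""
    w_depth, w_sub, w_dom = weights
    total = w_depth + w_sub + w_dom
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (w_depth * v.depth_score
                + w_sub * v.substitution_score
                + w_dom * v.domain_score)
    return weighted / total


def rank_variants(variants, weights=(1.0, 1.0, 1.0)):
    """Return variants sorted from highest to lowest composite score."""
    return sorted(variants, key=lambda v: composite_score(v, weights), reverse=True)


# Made-up variants for demonstration.
variants = [
    VariantScores("snp_A", 0.9, 0.8, 1.0),
    VariantScores("snp_B", 0.5, 0.2, 0.0),
    VariantScores("snp_C", 0.7, 0.9, 0.5),
]
for v in rank_variants(variants):
    print(v.variant_id, round(composite_score(v), 3))
```

Keeping the component scores separate until the final step makes it easy to see which factor drives a variant's rank and to re-rank under different weights.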
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf202
Rui Resende-Pinto, Raquel Ruivo, Josefin Stiller, Rute Fonseca, Luís Filipe C Castro
Summary: High-fidelity genome assemblies provide unprecedented opportunities to decipher mechanisms of molecular evolution and phenotype landscapes. Here, we present PseudoChecker2, a command-line version of the web-tool PseudoChecker with expanded functions. It identifies gene loss via drastic mutational events such as premature stop codons, deletions and insertions. It enables the investigation of cross-species genomic datasets through: (i) integration into automated workflows, (ii) multiprocessing capability, and (iii) creation of a functional reference from annotation files. In addition, we introduce PseudoViz, a novel graphical interface designed to help interpret the results of PseudoChecker2 with intuitive visualizations. These tools combine the versatility and automation of a command-line tool with the user-friendliness of a graphical interface to tackle the challenges of the Genome Era.
Availability and implementation: PseudoChecker2 and PseudoViz are fully available at https://github.com/rresendepinto/PseudoChecker2 and https://github.com/rresendepinto/PseudoViz.
{"title":"PseudoChecker2 and PseudoViz: automation and visualization of gene loss in the Genome Era.","authors":"Rui Resende-Pinto, Raquel Ruivo, Josefin Stiller, Rute Fonseca, Luís Filipe C Castro","doi":"10.1093/bioadv/vbaf202","DOIUrl":"10.1093/bioadv/vbaf202","url":null,"abstract":"<p><strong>Summary: </strong>High-fidelity genome assemblies provide unprecedented opportunities to decipher mechanisms of molecular evolution and phenotype landscapes. Here, we present PseudoChecker2, a command-line version of the web-tool PseudoChecker with expanded functions. It identifies gene loss via drastic mutational events such as premature stop codons, deletions and insertions. It enables the investigation of cross-species genomic datasets through: (i) integration into automated workflows, (ii) multiprocessing capability, and (iii) creation of a functional reference from annotation files. In addition, we introduce PseudoViz, a novel graphical interface designed to help interpret the results of PseudoChecker2 with intuitive visualizations. 
These tools combine the versatility and automation of a command-line tool with the user-friendliness of a graphical interface to tackle the challenges of the Genome Era.</p><p><strong>Availability and implementation: </strong>PseudoChecker2 and PseudoViz are fully available at https://github.com/rresendepinto/PseudoChecker2 and https://github.com/rresendepinto/PseudoViz.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf202"},"PeriodicalIF":2.8,"publicationDate":"2025-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679834/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145703175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
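Illustrative note: the PseudoChecker2 summary above names premature stop codons, deletions and insertions as the mutational events that signal gene loss. The sketch below shows one simple way such events could be flagged in a coding sequence. It is not PseudoChecker2's implementation; the function names, the use of the standard genetic code, and the assumption that the sequence is already in frame are all hypothetical choices made for illustration.

```python
# Hypothetical sketch, not taken from PseudoChecker2.
# Assumes the standard genetic code and a coding sequence that starts in frame.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_premature_stops(cds):
    """Return 0-based codon indices of in-frame stop codons before the last full codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3  # trailing bases that do not fill a codon are ignored
    return [
        i
        for i in range(n_codons - 1)  # the last full codon is the expected stop
        if cds[3 * i : 3 * i + 3] in STOP_CODONS
    ]


def is_frameshift(ref_len, alt_len):
    """An insertion or deletion shifts the frame when the length change is not a multiple of 3."""
    return (alt_len - ref_len) % 3 != 0
```

For example, find_premature_stops("ATGAAATAGGGCTGA") reports the TAG at codon index 2, and is_frameshift(12, 11) is true for a single-base deletion, whereas a three-base deletion keeps the frame intact.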
Pub Date : 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf296
Dmitry N Konanov, Danil V Krivonos, Vladislav V Babenko, Elena N Ilina
Motivation: DNA methylation in bacteria is now studied mainly with single-molecule sequencing technologies such as PacBio and Oxford Nanopore. In nanopore sequencing, methylated positions are called by dedicated models implemented directly in the basecallers. Prokaryotic DNA methyltransferases are site-specific enzymes that catalyze methylation within specific methylation motifs. These motifs are usually inferred with third-party software such as MEME, which provides classical motif enrichment. However, currently used motif enrichment algorithms rely only on sequence data and do not use the additional base modification information provided by the basecaller.
Results: Here, we present Snappy, a new tool that rethinks the original Snapper algorithm: it uses no enrichment heuristics and does not require control sample sequencing. Snappy combines basecalling data processing with a new graph-based enrichment algorithm, thus significantly enhancing enrichment sensitivity and accuracy. The versatility of the method was demonstrated on both our own and external data, representing different bacterial species with complex and simple methylomes.
Availability and implementation: Source code and documentation are hosted on GitHub (https://github.com/DNKonanov/ont-snappy) and Zenodo (zenodo.org/records/16731817). For accessibility, Snappy is installable from PyPI using the "pip install ont-snappy" command.
{"title":"Snappy: fast identification of DNA methylation motifs based on oxford nanopore reads.","authors":"Dmitry N Konanov, Danil V Krivonos, Vladislav V Babenko, Elena N Ilina","doi":"10.1093/bioadv/vbaf296","DOIUrl":"10.1093/bioadv/vbaf296","url":null,"abstract":"<p><strong>Motivation: </strong>Nowadays, DNA methylation in bacteria is studied mainly using single-molecule sequencing technologies like PacBio and Oxford Nanopore. In nanopore sequencing, calling of methylated positions is provided by special models implemented directly in basecallers. Prokaryotic DNA methyltransferases are site-specific enzymes, which catalyze methylation in specific methylation motifs. Inference of these motifs is usually performed using third party software like MEME providing classical motif enrichment based only on sequence data. However, currently used motif enrichment algorithms rely only on sequence data, and do not use additional base modification information provided by the basecaller.</p><p><strong>Results: </strong>Herein, we present a new tool Snappy, which is actually rethinking of the original Snapper algorithm but does not use any enrichment heuristics and does not require control sample sequencing. Snappy combines basecalling data processing with a new graph-based enrichment algorithm, thus significantly enhancing the enrichment sensitivity and accuracy. The versatility of the method was shown on both our and external data, representing different bacterial species with complex and simple methylome.</p><p><strong>Availability and implementation: </strong>Source code and documentation is hosted on GitHub (https://github.com/DNKonanov/ont-snappy) and Zenodo (zenodo.org/records/16731817). 
For accessibility, Snappy is installable from PyPi using \"pip install ont-snappy\" command.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf296"},"PeriodicalIF":2.8,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679398/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145703204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}