Pub Date: 2025-12-02; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf312
Lapo Doni, Alessia Marotta, Luigi Vezzulli, Emanuele Bosi
Motivation: The revolution of next-generation sequencing has driven the establishment of metabarcoding as an efficient and cost-effective method for exploring community composition. Amplicon sequencing of taxonomic marker genes, such as the 16S rRNA gene in prokaryotes, provides an efficient method for high-throughput taxonomic profiling. The advent of long-read technologies has made it feasible to sequence the whole 16S rRNA gene rather than only a few regions, with the potential to achieve species-level resolution. Despite the affordability and scalability of such experiments, a major bottleneck remains the lack of integrated and user-friendly analytical workflows. Current pipelines often require the use of multiple tools with complex dependencies, and parameter optimization is frequently performed manually, limiting reproducibility and overall efficiency.
Results: To address these limitations, we developed AmpWrap, an automated, one-line workflow designed to analyse both Illumina and Nanopore amplicons. It requires minimal effort from the user and automatically optimizes the trimming parameter to retain the maximum number of reads and information while reducing noise.
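The abstract does not say how the trimming parameter is optimized, so the following is only an illustrative sketch of the general idea, not AmpWrap's actual procedure. It assumes a hypothetical criterion: among candidate truncation lengths, keep the one that maximizes retained bases (reads kept times length), where a read is kept if it is long enough and its truncated prefix meets a mean-quality threshold. The function names, the threshold, and the criterion itself are all assumptions.

```python
def retained_reads(quals, trunc_len, min_mean_q):
    """Count reads at least trunc_len long whose first trunc_len
    quality scores average at least min_mean_q (hypothetical filter)."""
    kept = 0
    for q in quals:
        if len(q) >= trunc_len and sum(q[:trunc_len]) / trunc_len >= min_mean_q:
            kept += 1
    return kept


def best_truncation(quals, candidates, min_mean_q=25.0):
    """Pick the candidate length that maximizes retained bases
    (reads kept x truncation length). Assumed objective, for illustration."""
    return max(candidates, key=lambda n: retained_reads(quals, n, min_mean_q) * n)


if __name__ == "__main__":
    # Three reads whose quality collapses after 10 bases, plus one clean read.
    degraded = [30] * 10 + [10] * 10
    clean = [30] * 20
    reads = [degraded, degraded, degraded, clean]
    print(best_truncation(reads, candidates=[5, 10, 15, 20]))
```

In this toy input, truncating at 10 bases keeps all four reads, whereas longer truncations discard the three degraded reads, which is the trade-off between read count and information that the abstract describes.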
Availability and implementation: AmpWrap is available at: https://github.com/LDoni/AmpWrap.
Title: AmpWrap: a one-line fully automated amplicon metabarcoding 16S and 18S rRNA gene analysis.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf312. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701800/pdf/
Pub Date: 2025-12-01; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf310
James Urban, Roman Joeres, Daniel Bojar
Motivation: As the field of glycobiology has developed, so too have different glycan nomenclature systems. While each system serves specific purposes, this multiplicity creates challenges for usability, data integration, and knowledge sharing across different databases and computational tools.
Results: We present a practical framework for automated nomenclature conversion that takes any glycan nomenclature as input without requiring declaration of the specific language and outputs a canonicalized IUPAC-condensed format as a standardized representation. Our implementation handles all common nomenclatures including WURCS, GlycoCT, IUPAC-condensed/extended, GLYCAM, CSDB-linear, LinearCode, GlycoWorkbench, GlySeeker, Oxford, and KCF, along with common typos, and manages complex cases including structural ambiguities, modifications, uncertainty in linkage information, and different compositional representations. This Universal Input framework can translate more than 10 nomenclatures in <1 ms per glycan, tested on over 150 000 sequences with 98%-100% coverage, enabling seamless integration of existing glycan databases and tools while maintaining the specific advantages of each representation system.
Availability and implementation: Universal Input is implemented within the glycowork Python package, available at https://github.com/BojarLab/glycowork and our web app https://canonicalize.streamlit.app/.
Title: Bridging worlds: connecting glycan representations with glycoinformatics via Universal Input and a canonicalized nomenclature.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf310. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702141/pdf/
Pub Date: 2025-11-24; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf302
James Kitchens, Yan Wong
Motivation: Ancestral recombination graphs (ARGs) are a complete representation of the genetic relationships between recombining lineages and are of central importance in population genetics. Recent breakthroughs in simulation and inference methods have led to a surge of interest in ARGs. However, understanding how best to take advantage of the graphical structure of ARGs remains an open question for researchers. Here, we introduce tskit_arg_visualizer, a Python package for programmatically drawing ARGs using the interactive D3.js visualization library.
Results: We highlight the usefulness of this visualization tool for both teaching ARG concepts and exploring ARGs inferred from empirical datasets.
Availability and implementation: The latest stable version of tskit_arg_visualizer is available through the Python Package Index (https://pypi.org/project/tskit-arg-visualizer, currently v0.1.1). Documentation and the development version of the package are found on GitHub (https://github.com/kitchensjn/tskit_arg_visualizer).
Title: tskit_arg_visualizer: interactive plotting of ancestral recombination graphs.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf302. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701794/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf298
Roman Schefzik, Han Cao, Sivanesan Rajan, Xavier Escribà-Montagut, Juan R González, Emanuel Schwarz
Motivation: Multi-task learning (MTL) enables simultaneous learning of related regression or classification tasks by exploiting shared information. The R package dsMTL provides a computational framework for federated MTL approaches, supporting the analysis of sensitive, individual-level data from geographically distributed data sources using the DataSHIELD platform. While the current architecture provides comprehensive data security mechanisms, these are not specifically tailored to MTL models. In particular, these models may still be vulnerable to membership inference attacks, which use the model to try to determine whether a specific individual was included in a given training set.
Results: To further enhance the privacy-preserving capabilities of dsMTL and protect against such attacks, differential privacy using the Laplace mechanism is integrated into dsMTL as a novel optional feature. This approach aims to obscure individual-level characteristics from the model while retaining group-level differences. The differential privacy implementation is validated in both simulation studies and a case study identifying schizophrenia patients from gene expression data. For practical utility, it is crucial to find an adequate balance between the degree of privacy protection and the conservation of model performance by choosing a reasonable privacy parameter within the differential privacy mechanism.
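To make the Laplace mechanism and the role of the privacy parameter concrete, here is a minimal generic sketch in Python. It is not the dsMTL implementation (dsMTL is an R package, and the abstract does not say which quantities are perturbed). The sketch assumes noise is added to a vector of numeric values, such as model coefficients, whose sensitivity is known; the names `laplace_noise` and `privatize` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(values, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale = sensitivity / epsilon to each value.

    Smaller epsilon means stronger privacy and larger noise.
    `sensitivity` is assumed to be known for the released quantity.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


if __name__ == "__main__":
    rng = random.Random(42)
    coefficients = [0.8, -1.2, 0.05]  # made-up example values
    for eps in (0.1, 1.0, 10.0):
        print(eps, privatize(coefficients, sensitivity=1.0, epsilon=eps, rng=rng))
```

Running the example with several values of `epsilon` shows the trade-off the abstract refers to: a small privacy parameter hides the underlying values behind heavy noise, while a large one leaves them nearly unchanged.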
Availability and implementation: dsMTL is open-source and available at https://github.com/transbioZI/dsMTLBase (server-side) and https://github.com/transbioZI/dsMTLClient (client-side).
Title: Integrating differential privacy into federated multi-task learning algorithms in dsMTL.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf298. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701803/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf299
Jianhong Ou, Kenneth D Poss
Motivation: The three-dimensional organization of the genome plays a critical role in regulating gene expression by shaping the spatial and temporal interactions between regulatory elements. High-throughput chromosome conformation capture (Hi-C) technologies, along with immunoprecipitation- or chromatin accessibility-based chromatin architecture mapping methods, enable the measurement of chromatin dynamics at both bulk and single-cell levels. However, effectively exploring and comparing chromatin structures remains challenging, particularly when integrating multiple layers of genomic annotation or comparing structural dynamics across conditions. While several tools support interactive 3D genome visualization, few provide a flexible, R-integrated framework that supports custom annotations, side-by-side comparison of multiple stages or conditions, and deployment in Shiny applications.
Results: To address this need, we have developed geomeTriD, an R/Bioconductor package that enables interactive visualization of chromatin structures using three.js, supports multi-layer annotation, allows parallel comparison of two chromatin states, and is compatible with Shiny-based analysis workflows. As multi-omic and spatial genomic datasets grow in complexity, geomeTriD will facilitate the reconstruction and comparison of 3D genome structures across conditions, linking chromatin architecture to gene regulation, epigenetic states, and cell-state transitions.
Availability and implementation: geomeTriD is freely available at https://bioconductor.org/packages/geomeTriD.
Title: geomeTriD: a Bioconductor package for interactive and integrative visualization of 3D structural model with multi-omics data.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf299. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702139/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf300
Dimitri Höhler, Julia Haag, Alexey M Kozlov, Benoit Morel, Alexandros Stamatakis
Motivation: The performance of phylogenetic inference tools is commonly evaluated using simulated as well as empirical sequence data alignments. An open question is how representative these alignments are of those commonly analyzed by users. Using the RAxMLGrove database, it is now possible to simulate DNA and amino acid sequences based on more than 70 000 representative RAxML and RAxML-NG tree inferences on empirical datasets conducted on the RAxML web servers. This makes it possible to assess the phylogenetic tree inference accuracy of various inference tools based on more realistic and representative simulated alignments.
Results: To automate this process, we implement PhyloSmew, a tool for benchmarking phylogenetic inference tools. We use it to simulate ∼20 000 multiple sequence alignments (MSAs) based on representative empirical trees (in terms of signal strength) from RAxMLGrove. We subsequently analyze 5000 empirical MSAs from the TreeBASE database to assess the inference accuracy of FastTree2, IQ-TREE2, and RAxML-NG. We find that on quantifiably difficult-to-analyze MSAs, all three tree inference tools perform poorly. Hence, the faster FastTree2 tool constitutes a viable alternative for inferring trees on difficult MSAs. We also find that there are substantial differences between accuracy results on simulated versus empirical data.
Availability and implementation: The data underlying this article are available at https://github.com/angtft/PhyloSmew, https://cme.h-its.org/exelixis/material/accuracy-study/data.tar.gz.
Title: Performance assessment of phylogenetic inference tools using PhyloSmew.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf300. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701799/pdf/
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf297
Nadeem Khan, Muhammad Muneeb Nasir, Ammar Mushtaq, Masood Ur Rehman Kayani
Motivation: Analysis of genomic variation in microbial genomes is crucial for understanding how microbes adapt, interact with their hosts, and influence health and disease. In metagenomic studies, where genetic material from entire microbial communities is sequenced, thousands of single-nucleotide polymorphisms (SNPs) can be detected across species and samples. However, identifying which of these variations have biologically or functionally relevant impacts remains a significant challenge.
Results: To address this, we present SNPraefentia, a Python-based toolkit for prioritizing microbial SNPs based on their predicted functional relevance. The tool integrates multiple biologically meaningful parameters, including sequencing depth, physicochemical impact of amino acid substitutions, and the structural and functional context of mutations within annotated protein domains. SNPraefentia extracts variation depth and amino acid changes, annotates protein domains using UniProt, and computes individual impact scores. These are then integrated into a composite prioritization score that reflects the potential biological importance of each variant. Overall, SNPraefentia provides researchers with a systematic and reproducible approach to filter and rank microbial variants for downstream functional analysis or experimental validation.
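The abstract names the inputs to the composite score (sequencing depth, physicochemical impact, and protein-domain context) but not the formula. The sketch below is therefore a hypothetical illustration of how such individual scores could be combined and used for ranking, not SNPraefentia's actual scoring. The `Variant` fields, the log-scaled depth score, the saturation depth of 100, and the equal weights are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    depth: int              # read depth supporting the variant
    physchem_change: float  # assumed pre-normalized to the range 0..1
    in_domain: bool         # True if the change falls in an annotated domain


def depth_score(depth, saturation=100):
    """Map depth to 0..1 on a log scale, saturating at `saturation` (assumed)."""
    return min(1.0, math.log1p(depth) / math.log1p(saturation))


def composite_score(v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three individual scores (hypothetical weighting)."""
    w_depth, w_chem, w_domain = weights
    return (w_depth * depth_score(v.depth)
            + w_chem * v.physchem_change
            + w_domain * (1.0 if v.in_domain else 0.0))


def rank(variants):
    """Return variants sorted from highest to lowest composite score."""
    return sorted(variants, key=composite_score, reverse=True)


if __name__ == "__main__":
    variants = [
        Variant("low_support_conservative", depth=5, physchem_change=0.1, in_domain=False),
        Variant("well_supported_radical", depth=200, physchem_change=0.9, in_domain=True),
    ]
    for v in rank(variants):
        print(f"{v.name}: {composite_score(v):.3f}")
```

Keeping each component on a common 0..1 scale before weighting is one simple way to stop a single input, such as raw depth, from dominating the ranking.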
Availability and implementation: The toolkit and test data are freely available at https://github.com/muneebdev7/SNPraefentia.
Title: SNPraefentia: a toolkit to prioritize microbial genome variants linked to health and disease.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf297. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12671963/pdf/
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf202
Rui Resende-Pinto, Raquel Ruivo, Josefin Stiller, Rute Fonseca, Luís Filipe C Castro
Summary: High-fidelity genome assemblies provide unprecedented opportunities to decipher mechanisms of molecular evolution and phenotype landscapes. Here, we present PseudoChecker2, a command-line version of the web-tool PseudoChecker with expanded functions. It identifies gene loss via drastic mutational events such as premature stop codons, deletions and insertions. It enables the investigation of cross-species genomic datasets through: (i) integration into automated workflows, (ii) multiprocessing capability, and (iii) creation of a functional reference from annotation files. In addition, we introduce PseudoViz, a novel graphical interface designed to help interpret the results of PseudoChecker2 with intuitive visualizations. These tools combine the versatility and automation of a command-line tool with the user-friendliness of a graphical interface to tackle the challenges of the Genome Era.
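As a rough illustration of the kind of mutational evidence described above, the following Python sketch scans a coding sequence for in-frame premature stop codons and checks whether its length still fits the reading frame. It is a simplified stand-in, not the PseudoChecker2 algorithm, which the summary does not detail. The function names are hypothetical, and the sketch assumes an ungapped coding sequence that starts in frame and uses the standard stop codons TAA, TAG, and TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def find_premature_stops(cds):
    """Return 0-based codon indices of in-frame stop codons that occur
    before the final codon of the sequence."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    return [i for i in range(n_codons - 1)
            if seq[3 * i:3 * i + 3] in STOP_CODONS]


def length_breaks_frame(cds):
    """True if the sequence length is not a multiple of three, which is
    what a net insertion or deletion of 1 or 2 bases would produce."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    intact = "ATGAAAGGGTAA"      # ATG AAA GGG TAA
    disrupted = "ATGTAAGGGTAA"   # ATG TAA GGG TAA
    print(find_premature_stops(intact))      # no premature stop
    print(find_premature_stops(disrupted))   # stop at codon index 1
    print(length_breaks_frame("ATGAAAGGTAA"))
```

Only in-frame codons are checked, so a stop triplet that appears out of frame (for example the TGA spanning the first two codons of ATGACC) is not reported.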
Availability and implementation: PseudoChecker2 and PseudoViz are fully available at https://github.com/rresendepinto/PseudoChecker2 and https://github.com/rresendepinto/PseudoViz.
Title: PseudoChecker2 and PseudoViz: automation and visualization of gene loss in the Genome Era.
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf202. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679834/pdf/
Pub Date: 2025-11-21; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf296
Dmitry N Konanov, Danil V Krivonos, Vladislav V Babenko, Elena N Ilina
Motivation: DNA methylation in bacteria is now studied mainly using single-molecule sequencing technologies such as PacBio and Oxford Nanopore. In nanopore sequencing, methylated positions are called by dedicated models implemented directly in basecallers. Prokaryotic DNA methyltransferases are site-specific enzymes that catalyze methylation at specific methylation motifs. Inference of these motifs is usually performed using third-party software such as MEME, which provides classical motif enrichment. However, such algorithms rely only on sequence data and do not use the additional base modification information provided by the basecaller.
Results: Herein, we present a new tool, Snappy, a rethinking of the original Snapper algorithm that does not use any enrichment heuristics and does not require control sample sequencing. Snappy combines basecalling data processing with a new graph-based enrichment algorithm, thus significantly enhancing the enrichment sensitivity and accuracy. The versatility of the method was shown on both our own and external data, representing different bacterial species with complex and simple methylomes.
Availability and implementation: Source code and documentation are hosted on GitHub (https://github.com/DNKonanov/ont-snappy) and Zenodo (zenodo.org/records/16731817). For accessibility, Snappy is installable from PyPI using the "pip install ont-snappy" command.
Snappy: fast identification of DNA methylation motifs based on Oxford Nanopore reads. Bioinformatics Advances 5(1), vbaf296. Published 2025-11-21. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679398/pdf/
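The motivation above contrasts sequence-only motif enrichment with approaches that also use the basecaller's modification calls. As a rough illustration of why those calls are informative, the sketch below computes, for each short sequence context, the fraction of its occurrences whose central base is called modified. This is a toy example under stated assumptions, not Snappy's graph-based algorithm: the function name, the `flank` parameter, and the example genome and positions are all hypothetical.

```python
from __future__ import annotations

from collections import Counter


def context_modification_fraction(
    genome: str, modified_positions: set[int], flank: int = 1
) -> dict[str, float]:
    """Fraction of occurrences of each context whose central base is modified.

    A context is the window of length 2 * flank + 1 centred on a position.
    Hypothetical helper for illustration only.
    """
    total: Counter[str] = Counter()
    modified: Counter[str] = Counter()
    for i in range(flank, len(genome) - flank):
        context = genome[i - flank : i + flank + 1]
        total[context] += 1
        if i in modified_positions:
            modified[context] += 1
    return {context: modified[context] / n for context, n in total.items()}


# Toy genome with three GATC sites; the A of each site (0-based positions
# 3, 9 and 15) is assumed to be called as methylated by the basecaller.
genome = "TTGATCAAGATCTTGATCCC"
modified_positions = {3, 9, 15}

fractions = context_modification_fraction(genome, modified_positions, flank=1)
# Contexts that are always modified stand out from the background.
candidates = sorted(ctx for ctx, frac in fractions.items() if frac == 1.0)
print(candidates)
```

In this toy input only the context centred on the methylated adenine of GATC is modified at every occurrence, whereas adenines in other contexts are never modified. A sequence-only method sees none of this signal.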
Pub Date: 2025-11-20; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf293
Siddharth Sethi, Emil K Gustavsson, Harpreet Saini, Mina Ryten
Motivation: Long-read RNA sequencing has the potential to accurately quantify transcriptomes and reveal the isoform diversity of disease-causing genes. However, despite recent advances in analysis tools for transcript discovery, long-read RNA sequencing data remain challenging to analyse because hundreds or even thousands of novel transcripts are detected per gene.
Results: Here, we introduce PSQAN, a workflow that helps researchers prioritize high-confidence and potentially biologically relevant transcripts associated with candidate genes and makes transcript characterization results more interpretable. PSQAN performs a gene-based analysis on characterized transcripts generated by SQANTI3 and TALON. It re-groups transcripts into easily interpretable categories to facilitate their prioritization, supports transcript-level expression thresholds, and generates visualizations to help determine optimal expression thresholds. Overall, we demonstrate that PSQAN is a useful tool that enables users to identify known and novel transcripts of potential biological importance.
Availability and implementation: PSQAN is an analysis workflow implemented in Snakemake and R and is licensed under the GNU General Public License version 3. The source code and documentation of this tool are available at https://github.com/sid-sethi/PSQAN.
PSQAN: a pipeline to prioritize novel and biologically relevant transcripts from long-read RNA sequencing. Bioinformatics Advances 5(1), vbaf293. Published 2025-11-20. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701792/pdf/
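The abstract above mentions transcript-level expression thresholds and choosing an optimal threshold. A minimal sketch of that idea, assuming a simple in-memory table, counts how many transcripts per category survive each candidate threshold. PSQAN itself is implemented in Snakemake and R, so this Python snippet is only an illustration: the records, the category labels "known" and "novel", the expression units, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical records: (transcript_id, category, expression in arbitrary units).
transcripts = [
    ("tx1", "known", 120.0),
    ("tx2", "novel", 0.4),
    ("tx3", "novel", 15.0),
    ("tx4", "known", 2.5),
    ("tx5", "novel", 0.9),
]


def count_by_category(records, min_expression):
    """Count transcripts per category with expression >= min_expression."""
    return Counter(
        category
        for _, category, expression in records
        if expression >= min_expression
    )


# Scanning several thresholds shows how quickly each category shrinks,
# which is the kind of summary one would plot to pick a cut-off.
for threshold in (0.0, 1.0, 10.0):
    print(threshold, dict(count_by_category(transcripts, threshold)))
```

In this toy table, raising the threshold from 0 to 1 removes two of the three novel transcripts and leaves both known ones. A pattern like this suggests that low-expression novel calls dominate the unfiltered set.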