Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf298
Roman Schefzik, Han Cao, Sivanesan Rajan, Xavier Escribà-Montagut, Juan R González, Emanuel Schwarz
Motivation: Multi-task learning (MTL) enables simultaneous learning of related regression or classification tasks by exploiting shared information. The R package dsMTL provides a computational framework for federated MTL approaches, supporting the analysis of sensitive, individual-level data from geographically distributed data sources using the DataSHIELD platform. While the current architecture provides comprehensive data security mechanisms, these are not specifically tailored to MTL models. In particular, these models may still be vulnerable to membership inference attacks, which attempt to determine from the model whether a specific individual was included in a given training set.
Results: To further enhance the privacy-preserving capabilities of dsMTL and protect against such attacks, differential privacy using the Laplace mechanism is integrated into dsMTL as a novel optional feature. This approach aims to obscure individual-level characteristics from the model while retaining group-level differences. The differential privacy implementation is validated in both simulation studies and a case study identifying schizophrenia patients from gene expression data. For practical utility, it is crucial to find an adequate balance between the degree of privacy protection and the conservation of model performance by choosing a reasonable privacy parameter within the differential privacy mechanism.
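The Laplace mechanism mentioned above can be illustrated with a minimal sketch. This is not the dsMTL implementation: the abstract does not say where in the federated procedure the noise is injected, so the function names, the example coefficients, and the choice to perturb a released coefficient vector are all assumptions made for illustration. The sketch also assumes that `sensitivity` is the L1 sensitivity of the released vector, and that `epsilon` plays the role of the privacy parameter.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centred Laplace distribution with the given scale."""
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def privatize(values, sensitivity, epsilon, rng=None):
    """Add independent Laplace noise with scale sensitivity / epsilon to each value."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical model coefficients (illustrative values only).
coefficients = [0.8, -1.2, 0.05]

# A smaller epsilon gives a larger noise scale, i.e. stronger privacy but less utility.
strong_privacy = privatize(coefficients, sensitivity=1.0, epsilon=0.1, rng=random.Random(1))
weak_privacy = privatize(coefficients, sensitivity=1.0, epsilon=10.0, rng=random.Random(1))
```

Comparing `strong_privacy` and `weak_privacy` against `coefficients` shows the trade-off the abstract describes: the privacy parameter directly controls how far the released values drift from the originals.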
Availability and implementation: dsMTL is open-source and available at https://github.com/transbioZI/dsMTLBase (server-side) and https://github.com/transbioZI/dsMTLClient (client-side).
Title: Integrating differential privacy into federated multi-task learning algorithms in dsMTL. Journal: Bioinformatics Advances 5(1), vbaf298. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701803/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf299
Jianhong Ou, Kenneth D Poss
Motivation: The three-dimensional organization of the genome plays a critical role in regulating gene expression by shaping the spatial and temporal interactions between regulatory elements. High-throughput chromosome conformation capture (Hi-C) technologies, along with immunoprecipitation- or chromatin accessibility-based chromatin architecture mapping methods, enable the measurement of chromatin dynamics at both bulk and single-cell levels. However, effectively exploring and comparing chromatin structures remains challenging, particularly when integrating multiple layers of genomic annotation or comparing structural dynamics across conditions. While several tools support interactive 3D genome visualization, few provide a flexible, R-integrated framework that supports custom annotations, side-by-side comparison of multiple stages or conditions, and deployment in Shiny applications.
Results: To address this need, we have developed geomeTriD, an R/Bioconductor package that enables interactive visualization of chromatin structures using three.js, supports multi-layer annotation, allows parallel comparison of two chromatin states, and is compatible with Shiny-based analysis workflows. As multi-omic and spatial genomic datasets grow in complexity, geomeTriD will facilitate the reconstruction and comparison of 3D genome structures across conditions, linking chromatin architecture to gene regulation, epigenetic states, and cell-state transitions.
Availability and implementation: geomeTriD is freely available at https://bioconductor.org/packages/geomeTriD.
Title: geomeTriD: a Bioconductor package for interactive and integrative visualization of 3D structural model with multi-omics data. Journal: Bioinformatics Advances 5(1), vbaf299. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702139/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf300
Dimitri Höhler, Julia Haag, Alexey M Kozlov, Benoit Morel, Alexandros Stamatakis
Motivation: The performance of phylogenetic inference tools is commonly evaluated using simulated as well as empirical sequence data alignments. An open question is how representative these alignments are of those commonly analyzed by users. Using the RAxMLGrove database, it is now possible to simulate DNA and amino acid sequences based on more than 70 000 representative RAxML and RAxML-NG tree inferences on empirical datasets conducted on the RAxML web servers. This makes it possible to assess the phylogenetic tree inference accuracy of various inference tools based on more realistic and representative simulated alignments.
Results: To automate this process, we implement PhyloSmew, a tool for benchmarking phylogenetic inference tools. We use it to simulate ∼20 000 multiple sequence alignments (MSAs) based on representative empirical trees (in terms of signal strength) from RAxMLGrove. We subsequently analyze 5000 empirical MSAs from the TreeBASE database to assess the inference accuracy of FastTree2, IQ-TREE2, and RAxML-NG. We find that on quantifiably difficult-to-analyze MSAs, all three tree inference tools perform poorly. Hence, the faster FastTree2 tool constitutes a viable alternative for inferring trees on difficult MSAs. We also find that there are substantial differences between accuracy results on simulated versus empirical data.
Availability and implementation: The data underlying this article are available at https://github.com/angtft/PhyloSmew, https://cme.h-its.org/exelixis/material/accuracy-study/data.tar.gz.
Title: Performance assessment of phylogenetic inference tools using PhyloSmew. Journal: Bioinformatics Advances 5(1), vbaf300. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701799/pdf/
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf297
Nadeem Khan, Muhammad Muneeb Nasir, Ammar Mushtaq, Masood Ur Rehman Kayani
Motivation: Analysis of genomic variation in microbial genomes is crucial for understanding how microbes adapt, interact with their hosts, and influence health and disease. In metagenomic studies, where genetic material from entire microbial communities is sequenced, thousands of single-nucleotide polymorphisms can be detected across species and samples. However, identifying which of these variations have biologically or functionally relevant impacts remains a significant challenge.
Results: To address this, we present SNPraefentia, a Python-based toolkit for prioritizing microbial SNPs based on their predicted functional relevance. The tool integrates multiple biologically meaningful parameters, including sequencing depth, physicochemical impact of amino acid substitutions, and the structural and functional context of mutations within annotated protein domains. SNPraefentia extracts variation depth and amino acid changes, annotates protein domains using UniProt, and computes individual impact scores. These are then integrated into a composite prioritization score that reflects the potential biological importance of each variant. Overall, SNPraefentia provides researchers with a systematic and reproducible approach to filter and rank microbial variants for downstream functional analysis or experimental validation.
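The idea of folding several per-variant impact scores into one composite prioritization score can be sketched as follows. This is an illustrative assumption, not SNPraefentia's actual formula: the abstract does not give the scoring functions or weights, so `depth_score`, `composite_score`, the equal default weights, the weighted-mean form, and the example variants are all hypothetical.

```python
def depth_score(variant_depth, total_depth):
    """Fraction of reads supporting the variant, clipped to the range [0, 1]."""
    if total_depth <= 0:
        return 0.0
    return max(0.0, min(1.0, variant_depth / total_depth))


def composite_score(depth, physicochemical, domain, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of three sub-scores, each assumed to lie in [0, 1]."""
    parts = (depth, physicochemical, domain)
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, parts)) / total_weight


# Hypothetical variants with made-up sub-scores, for illustration only.
variants = [
    {"id": "snp_b", "depth": depth_score(5, 50), "physchem": 0.2, "domain": 0.0},
    {"id": "snp_a", "depth": depth_score(45, 50), "physchem": 0.8, "domain": 1.0},
]

# Rank variants from highest to lowest composite score.
ranked = sorted(
    variants,
    key=lambda v: composite_score(v["depth"], v["physchem"], v["domain"]),
    reverse=True,
)
```

A weighted mean keeps the composite on the same [0, 1] scale as its inputs, which makes rankings easy to compare across samples; other aggregation rules are equally plausible.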
Availability and implementation: The toolkit and test data are freely available at https://github.com/muneebdev7/SNPraefentia.
Title: SNPraefentia: a toolkit to prioritize microbial genome variants linked to health and disease. Journal: Bioinformatics Advances 5(1), vbaf297. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12671963/pdf/
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf202
Rui Resende-Pinto, Raquel Ruivo, Josefin Stiller, Rute Fonseca, Luís Filipe C Castro
Summary: High-fidelity genome assemblies provide unprecedented opportunities to decipher mechanisms of molecular evolution and phenotype landscapes. Here, we present PseudoChecker2, a command-line version of the web-tool PseudoChecker with expanded functions. It identifies gene loss via drastic mutational events such as premature stop codons, deletions and insertions. It enables the investigation of cross-species genomic datasets through: (i) integration into automated workflows, (ii) multiprocessing capability, and (iii) creation of a functional reference from annotation files. In addition, we introduce PseudoViz, a novel graphical interface designed to help interpret the results of PseudoChecker2 with intuitive visualizations. These tools combine the versatility and automation of a command-line tool with the user-friendliness of a graphical interface to tackle the challenges of the Genome Era.
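One of the mutational events named above, the premature stop codon, can be detected with a few lines of code. The sketch below is not PseudoChecker2's algorithm; it only assumes a coding sequence that starts in frame and uses the standard genetic code, and the function names are hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stop_positions(cds):
    """Return 0-based nucleotide offsets of in-frame stop codons before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    positions = []
    # The last complete codon is skipped because a stop there is the expected terminator.
    for i in range(n_codons - 1):
        codon = cds[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            positions.append(3 * i)
    return positions


def length_breaks_frame(cds):
    """True when the sequence length is not a multiple of three, hinting at an indel."""
    return len(cds) % 3 != 0


intact = "ATGAAAGGGTAA"        # ATG AAA GGG TAA
disrupted = "ATGAAATAGGGGTAA"  # ATG AAA TAG GGG TAA
```

Here `premature_stop_positions(disrupted)` reports the in-frame `TAG` at offset 6, while the intact sequence yields an empty list.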
Availability and implementation: PseudoChecker2 and PseudoViz are fully available at https://github.com/rresendepinto/PseudoChecker2 and https://github.com/rresendepinto/PseudoViz.
Title: PseudoChecker2 and PseudoViz: automation and visualization of gene loss in the Genome Era. Journal: Bioinformatics Advances 5(1), vbaf202. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679834/pdf/
Pub Date: 2025-11-21; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf296
Dmitry N Konanov, Danil V Krivonos, Vladislav V Babenko, Elena N Ilina
Motivation: DNA methylation in bacteria is now studied mainly using single-molecule sequencing technologies such as PacBio and Oxford Nanopore. In nanopore sequencing, methylated positions are called by dedicated models implemented directly in basecallers. Prokaryotic DNA methyltransferases are site-specific enzymes that catalyze methylation in specific methylation motifs. Inference of these motifs is usually performed using third-party software such as MEME, which provides classical motif enrichment. However, currently used motif enrichment algorithms rely only on sequence data and do not use the additional base modification information provided by the basecaller.
Results: Herein, we present a new tool, Snappy, which rethinks the original Snapper algorithm but does not use any enrichment heuristics and does not require control sample sequencing. Snappy combines basecalling data processing with a new graph-based enrichment algorithm, thus significantly enhancing the enrichment sensitivity and accuracy. The versatility of the method was shown on both our own and external data, representing different bacterial species with complex and simple methylomes.
Availability and implementation: Source code and documentation are hosted on GitHub (https://github.com/DNKonanov/ont-snappy) and Zenodo (zenodo.org/records/16731817). For accessibility, Snappy is installable from PyPI using the "pip install ont-snappy" command.
Title: Snappy: fast identification of DNA methylation motifs based on oxford nanopore reads. Journal: Bioinformatics Advances 5(1), vbaf296. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12679398/pdf/
Pub Date: 2025-11-20; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf293
Siddharth Sethi, Emil K Gustavsson, Harpreet Saini, Mina Ryten
Motivation: Long-read RNA sequencing has the potential to accurately quantify transcriptomes and reveal the isoform diversity of disease-causing genes. However, despite the recent advances in analysis tools for transcript discovery, long-read RNA sequencing data is still challenging to analyse, due to the detection of hundreds or even thousands of novel transcripts per gene.
Results: Here, we introduce PSQAN, a workflow to help researchers prioritize high-confidence and potentially biologically relevant transcripts associated with candidate genes and make transcript characterization results more interpretable. PSQAN performs a gene-based analysis on characterized transcripts generated by SQANTI3 and TALON. PSQAN re-groups transcripts into easily interpretable categories to facilitate their prioritization, allows transcript-level expression thresholds, and generates visualizations to determine optimal expression thresholds. Overall, we demonstrate that PSQAN is a useful tool which enables users to identify known and novel transcripts of potential biological importance.
Availability and implementation: PSQAN is an analysis workflow implemented in Snakemake and R and is licensed under the GNU General Public License version 3. The source code and documentation of this tool are available at https://github.com/sid-sethi/PSQAN.
Title: PSQAN: a pipeline to prioritize novel and biologically relevant transcripts from long-read RNA sequencing. Journal: Bioinformatics Advances 5(1), vbaf293. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701792/pdf/
Pub Date: 2025-11-20; eCollection Date: 2026-01-01; DOI: 10.1093/bioadv/vbaf285
Yijun Li, Stefan Stanojevic, Bing He, Zheng Jing, Qianhui Huang, Jian Kang, Lana X Garmire
Motivation: Spatial transcriptomics has allowed researchers to analyze transcriptome data in its tissue sample's spatial context. Various methods have been developed for detecting spatially variable genes (SV genes), whose gene expression over the tissue space shows strong spatial autocorrelation. Such genes are often used to define clusters in cells or spots downstream. However, highly variable (HV) genes, whose quantitative gene expressions show significant variation from cell to cell, are conventionally used in clustering analyses.
Results: In this report, we investigate whether adding highly variable genes to spatially variable genes can improve the cell type clustering performance in spatial transcriptomics data. We tested the clustering performance of HV genes, SV genes, and the union of both gene sets (concatenation) on over 50 real spatial transcriptomics datasets across multiple platforms, using a variety of spatial and non-spatial metrics. Our results show that combining HV genes and SV genes can improve overall cell-type clustering performance.
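Taking the union of the two gene sets before clustering can be sketched as below. This is a simplified illustration rather than the study's pipeline: ranking genes by raw variance stands in for a proper HV gene selection method, the SV gene list is assumed to come from a separate spatial method, and all names and values are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, n_top):
    """Rank genes by variance across cells or spots and keep the n_top highest.

    expression maps each gene name to its list of expression values.
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:n_top]


def combine_gene_sets(hv_genes, sv_genes):
    """Order-preserving union: HV genes first, then SV genes not already included."""
    seen = set()
    combined = []
    for gene in list(hv_genes) + list(sv_genes):
        if gene not in seen:
            seen.add(gene)
            combined.append(gene)
    return combined


# Toy expression values for three genes across four spots (made up for illustration).
expression = {
    "geneA": [0.0, 5.0, 0.0, 5.0],
    "geneB": [1.0, 1.0, 1.0, 1.0],
    "geneC": [2.0, 3.0, 2.0, 3.0],
}

hv_genes = top_variable_genes(expression, n_top=2)
sv_genes = ["geneB", "geneA"]  # assumed output of a separate SV gene detection method
features = combine_gene_sets(hv_genes, sv_genes)
```

The resulting `features` list would then define the columns of the matrix passed to the clustering step; genes found by both criteria appear only once.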
Availability and implementation: All data and code used in this evaluation study can be found at the following link: https://github.com/lanagarmire/ST_benchmark.
Title: Adding highly variable genes to spatially variable genes can improve cell type clustering performance in spatial transcriptomics data. Journal: Bioinformatics Advances 6(1), vbaf285. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12809558/pdf/
Motivation: Cancer immunotherapy uses the immune system to recognize and eliminate tumor cells by presenting tumor antigens through Human Leukocyte Antigen (HLA) molecules. Accurate prediction of HLA-peptide interactions is essential for personalized immunotherapy development. Allele-specific models achieve high accuracy and handle variable peptide lengths but require separate training for each allele, limiting scalability to rare or unseen HLAs. Pan-specific models generalize across multiple alleles and match or surpass allele-specific methods. Ensemble methods improve prediction by combining outputs from multiple predictors, often via linear combinations, though nonlinear strategies may better capture HLA-peptide complexities. We propose SiaScoreNet, a three-step predictive pipeline enhancing HLA-peptide interaction prediction. First, ESM, a pretrained transformer-based protein language model, embeds HLA and peptide sequences into fixed-length representations, accommodating varying sequence lengths. Second, we integrate predicted scores from state-of-the-art models into a comprehensive feature vector. Third, a nonlinear ensemble strategy combines features, capturing complex dependencies and boosting performance.
Results: Benchmark evaluations show SiaScoreNet outperforms existing models in accuracy, comparable to TransPHLA, BigMHC, and CapHLA. Recent models prioritize recall over precision, valuable for identifying potential binders but resource-intensive. SiaScoreNet offers improved performance and runtime efficiency compared to these models, evaluated against HPV viruses for HLA-peptide prediction.
Availability and implementation: The data and source code for prediction and experiments presented in this study are publicly available in the SiaScoreNet repository hosted on GitHub: https://github.com/CBRC-lab/SiaScoreNet.
{"title":"SiaScoreNet: a siamese neural network-based model integrating prediction scores for HLA-peptide interaction prediction.","authors":"Mahsa Saadat, Fatemeh Zare-Mirakabad, Milad Besharatifard","doi":"10.1093/bioadv/vbaf248","DOIUrl":"10.1093/bioadv/vbaf248","url":null,"abstract":"<p><strong>Motivation: </strong>Cancer immunotherapy uses the immune system to recognize and eliminate tumor cells by presenting tumor antigens through Human Leukocyte Antigen (HLA) molecules. Accurate prediction of HLA-peptide interactions is essential for personalized immunotherapy development. Allele-specific models achieve high accuracy and handle variable peptide lengths but require separate training for each allele, limiting scalability to rare or unseen HLAs. Pan-specific models generalize across multiple alleles and match or surpass allele-specific methods. Ensemble methods improve prediction by combining outputs from multiple predictors, often via linear combinations, though nonlinear strategies may better capture HLA-peptide complexities.We propose <i>SiaScoreNet</i>, a three-step predictive pipeline enhancing HLA-peptide interaction prediction. First, ESM, a pretrained transformer-based protein language model, embeds HLA and peptide sequences into fixed-length representations, accommodating varying sequence lengths. Second, we integrate predicted scores from state-of-the-art models into a comprehensive feature vector. Third, a nonlinear ensemble strategy combines features, capturing complex dependencies and boosting performance.</p><p><strong>Results: </strong>Benchmark evaluations show <i>SiaScoreNet</i> outperforms existing models in accuracy, comparable to TransPHLA, BigMHC, and CapHLA. Recent models prioritize recall over precision, valuable for identifying potential binders but resource-intensive. 
<i>SiaScoreNet</i> offers improved performance and runtime efficiency compared to these models, evaluated against HPV viruses for HLA-peptide prediction.</p><p><strong>Availability and implementation: </strong>The data and source code for prediction and experiments presented in this study is publicly available in the <i>SiaScoreNet</i> repository hosted on GitHub: https://github.com/CBRC-lab/SiaScoreNet.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf248"},"PeriodicalIF":2.8,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12641608/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145607747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-18 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf294
Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska
Summary: Recent advances in artificial intelligence have accelerated the adoption of machine learning (ML) in biology, enabling powerful predictive models across diverse applications. However, in scientific research, the need for interpretability and mechanistic insight remains crucial. To address this, we introduce MxlPy, a Python package that combines mechanistic modelling with ML to deliver explainable, data-informed solutions. MxlPy facilitates mechanistic learning, an emerging approach that integrates the transparency of mathematical models with the flexibility of data-driven methods. By streamlining tasks such as data integration, model formulation, output analysis, and surrogate modelling, MxlPy enhances the modelling experience without sacrificing interpretability. Designed for both computational biologists and interdisciplinary researchers, it supports the development of accurate, efficient, and explainable models, making it a valuable tool for advancing bioinformatics, systems biology, and biomedical research.
Availability and implementation: MxlPy source code is freely available at https://github.com/Computational-Biology-Aachen/MxlPy. The full documentation, with features and examples, can be found at https://computational-biology-aachen.github.io/MxlPy.
{"title":"MxlPy-Python package for mechanistic learning and hybrid modelling in life science.","authors":"Marvin van Aalst, Tim Nies, Tobias Pfennig, Anna Matuszyńska","doi":"10.1093/bioadv/vbaf294","DOIUrl":"10.1093/bioadv/vbaf294","url":null,"abstract":"<p><strong>Summary: </strong>Recent advances in artificial intelligence have accelerated the adoption of machine learning (ML) in biology, enabling powerful predictive models across diverse applications. However, in scientific research, the need for interpretability and mechanistic insight remains crucial. To address this, we introduce MxlPy, a Python package that combines mechanistic modelling with ML to deliver explainable, data-informed solutions. MxlPy facilitates mechanistic learning, an emerging approach that integrates the transparency of mathematical models with the flexibility of data-driven methods. By streamlining tasks such as data integration, model formulation, output analysis, and surrogate modelling, MxlPy enhances the modelling experience without sacrificing interpretability. Designed for both computational biologists and interdisciplinary researchers, it supports the development of accurate, efficient, and explainable models, making it a valuable tool for advancing bioinformatics, systems biology, and biomedical research.</p><p><strong>Availability and implementation: </strong>MxlPy source code is freely available at https://github.com/Computational-Biology-Aachen/MxlPy. 
The full documentation with features and examples can be found here https://computational-biology-aachen.github.io/MxlPy.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf294"},"PeriodicalIF":2.8,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12668773/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145662876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}