Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag049
Chaojie Wang, Xin Yu
Motivation: Capturing spatial structure is fundamental to the analysis of spatial transcriptomics data. However, most existing methods focus on clustering within individual tissue slices and often ignore the high inter-slice similarity inherent in multi-slice datasets.
Results: To address this limitation, we propose STransfer, a novel transfer learning framework that combines graph convolutional networks (GCNs) with positive pointwise mutual information (PPMI) to model both local and global spatial dependencies. An attention-based module is introduced to fuse features from multiple graphs into unified node representations, facilitating the learning of low-dimensional embeddings that jointly encode gene expression and spatial context. By transferring knowledge from labeled slices to adjacent unlabeled ones, STransfer significantly enhances clustering accuracy while reducing manual annotation costs. Extensive experiments demonstrate that STransfer consistently outperforms state-of-the-art methods in both spatial modeling and cross-slice transfer performance.
Availability and implementation: The code for STransfer has been uploaded to GitHub: https://github.com/Saki-JSU/Publications/tree/main/STransfer.
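The PPMI component mentioned in the Results can be illustrated with a minimal sketch. It is not taken from the STransfer repository: the function name `ppmi` is hypothetical, and using the raw spot adjacency matrix as the co-occurrence counts is an assumption (random-walk visit counts are a common alternative).

```python
import numpy as np

def ppmi(cooc: np.ndarray) -> np.ndarray:
    """Positive pointwise mutual information of a co-occurrence matrix.

    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ); negative and undefined
    entries are clipped to zero to give PPMI.
    """
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # marginal counts per row
    col = cooc.sum(axis=0, keepdims=True)   # marginal counts per column
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row @ col))
    pmi[~np.isfinite(pmi)] = 0.0            # log(0) and 0/0 carry no signal
    return np.maximum(pmi, 0.0)

# Toy example (assumed data): three spots on a line, 0 - 1 - 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
ppmi_matrix = ppmi(adjacency)
```

The resulting matrix can serve as a second, reweighted graph alongside the plain spatial adjacency, which is one way a model could see both local and more global structure.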
STransfer: a transfer learning-enhanced graph convolutional network for clustering spatial transcriptomics data. Bioinformatics (Oxford, England), 2026-02-03.
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag008
Adriana Carolina Gonzalez-Cavazos, Roger Tu, Meghamala Sinha, Andrew I Su
Drug repositioning offers a cost-effective alternative to traditional drug development by identifying new uses for existing drugs. Recent advances leverage Graph Neural Networks (GNNs) to model complex biological data, showing promise in predicting novel drug-disease associations; however, these frameworks often lack explainability, a critical factor for validating predictions and understanding drug mechanisms. Here, we introduce Drug-Based Reasoning Explainer (DBR-X), an explainable GNN model that integrates a link-prediction module with a path-identification module to generate interpretable and faithful explanations. When benchmarked against other GNN-based link-prediction frameworks, DBR-X achieves superior performance in identifying known drug-disease associations, demonstrating higher accuracy across all evaluation metrics. The quality of the biological explanations produced by DBR-X was evaluated through multiple complementary approaches, including comparison with manually curated drug mechanisms, assessment of explanation faithfulness using deletion and insertion studies, and measurement of stability under graph perturbations. Together, these results show that DBR-X advances the state of the art in drug repositioning while providing multi-hop mechanistic explanations that can facilitate the translation of computational predictions into clinical applications. Availability and implementation: The DBR-X package is freely accessible from the online repository at https://github.com/SuLab/DBR-X.
A case-based explainable graph neural network framework for mechanistic drug repositioning. Bioinformatics (Oxford, England), 2026-02-03.
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag041
Anja Hess, Dominik Seelow, Helene Kretzmer
Summary: DNAvi is a Python-based tool for rapid grouped analysis and visualization of cell-free DNA fragment size profiles directly from electrophoresis data, overcoming the need for sequencing in basic fragmentomic screenings. It enables normalization, statistical comparison, and publication-ready plotting of multiple samples, supporting quality control and exploratory fragmentomics in clinical and research workflows.
Availability and implementation: DNAvi is implemented in Python and freely available on GitHub at https://github.com/anjahess/DNAvi under a GNU General Public License v3.0, along with source code, documentation, and examples. An archived version is available under https://doi.org/10.5281/zenodo.18401705.
DNAvi: integration, statistics, and visualization of cell-free DNA fragment traces. Bioinformatics (Oxford, England), 2026-02-03.
Motivation: The American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines represent the gold standard for clinical variant interpretation. Despite the widespread adoption of ACMG/AMP guidelines, a comprehensive comparison of the software tools designed to implement them has been lacking. This represents a significant gap, as clinicians require evidence-based guidance on which tools to use in their practice.
Results: We benchmarked four ACMG/AMP-based tools (Franklin, InterVar, TAPES, Genebe) selected from 22 tools, and compared their performance with LIRICAL, a top-performing phenotype-driven tool, using 151 expert-curated datasets from Mendelian disorders. Selection criteria included free availability, VCF compatibility, operational reliability, and not being disease-specific. Our evaluation framework assessed top-N accuracy (N = 1, 5, 10, 20, 50), retention rates, precision, recall, F1 scores, and area under the curve (AUC). Statistical validation employed bootstrap confidence intervals (n = 1000) and Friedman tests. LIRICAL (68.21%) and Franklin (61.59%) demonstrated superior top-10 variant prioritization accuracy in Mendelian disorders, significantly outperforming other tools (P = .0000). Results demonstrate that tools with advanced phenotypic integration significantly outperform those relying primarily on genomic features.
Availability and implementation: All data and source code required to reproduce the findings of this study are openly available in the Code Ocean repository at https://doi.org/10.24433/CO.6562438.v1.
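The top-N accuracy and bootstrap confidence intervals named in the Results can be sketched as follows. This is an illustrative outline, not the study's code: the function names, the percentile method for the interval, and the toy ranks are all assumptions.

```python
import random

def top_n_accuracy(ranks, n):
    """Fraction of cases whose causal variant is ranked within the top n.

    `ranks` holds the 1-based rank of the known causal variant per case,
    or None when the tool did not retain the variant at all.
    """
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

def bootstrap_ci(ranks, n, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for top-n accuracy (cases resampled with replacement)."""
    rng = random.Random(seed)
    stats = sorted(
        top_n_accuracy([rng.choice(ranks) for _ in ranks], n)
        for _ in range(n_boot)
    )
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Toy ranks for five cases (assumed data); None means the variant was filtered out.
toy_ranks = [1, 3, 12, None, 7]
top10 = top_n_accuracy(toy_ranks, 10)
interval = bootstrap_ci(toy_ranks, 10)
```

Counting unranked variants as misses keeps the retention rate and the top-N accuracy consistent with each other, which matters when tools filter variants differently.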
Tohid Ghasemnejad, Yuheng Liang, Khadijeh Hoda Jahanian, Milad Eidi, Arash Salmaninejad, Seyedeh Sedigheh Abedini, Fabrizzio Horta, Nigel H Lovell, Thantrira Porntaveetus, Mark Grosser, Mahmoud Aarabi, Hamid Alinejad-Rokny. Comprehensive evaluation of ACMG/AMP-based variant classification tools. Bioinformatics (Oxford, England), 2026-02-03. DOI: 10.1093/bioinformatics/btaf623
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag044
Christoph Stelz, Lukas Hübner, Alexandros Stamatakis
Motivation: Phylogenetic trees describe the evolutionary history among biological species based on their genomic data. Maximum likelihood (ML) based phylogenetic inference tools search for the tree and evolutionary model that best explain the observed genomic data. Given the independence of likelihood score calculations between different genomic sites, parallel computation is commonly deployed. This is followed by a parallel summation over the per-site scores to obtain the overall likelihood score of the tree. However, basic arithmetic operations on IEEE 754 floating-point numbers, such as addition and multiplication, inherently introduce rounding errors. Consequently, the order by which floating-point operations are executed affects the exact resulting likelihood value since these operations are not associative. Moreover, parallel reduction algorithms in numerical codes re-associate operations as a function of the core count and cluster network topology, inducing different round-off errors. These low-level deviations can cause heuristic searches to diverge and induce high-level result discrepancies (e.g. yield topologically distinct phylogenies). This effect has also been observed in multiple scientific fields beyond phylogenetics.
Results: We observe that varying the degree of parallelism results in diverging phylogenetic tree searches (high-level results) for over 31% out of 10 179 empirical datasets. More importantly, 8% of these diverging datasets yield trees that are statistically significantly worse than the best-known ML tree for the dataset (AU-test, P < .05). To alleviate this, we develop a variant of the widely used phylogenetic inference tool RAxML-NG, which does yield bit-reproducible results under varying core-counts, with a slowdown of only 0%-12.7% (median 0.8%) on up to 768 cores. For this, we introduce the ReproRed reduction algorithm, which yields bit-identical results under varying core-counts, by maintaining a fixed operation order that is independent of the communication pattern. ReproRed is thus applicable to all associative reduction operations, in contrast to competitors, which are confined to summation. Our ReproRed reduction algorithm only exchanges the theoretical minimum number of messages, overlaps communication with computation, and utilizes fast base-cases for local reductions. ReproRed is able to all-reduce (via a subsequent broadcast) 4.1×10^6 operands across 48-768 cores in 19.7-48.61 μs, thereby exhibiting a slowdown of 13%-93% over a non-reproducible all-reduce algorithm. ReproRed outperforms the state-of-the-art reproducible all-reduction algorithm ReproBLAS (offers summation only) beyond 10 000 elements per core. In summary, we re-assess non-reproducibility in parallel phylogenetic inference, present the first bit-reproducible parallel phylogenetic inference tool, as well as introduce a general algorithm and open-source code for conducting reproducible associative par
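The non-associativity described in the Motivation, and the idea of a fixed operation order, can be shown with a small sketch. It is not ReproRed: the function names are hypothetical, the "cores" are simulated by slicing a list, and the example values are chosen only to make the rounding visible.

```python
from functools import reduce
from operator import add

def chunked_sum(values, n_chunks):
    """Simulated parallel sum: each 'core' folds its chunk left to right,
    then the per-core partial sums are folded. The order of additions
    therefore changes with n_chunks."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [reduce(add, values[i:i + size]) for i in range(0, len(values), size)]
    return reduce(add, partials)

def fixed_tree_sum(values):
    """Sum over a binary tree whose shape depends only on the number of
    elements. Any assignment of subtrees to cores performs exactly the
    same additions, so the bits of the result never change."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return fixed_tree_sum(values[:mid]) + fixed_tree_sum(values[mid:])

# Floating-point addition is not associative:
left_first = (0.1 + 0.2) + 0.3    # 0.6000000000000001
right_first = 0.1 + (0.2 + 0.3)   # 0.6

scores = [1e16, 1.0, -1e16, 1.0]
one_core = chunked_sum(scores, 1)    # 1.0
two_cores = chunked_sum(scores, 2)   # 0.0
tree = fixed_tree_sum(scores)        # 0.0 for every distribution of the data
```

A fixed order buys reproducibility, not accuracy: the exact sum of `scores` is 2.0, and neither ordering recovers it. The explicit fold is used instead of Python's built-in `sum`, which applies compensated summation to floats in recent versions and would hide the effect.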
Bit-reproducible parallel phylogenetic tree inference. Bioinformatics (Oxford, England), 2026-02-03.
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang
Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) framework that simultaneously integrates intra- and cross-omics feature interactions into a shared low-dimensional representation (see Fig. 1). By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.
Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.
Availability and implementation: The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).
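To make the phrase "network-regularized NMF" concrete, the sketch below implements a generic graph-regularized NMF with multiplicative updates. It is not the X-intNMF objective or code: the function names, the single feature network `A`, and the weight `lam` are assumptions. In a multi-omics setting one could imagine stacking the features of all omics layers and letting `A` hold both intra- and cross-omics edges.

```python
import numpy as np

def graph_regularized_nmf(X, A, k, lam=0.1, n_iter=200, seed=0, eps=1e-10):
    """Approximately minimise ||X - W H||_F^2 + lam * tr(W^T L W), with L = D - A.

    X: non-negative (features x samples) data matrix.
    A: symmetric non-negative (features x features) interaction network.
    Connected features are pulled towards similar rows of W.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k))
    H = rng.random((k, n_samples))
    D = np.diag(A.sum(axis=1))  # degree matrix of the network
    for _ in range(n_iter):
        # Standard multiplicative update for the sample factors.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # Update for the feature factors with the network term split into
        # its attractive (A) and repulsive (D) parts.
        W *= (X @ H.T + lam * (A @ W)) / (W @ H @ H.T + lam * (D @ W) + eps)
    return W, H

def objective(X, A, W, H, lam):
    """Reconstruction error plus the graph Laplacian penalty."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)

# Toy data (assumed): 6 features, 5 samples, a ring network over the features.
rng = np.random.default_rng(1)
X = rng.random((6, 5))
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
W, H = graph_regularized_nmf(X, A, k=2)
```

The rows of `H` give the shared low-dimensional sample representation that downstream classifiers or survival models would consume.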
X-intNMF: a cross- and intra-omics regularized NMF framework for multi-omics integration. Bioinformatics (Oxford, England), 2026-02-03.
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag043
Rosan C M Kuin, Alexander T Julian, Jagriti Chander, Sunah Lee, Gerard J P van Westen
Motivation: Protein sequence variation analysis is a topic of broad interest in drug discovery and protein engineering to support modulation of protein function for diverse biotechnological and therapeutic applications. To assist in the analysis of multiple sequence alignments (MSAs) and identify residues that account for protein function specificity, computational tools have been developed. Yet, existing programs often omit consideration of amino acid properties, flexibility beyond fixed webserver interfaces, accessible source code, or compatibility with small MSAs.
Results: To address these limitations, we present PyTEA-O, a Python implementation of Two-Entropies Analysis that has been developed to be easy to use for the analysis of protein sequence variation. To help users analyze the MSA and screen for residues of interest, we generate modifiable and intuitive visualizations. These visualizations, together with a scoring approach for identifying alignment positions with (dis-)similar physicochemical properties, present a powerful tool for sequence variability analysis. To demonstrate its capabilities, we present a case study based on the deubiquitinase OTUD7B (Cezanne) where we identify a crucial position that modulates its affinity for its substrate.
Availability and implementation: PyTEA-O is available at https://github.com/CDDLeiden/PyTEA-O/ and archived via Zenodo (https://doi.org/10.5281/zenodo.15914598).
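The core quantity behind an entropy-based analysis of an MSA can be sketched in a few lines. This is not the PyTEA-O API: the function names are hypothetical, the subfamily labels are supplied by hand, and pairing a global entropy with a mean within-group entropy is a simplified reading of the two-entropies idea.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def two_entropies(msa, groups):
    """For each alignment position return (global entropy, mean within-group entropy).

    msa: list of aligned sequences of equal length.
    groups: one subfamily label per sequence (assumed to be known).
    """
    result = []
    for pos in range(len(msa[0])):
        column = [seq[pos] for seq in msa]
        by_group = {}
        for residue, label in zip(column, groups):
            by_group.setdefault(label, []).append(residue)
        within = sum(column_entropy(c) for c in by_group.values()) / len(by_group)
        result.append((column_entropy(column), within))
    return result

# Toy alignment (assumed data): position 0 is conserved everywhere,
# position 1 differs between the two subfamilies but not within them.
toy_msa = ["AK", "AK", "AR", "AR"]
toy_groups = ["x", "x", "y", "y"]
entropies = two_entropies(toy_msa, toy_groups)
```

Positions with high global entropy but low within-group entropy, such as position 1 in the toy alignment, are the kind of candidates one would inspect for function specificity.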
PyTEA-O: a Python implementation of Two-Entropies Analysis for protein sequence variation analysis. Bioinformatics (Oxford, England), 2026-02-03.
Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag048
Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman
Motivation: Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.
Results: In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell-line-specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide-shuffled negatives performed poorly, even though dinucleotide shuffling is common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.
Availability and implementation: The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).
{"title":"How negative sampling shapes the performance of transcription factor binding site prediction models.","authors":"Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman","doi":"10.1093/bioinformatics/btag048","DOIUrl":"10.1093/bioinformatics/btag048","url":null,"abstract":"<p><strong>Motivation: </strong>Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.</p><p><strong>Results: </strong>In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. 
Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.</p><p><strong>Availability and implementation: </strong>The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12910371/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
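The abstract above defines true negatives as positions that are accessible but not bound by the TF in question. A minimal sketch of that filtering step follows, assuming peaks are given as (chromosome, start, end) tuples with half-open coordinates. The function names and the toy coordinates are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed layout
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def accessible_unbound(atac_peaks: List[Interval],
                       chip_peaks: List[Interval]) -> List[Interval]:
    """Keep accessible (ATAC-seq) regions that overlap no ChIP-seq peak of the TF."""
    return [p for p in atac_peaks
            if not any(overlaps(p, c) for c in chip_peaks)]


# Toy data (made up for illustration): three accessible regions, one bound region.
atac = [("chr1", 100, 300), ("chr1", 500, 700), ("chr2", 100, 300)]
chip = [("chr1", 250, 400)]

negatives = accessible_unbound(atac, chip)
print(negatives)
```

The quadratic scan is adequate for a toy example; for genome-scale peak sets, an interval tree or a sorted sweep would be the usual choice.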
Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag051
Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean
Motivation: Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, as can be induced by disease or treatment.
Results: We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applies logistic regression iteratively with a Pareto front optimization to balance gene set size with classification performance. We benchmark iLR on in silico datasets, demonstrating performance comparable to the state of the art at single-cell classification using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy individuals vs. patients with autism spectrum disorder and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even across species (mouse-to-human). We predicted via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. Overall, iLR provides a means to infer interpretable transcriptional signatures with prognostic or therapeutic potential from complex datasets.
Availability and implementation: iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Inference of marker genes of subtle cell state changes via iLR: iterative logistic regression.","authors":"Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean","doi":"10.1093/bioinformatics/btag051","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag051","url":null,"abstract":"<p><strong>Motivation: </strong>Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, as can be induced by disease or treatment.</p><p><strong>Results: </strong>We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applied logistic regression iteratively with a Pareto front optimization to balance gene set size with classification performance. We benchmark iLR on in silico datasets, demonstrating comparable performance to the state-of-the-art at single-cell classification using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy vs. autism spectrum disorder patients and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even species (mouse-to-human) comparison. We predicted via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. 
Overall, iLR provides means to infer interpretable transcriptional signatures from complex datasets with prognostic or therapeutic potential.</p><p><strong>Availability and implementation: </strong>iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
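The iLR abstract describes a Pareto front optimization that balances gene set size against classification performance. The sketch below illustrates only the generic Pareto-front selection step, assuming each candidate is summarized as a hypothetical (number of genes, accuracy) pair. It is not the authors' implementation, and the candidate values are invented for illustration.

```python
from typing import List, Tuple

# (number of genes, classification accuracy); an assumed summary of each candidate
Candidate = Tuple[int, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates that no other candidate dominates.

    A candidate is dominated if another one uses no more genes and is at
    least as accurate, with a strict improvement in at least one of the two.
    """
    front = []
    for n, acc in candidates:
        dominated = any(
            (m <= n and a >= acc) and (m < n or a > acc)
            for m, a in candidates
        )
        if not dominated:
            front.append((n, acc))
    return sorted(set(front))


# Invented candidates, e.g. one per iteration of a gene-selection loop.
candidates = [(50, 0.95), (20, 0.93), (20, 0.90), (10, 0.85), (30, 0.92), (5, 0.85)]

front = pareto_front(candidates)
print(front)
```

Each point on the front is a defensible trade-off; choosing a single gene set from it still requires a preference between compactness and accuracy.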
Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag054
Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos
Motivation: Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require users to have prior experience with the command line.
Results: We present PYRAMA, an open-source tool for meta-analysis of genome-wide association studies. It is an easy-to-use software package that includes several meta-analysis methods absent from similar packages. PYRAMA is faster than other tools; supports robust methods for analysis and meta-analysis as well as fixed-effects, random-effects, and Bayesian meta-analysis; and is currently the only tool that supports meta-analysis with imputation of summary statistics. It is available both as a standalone tool and as a freely available web server.
Availability: https://github.com/pbagos/PYRAMA, https://doi.org/10.5281/zenodo.17830449.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"PYRAMA: An open-source tool for advanced meta-analysis of genome wide association studies.","authors":"Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos","doi":"10.1093/bioinformatics/btag054","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag054","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require users to have prior experience with command-line tools to be executed.</p><p><strong>Results: </strong>We present here PYRAMA, an open-source tool which is designed for meta-analysis of genome wide association studies. This work introduces an easy-to-use software package that includes several meta-analysis methods that are absent in similar software packages. PYRAMA is faster compared to other tools, supports robust methods for analysis and meta-analysis, fixed-effects, random-effects and Bayesian meta-analysis and it is currently the only tool that supports meta-analysis with imputation of summary statistics. 
It is available both as a standalone tool and as a freely available web server.</p><p><strong>Availability: </strong>https://github.com/pbagos/PYRAMA, https://doi.org/10.5281/zenodo.17830449.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
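Fixed-effects meta-analysis is one of the methods listed in the PYRAMA abstract. As a point of reference, the sketch below shows the standard inverse-variance fixed-effects estimator for a single variant, given per-cohort effect sizes and standard errors. It is a generic textbook calculation, not PYRAMA's code or interface. The function name and input values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effects_meta(betas: Sequence[float],
                       ses: Sequence[float]) -> Tuple[float, float, float, float]:
    """Inverse-variance fixed-effects pooling of per-cohort effect estimates.

    Returns (pooled beta, pooled standard error, z-score, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal null distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Invented summary statistics for one variant in two cohorts.
beta, se, z, p = fixed_effects_meta([0.10, 0.20], [0.05, 0.05])
print(f"beta={beta:.4f} se={se:.4f} z={z:.3f} p={p:.2e}")
```

With equal standard errors the pooled estimate is the plain average of the cohort effects, and the pooled standard error shrinks by a factor of the square root of the number of cohorts.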