Pub Date: 2026-02-03 DOI: 10.1093/bioinformatics/btag008
Adriana Carolina Gonzalez-Cavazos, Roger Tu, Meghamala Sinha, Andrew I Su
Drug repositioning offers a cost-effective alternative to traditional drug development by identifying new uses for existing drugs. Recent advances leverage Graph Neural Networks (GNNs) to model complex biological data, showing promise in predicting novel drug-disease associations; however, these frameworks often lack explainability, a critical factor for validating predictions and understanding drug mechanisms. Here, we introduce Drug-Based Reasoning Explainer (DBR-X), an explainable GNN model that integrates a link-prediction module with a path-identification module to generate interpretable and faithful explanations. When benchmarked against other GNN-based link-prediction frameworks, DBR-X achieves superior performance in identifying known drug-disease associations, demonstrating higher accuracy across all evaluation metrics. The quality of DBR-X biological explanations was evaluated through multiple complementary approaches, including comparison with manually curated drug mechanisms, assessment of explanation faithfulness using deletion and insertion studies, and measurement of stability under graph perturbations. Together, these results show that DBR-X advances the state of the art in drug repositioning while providing multi-hop mechanistic explanations that can facilitate the translation of computational predictions into clinical applications.
Availability and implementation: The DBR-X package is freely accessible from the online repository https://github.com/SuLab/DBR-X.
Title: A case-based explainable graph neural network framework for mechanistic drug repositioning. Journal: Bioinformatics (Oxford, England).
Pub Date: 2026-02-02 DOI: 10.1093/bioinformatics/btag051
Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean
Motivation: Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, as can be induced by disease or treatment.
Results: We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applies logistic regression iteratively with a Pareto front optimization to balance gene set size with classification performance. We benchmark iLR on in silico datasets, demonstrating performance comparable to the state of the art at single-cell classification using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy vs. autism spectrum disorder patients and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even across species (mouse to human). We predicted via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. Overall, iLR provides a means to infer interpretable transcriptional signatures from complex datasets with prognostic or therapeutic potential.
Availability and implementation: iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Inference of marker genes of subtle cell state changes via iLR: iterative logistic regression.
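The iLR abstract above describes balancing gene set size against classification performance with a Pareto front. The following is a minimal sketch of that selection step only, assuming each candidate gene set has already been summarized as a (number of genes, accuracy) pair; the function name `pareto_front` and the example numbers are hypothetical and are not taken from the iLR package.

```python
from typing import List, Tuple

# Hypothetical summary of a candidate gene set: (number of genes, accuracy).
Candidate = Tuple[int, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates not dominated by any other candidate.

    A candidate is dominated if another one uses no more genes AND is at
    least as accurate, and is strictly better on at least one of the two.
    """
    front = []
    for size, acc in candidates:
        dominated = any(
            (s <= size and a >= acc) and (s < size or a > acc)
            for s, a in candidates
        )
        if not dominated:
            front.append((size, acc))
    return sorted(front)


# Illustrative values only, not results from the paper.
candidates = [(5, 0.80), (10, 0.90), (20, 0.90), (8, 0.75), (50, 0.95)]
print(pareto_front(candidates))  # [(5, 0.8), (10, 0.9), (50, 0.95)]
```

Each point left on the front is a defensible trade-off: no other candidate is both smaller and at least as accurate.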
Pub Date: 2026-02-02 DOI: 10.1093/bioinformatics/btag054
Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos
Motivation: Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require prior experience with command-line tools.
Results: We present PYRAMA, an open-source tool for meta-analysis of genome-wide association studies. This easy-to-use software package includes several meta-analysis methods that are absent from similar packages. PYRAMA is faster than other tools; supports robust methods for analysis and meta-analysis as well as fixed-effects, random-effects, and Bayesian meta-analysis; and is currently the only tool that supports meta-analysis with imputation of summary statistics. It is available both as a standalone tool and as a freely available web server.
Availability: https://github.com/pbagos/PYRAMA, https://doi.org/10.5281/zenodo.17830449.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: PYRAMA: An open-source tool for advanced meta-analysis of genome wide association studies.
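The PYRAMA abstract above lists fixed-effects meta-analysis among its supported methods. As a point of reference, here is a minimal sketch of the textbook inverse-variance fixed-effects estimator for a single variant. It is not PYRAMA's code or interface; the function name `fixed_effects_meta` and the input values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effects_meta(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float]:
    """Pool per-cohort effect sizes with inverse-variance weights.

    Returns (pooled effect, pooled standard error, z-score).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one standard error per effect size")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


# Two hypothetical cohorts reporting the same variant.
beta, se, z = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f}")
```

With equal standard errors the pooled effect is the plain average of the cohort effects, and the pooled standard error shrinks by a factor of the square root of the number of cohorts.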
Pub Date: 2026-02-02 DOI: 10.1093/bioinformatics/btag053
Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo
Motivation: Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge, findability, is the first of the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and it often misses relevant datasets.
Results: We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
Availability: Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Using semantic search to find publicly available gene-expression datasets.
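The abstract above describes embedding dataset descriptions and retrieving the datasets whose descriptions are most similar to a set of known relevant ones. The sketch below illustrates one way such a ranking could work, assuming embeddings are already computed and that the known datasets are averaged into a single query vector; the averaging step, the function names, and the toy two-dimensional vectors are assumptions, not details from the paper or its tools.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    seed_embeddings: List[Sequence[float]],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by similarity to the centroid of the seed embeddings."""
    dim = len(seed_embeddings[0])
    centroid = [
        sum(e[i] for e in seed_embeddings) / len(seed_embeddings)
        for i in range(dim)
    ]
    scored = [(name, cosine(centroid, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings with hypothetical dataset labels.
seeds = [[1.0, 0.0], [1.0, 0.0]]
pool = {"series_A": [1.0, 0.0], "series_B": [0.0, 1.0], "series_C": [1.0, 1.0]}
for name, score in rank_by_similarity(seeds, pool):
    print(name, round(score, 3))
```

Real embeddings would have hundreds or thousands of dimensions and come from a language model, but the ranking logic is the same.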
Pub Date: 2026-01-31 DOI: 10.1093/bioinformatics/btag052
David Köhler, Niklas Kleinenkuhnen, Kiarash Rastegar, Till Baar, Chrysa Nikopoulou, Vangelis Kondylis, Vlada Milchevskaya, Matthias Schmid, Peter Tessarz, Achim Tresch
Motivation: We introduce a statistical approach for pattern recognition in multivariate spatial transcriptomics data.
Results: Our algorithm constructs a projection of the data onto a low-dimensional feature space which is optimal in maximising Moran's I, a measure of spatial dependency. This projection mitigates non-spatial variation and outperforms principal components analysis for pre-processing. Patterns of spatially variable genes are well represented in this feature space, and their projection can be shown to be a denoising operation. Our framework does not require any parameter tuning, and it furthermore gives rise to a calibrated, powerful test of spatial gene expression.
Availability and implementation: The algorithm is implemented in the open source software R and is available at https://github.com/IMSBCompBio/SpaCo.
Title: A spectral dimension reduction technique that improves pattern detection in multivariate spatial data.
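The abstract above builds on Moran's I as its measure of spatial dependency. For readers unfamiliar with the statistic, the sketch below computes the standard univariate Moran's I, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights. It does not reproduce the SpaCo projection; the helper `chain_weights`, which links each spot to its immediate neighbours along a line, is a hypothetical stand-in for a real spatial neighbourhood graph.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Standard Moran's I for one variable and a spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_total = sum(sum(row) for row in weights)
    if denom == 0 or w_total == 0:
        raise ValueError("values must vary and weights must not all be zero")
    num = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_total) * (num / denom)


def chain_weights(n: int) -> List[List[float]]:
    """Hypothetical neighbourhood: spot i is adjacent to spots i-1 and i+1."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(4)
print(morans_i([1, 2, 3, 4], w))    # smooth gradient: positive
print(morans_i([1, -1, 1, -1], w))  # alternating pattern: negative
```

A smooth gradient along the chain gives a positive value, while an alternating pattern gives a negative one, which matches the intuition that Moran's I rewards neighbouring spots with similar expression.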
Pub Date: 2026-01-30 DOI: 10.1093/bioinformatics/btag050
Kari Salokas, Salla Keskitalo, Markku Varjosalo
Mass spectrometry-based proteomics generates increasingly large datasets requiring rapid quality control (QC) and preliminary analysis. Current software solutions often require specialized knowledge, limiting their routine use. We developed ProteoGyver (PG), an accessible, lightweight software solution designed for rapid QC and preliminary proteomics data analysis. PG provides automated QC metrics, intuitive graphical reports, and streamlined workflows for whole-proteome and interactomics datasets, significantly lowering the barrier to regular QC practices. The platform includes additional tools such as MS Inspector for longitudinal chromatogram inspection and Colocalizer for microscopy data. PG is easily deployed as a Docker container or standalone Python installation. PG is open source: the image is freely available on Docker Hub and the source code on GitHub at github.com/varjolab/Proteogyver.
Availability: The PG image and source code are available on GitHub and Docker Hub under LGPL-2.1.
Title: ProteoGyver: A Fast, User-Friendly Tool for Routine QC and Analysis of MS-based Proteomics Data.
Pub Date: 2026-01-30 DOI: 10.1093/bioinformatics/btag044
Christoph Stelz, Lukas Hübner, Alexandros Stamatakis
Motivation: Phylogenetic trees describe the evolutionary history among biological species based on their genomic data. Maximum Likelihood (ML) based phylogenetic inference tools search for the tree and evolutionary model that best explain the observed genomic data. Given the independence of likelihood score calculations between different genomic sites, parallel computation is commonly deployed. This is followed by a parallel summation over the per-site scores to obtain the overall likelihood score of the tree. However, basic arithmetic operations on IEEE 754 floating-point numbers, such as addition and multiplication, inherently introduce rounding errors. Consequently, the order in which floating-point operations are executed affects the exact resulting likelihood value, since these operations are not associative. Moreover, parallel reduction algorithms in numerical codes re-associate operations as a function of the core count and cluster network topology, inducing different round-off errors. These low-level deviations can cause heuristic searches to diverge and induce high-level result discrepancies (e.g., yield topologically distinct phylogenies). This effect has also been observed in multiple scientific fields beyond phylogenetics.
Results: We observe that varying the degree of parallelism results in diverging phylogenetic tree searches (high-level results) for over 31% of 10 179 empirical datasets. More importantly, 8% of these diverging datasets yield trees that are statistically significantly worse than the best known ML tree for the dataset (AU-test, p < 0.05). To alleviate this, we develop a variant of the widely used phylogenetic inference tool RAxML-NG, which does yield bit-reproducible results under varying core counts, with a slowdown of only 0 to 12.7% (median 0.8%) on up to 768 cores. For this, we introduce the ReproRed reduction algorithm, which yields bit-identical results under varying core counts by maintaining a fixed operation order that is independent of the communication pattern. ReproRed is thus applicable to all associative reduction operations, in contrast to competitors, which are confined to summation. Our ReproRed reduction algorithm only exchanges the theoretical minimum number of messages, overlaps communication with computation, and utilizes fast base-cases for local reductions. ReproRed is able to all-reduce (via a subsequent broadcast) 4.1×10^6 operands across 48 to 768 cores in 19.7 to 48.61 μs, thereby exhibiting a slowdown of 13 to 93% over a non-reproducible all-reduce algorithm. ReproRed outperforms the state-of-the-art reproducible all-reduction algorithm ReproBLAS (offers summation only) beyond 10 000 elements per core. In summary, we re-assess non-reproducibility in parallel phylogenetic inference, present the first bit-reproducible parallel phylogenetic inference tool, as well as introduce a general algorithm and open-source code for conducting reproducible assoc
Title: Bit-Reproducible Parallel Phylogenetic Tree Inference.
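The abstract above rests on two ideas: floating-point addition is not associative, and a reduction becomes reproducible when its operation order is fixed independently of how the data is split across cores. The sketch below illustrates both in a single process. It is not the ReproRed algorithm and involves no message passing; `naive_sum`, `chunked_sum`, and `fixed_tree_sum` are hypothetical names chosen for this illustration.

```python
from typing import Optional, Sequence


def naive_sum(values: Sequence[float]) -> float:
    """Plain left-to-right accumulation (avoids the built-in sum, which
    may use compensated summation in recent Python versions)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values: Sequence[float], n_ranks: int) -> float:
    """Simulated parallel sum: each 'rank' adds its contiguous block, then
    the partial sums are added. The grouping depends on n_ranks."""
    n = len(values)
    bounds = [n * r // n_ranks for r in range(n_ranks + 1)]
    partials = [naive_sum(values[bounds[r]:bounds[r + 1]]) for r in range(n_ranks)]
    return naive_sum(partials)


def fixed_tree_sum(
    values: Sequence[float], lo: int = 0, hi: Optional[int] = None
) -> float:
    """Sum along a binary tree whose shape depends only on the element
    indices, so the grouping never changes with the number of ranks."""
    if hi is None:
        hi = len(values)
    if hi - lo == 0:
        return 0.0
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    return fixed_tree_sum(values, lo, mid) + fixed_tree_sum(values, mid, hi)


# Regrouping the same three numbers changes the last bit of the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

data = [0.1 * (i + 1) for i in range(1000)]
# Grouping, and therefore possibly the rounding, varies with the rank count.
print({chunked_sum(data, p) for p in (1, 2, 3, 7)})
# The tree-ordered result has no dependence on any rank count.
print(fixed_tree_sum(data))
```

In a distributed setting the difficulty is evaluating such a fixed tree efficiently when its subtrees straddle core boundaries, which is the part the paper's algorithm addresses and this sketch does not.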
Pub Date: 2026-01-29 DOI: 10.1093/bioinformatics/btag041
Anja Hess, Dominik Seelow, Helene Kretzmer
Summary: DNAvi is a Python-based tool for rapid grouped analysis and visualization of cell-free DNA fragment size profiles directly from electrophoresis data, overcoming the need for sequencing in basic fragmentomic screenings. It enables normalization, statistical comparison, and publication-ready plotting of multiple samples, supporting quality control and exploratory fragmentomics in clinical and research workflows.
Availability and implementation: DNAvi is implemented in Python and freely available on GitHub at https://github.com/anjahess/DNAvi under a GNU General Public License v3.0, along with source code, documentation, and examples. An archived version is available under https://doi.org/10.5281/zenodo.18097730.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: DNAvi: Integration, statistics, and visualization of cell-free DNA fragment traces.
Pub Date : 2026-01-27 DOI: 10.1093/bioinformatics/btag049
Chaojie Wang, Xin Yu
Motivation: Capturing spatial structure is fundamental to the analysis of spatial transcriptomics data. However, most existing methods focus on clustering within individual tissue slices and often ignore the high inter-slice similarity inherent in multi-slice datasets.
Results: To address this limitation, we propose STransfer, a novel transfer learning framework that combines graph convolutional networks (GCNs) with positive pointwise mutual information (PPMI) to model both local and global spatial dependencies. An attention-based module is introduced to fuse features from multiple graphs into unified node representations, facilitating the learning of low-dimensional embeddings that jointly encode gene expression and spatial context. By transferring knowledge from labeled slices to adjacent unlabeled ones, STransfer significantly enhances clustering accuracy while reducing manual annotation costs. Extensive experiments demonstrate that STransfer consistently outperforms state-of-the-art methods in both spatial modeling and cross-slice transfer performance.
Availability and implementation: The code for STransfer has been uploaded to GitHub: https://github.com/Saki-JSU/Publications/tree/main/STransfer.
Supplementary information: No supplementary information is available for this manuscript.
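The positive pointwise mutual information (PPMI) weighting named in the STransfer abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how co-occurrence counts are obtained, so the sketch assumes a square matrix of non-negative counts is already given (for example, from walks over a spot graph). The names `ppmi` and `example_counts` are hypothetical.

```python
import math


def ppmi(counts):
    """Return the PPMI matrix for a square matrix of non-negative co-occurrence counts.

    PPMI[i][j] = max(0, log(p_ij / (p_i * p_j))), where p_ij is the joint
    frequency and p_i, p_j are the row and column marginals.
    Assumption: `counts` is a list of equal-length lists; zero counts map to 0.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * len(row) for row in counts]
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c == 0:
                # log(0) is undefined; PPMI treats unseen pairs as 0.
                out_row.append(0.0)
            else:
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                # Clip negative associations to zero (the "positive" in PPMI).
                out_row.append(max(0.0, pmi))
        result.append(out_row)
    return result


# Hypothetical co-occurrence counts for three nodes.
example_counts = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
example_ppmi = ppmi(example_counts)
```

A matrix like `example_ppmi` could then serve as a second, global-context adjacency alongside the local spatial graph, which is one plausible reading of "both local and global spatial dependencies" in the abstract.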
{"title":"STransfer: A Transfer Learning-Enhanced Graph Convolutional Network for Clustering Spatial Transcriptomics Data.","authors":"Chaojie Wang, Xin Yu","doi":"10.1093/bioinformatics/btag049","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag049","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2026-01-27"}
Pub Date : 2026-01-27 DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang
Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.
Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.
Availability and implementation: The source code and datasets supporting this study are publicly available on GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).
Supplementary information: Supplementary data are available at Bioinformatics online.
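For orientation, the general idea of network-regularized NMF mentioned in the X-intNMF abstract can be written in its textbook single-network form. The abstract does not give the X-intNMF objective or say how the cross-omics and intra-omics networks are combined, so the expression below is only a generic sketch, and every symbol in it is an assumption: $X$ is a non-negative features-by-samples matrix, $W$ and $H$ are the non-negative factors, $A$ is a symmetric feature interaction network with degree matrix $D$, and $\lambda$ is a regularization weight.

```latex
\min_{W \ge 0,\; H \ge 0} \;
  \lVert X - W H \rVert_F^2
  \;+\; \lambda \, \operatorname{tr}\!\left( W^{\top} L W \right),
\qquad L = D - A,
\qquad
\operatorname{tr}\!\left( W^{\top} L W \right)
  = \tfrac{1}{2} \sum_{i,j} A_{ij} \, \lVert w_i - w_j \rVert_2^2 .
```

Here $w_i$ is the $i$-th row of $W$, so the penalty pulls the factor loadings of interacting features toward each other while the first term keeps the reconstruction close to the data.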
{"title":"X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration.","authors":"Tien-Thanh Bui, Rui Xie, Wei Zhang","doi":"10.1093/bioinformatics/btag046","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag046","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2026-01-27"}