Using semantic search to find publicly available gene-expression datasets
Pub Date: 2026-02-02 | DOI: 10.1093/bioinformatics/btag053
Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo
Motivation: Millions of high-throughput, molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge corresponds to findability, the first of the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing search results is time-consuming and tedious, and such searches often miss relevant datasets.
Results: We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
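The ranking idea can be sketched in a few lines: embed the descriptions of datasets already judged relevant, embed candidate descriptions, and rank candidates by cosine similarity to the centroid of the known-relevant embeddings. The snippet below is a minimal illustration of that workflow, not the authors' GEOfinder pipeline; the embedding model name and the toy GEO descriptions are assumptions.

```python
# Minimal sketch of embedding-based dataset discovery (illustrative only).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model

# Descriptions of datasets a researcher has already judged relevant (hypothetical).
known_relevant = [
    "RNA-seq of pancreatic islets from type 2 diabetes donors and controls",
    "Expression profiling of skeletal muscle in insulin-resistant subjects",
]
# Descriptions of candidate GEO series to rank (hypothetical accessions).
candidates = {
    "GSE0001": "Microarray study of liver tissue in a mouse model of obesity",
    "GSE0002": "Single-cell RNA-seq of human islets in type 2 diabetes",
    "GSE0003": "ChIP-seq of histone marks in embryonic stem cells",
}

# Centroid of the embeddings of the known-relevant descriptions.
query = model.encode(known_relevant, normalize_embeddings=True).mean(axis=0)
cand_ids = list(candidates)
cand_vecs = model.encode([candidates[i] for i in cand_ids], normalize_embeddings=True)

# Rank candidates by cosine similarity to the query centroid.
scores = cand_vecs @ query
for gse, score in sorted(zip(cand_ids, scores), key=lambda x: -x[1]):
    print(gse, round(float(score), 3))
```

In practice, all GEO series descriptions would be embedded once and the stored vectors reused for many such queries.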
Availability: Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Using semantic search to find publicly available gene-expression datasets.","authors":"Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo","doi":"10.1093/bioinformatics/btag053","DOIUrl":"10.1093/bioinformatics/btag053","url":null,"abstract":"<p><strong>Motivation: </strong>Millions of high-throughput, molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing 100,000 s of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and it often misses relevant datasets.</p><p><strong>Results: </strong>We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.</p><p><strong>Availability: </strong>Our analysis code and a Web-based tool that enables others to use our methodology are availabe from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A spectral dimension reduction technique that improves pattern detection in multivariate spatial data
Pub Date: 2026-01-31 | DOI: 10.1093/bioinformatics/btag052
David Köhler, Niklas Kleinenkuhnen, Kiarash Rastegar, Till Baar, Chrysa Nikopoulou, Vangelis Kondylis, Vlada Milchevskaya, Matthias Schmid, Peter Tessarz, Achim Tresch
Motivation: We introduce a statistical approach for pattern recognition in multivariate spatial transcriptomics data.
Results: Our algorithm constructs a projection of the data onto a low-dimensional feature space which is optimal in maximising Moran's I, a measure of spatial dependency. This projection mitigates non-spatial variation and outperforms principal components analysis for pre-processing. Patterns of spatially variable genes are well represented in this feature space, and their projection can be shown to be a denoising operation. Our framework does not require any parameter tuning, and it furthermore gives rise to a calibrated, powerful test of spatial gene expression.
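As a rough illustration of the underlying idea (not the SpaCo R implementation): because Moran's I of a projection Xw is a ratio of quadratic forms in w, the projections that maximize it can be obtained from a generalized eigenvalue problem. The numpy/scipy sketch below uses simulated data and an assumed Gaussian-kernel spatial weight matrix.

```python
# Hedged sketch: projections of centered expression data that maximize Moran's I.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n_spots, n_genes, k = 200, 50, 5

X = rng.normal(size=(n_spots, n_genes))        # expression matrix (spots x genes)
X -= X.mean(axis=0)                            # center each gene
coords = rng.uniform(size=(n_spots, 2))        # spatial coordinates of spots

# Simple symmetric Gaussian-kernel spatial weights with zero diagonal (assumption).
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.01)
np.fill_diagonal(W, 0.0)
W = (W + W.T) / 2

# Moran's I of Xw is proportional to (w' X'WX w) / (w' X'X w), so the optimal
# projections solve a generalized eigenvalue problem.
A = X.T @ W @ X
B = X.T @ X + 1e-8 * np.eye(n_genes)           # small ridge for numerical stability
evals, evecs = eigh(A, B)
components = evecs[:, ::-1][:, :k]             # top-k spatially coherent projections
scores = X @ components                        # low-dimensional spatial embedding
print(scores.shape)
```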
Availability and implementation: The algorithm is implemented in the open source software R and is available at https://github.com/IMSBCompBio/SpaCo.
{"title":"A spectral dimension reduction technique that improves pattern detection in multivariate spatial data.","authors":"David Köhler, Niklas Kleinenkuhnen, Kiarash Rastegar, Till Baar, Chrysa Nikopoulou, Vangelis Kondylis, Vlada Milchevskaya, Matthias Schmid, Peter Tessarz, Achim Tresch","doi":"10.1093/bioinformatics/btag052","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag052","url":null,"abstract":"<p><strong>Motivation: </strong>We introduce a statistical approach for pattern recognition in multivariate spatial transcriptomics data.</p><p><strong>Results: </strong>Our algorithm constructs a projection of the data onto a low-dimensional feature space which is optimal in maximising Moran's I, a measure of spatial dependency. This projection mitigates non-spatial variation and outperforms principal components analysis for pre-processing. Patterns of spatially variable genes are well represented in this feature space, and their projection can be shown to be a denoising operation. Our framework does not require any parameter tuning, and it furthermore gives rise to a calibrated, powerful test of spatial gene expression.</p><p><strong>Availability and implementation: </strong>The algorithm is implemented in the open source software R and is available at https://github.com/IMSBCompBio/SpaCo.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146097705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProteoGyver: A Fast, User-Friendly Tool for Routine QC and Analysis of MS-based Proteomics Data
Pub Date: 2026-01-30 | DOI: 10.1093/bioinformatics/btag050
Kari Salokas, Salla Keskitalo, Markku Varjosalo
Mass spectrometry-based proteomics generates increasingly large datasets requiring rapid quality control (QC) and preliminary analysis. Current software solutions often require specialized knowledge, limiting their routine use. We developed ProteoGyver (PG), an accessible, lightweight software solution designed for rapid QC and preliminary proteomics data analysis. PG provides automated QC metrics, intuitive graphical reports, and streamlined workflows for whole-proteome and interactomics datasets, significantly lowering the barrier to regular QC practices. The platform includes additional tools such as MS Inspector for longitudinal chromatogram inspection and Colocalizer for microscopy data. PG is easily deployed as a Docker container or a standalone Python installation. Availability: PG is open source; the Docker image and source code are freely available on Docker Hub and GitHub (github.com/varjolab/Proteogyver) under the LGPL-2.1 license.
{"title":"ProteoGyver: A Fast, User-Friendly Tool for Routine QC and Analysis of MS-based Proteomics Data.","authors":"Kari Salokas, Salla Keskitalo, Markku Varjosalo","doi":"10.1093/bioinformatics/btag050","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag050","url":null,"abstract":"<p><p>Mass spectrometry-based proteomics generates increasingly large datasets requiring rapid quality control (QC) and preliminary analysis. Current software solutions often require specialized knowledge, limiting their routine use. We developed ProteoGyver (PG), an accessible, lightweight software solution designed for rapid QC and preliminary proteomics data analysis. PG provides automated QC metrics, intuitive graphical reports, and streamlined workflows for whole-proteome and interactomics datasets, significantly lowering the barrier to regular QC practices. The platform includes additional tools such as MS Inspector for longitudinal chromatogram inspection and Colocalizer for microscopy data. PG is easily deployed as a Docker container or standalone Python installation. PG is open-source and freely available in dockerhub and source code in github at github.com/varjolab/Proteogyver. Availability PG image and source code are available in github and dockerhub under LGPL-2.1.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration
Pub Date: 2026-01-27 | DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang
Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.
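As an informal sketch of what network regularization of NMF looks like (the actual X-intNMF objective, which couples cross-omics and intra-omics networks across multiple data layers, is more elaborate), the toy Python code below adds a graph-Laplacian penalty on the feature loadings to standard multiplicative NMF updates.

```python
# Hedged sketch of graph-regularized NMF; not the X-intNMF algorithm itself.
import numpy as np

def graph_regularized_nmf(X, L, rank=10, lam=0.1, n_iter=200, eps=1e-9):
    """Factorize X (samples x features) ~ W @ H with a penalty lam * tr(H L H^T)
    that encourages features connected in an interaction network to share loadings.
    L = D - A is the Laplacian of a feature-feature interaction network."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    D = np.diag(np.diag(L))          # degree part of the Laplacian
    A = D - L                        # adjacency part
    for _ in range(n_iter):
        # Standard multiplicative updates with the Laplacian term folded into H.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * H @ A) / (W.T @ W @ H + lam * H @ D + eps)
    return W, H

# Toy example: random non-negative data and a chain-graph Laplacian over features.
X = np.abs(np.random.default_rng(1).normal(size=(100, 40)))
adj = np.diag(np.ones(39), 1) + np.diag(np.ones(39), -1)
L = np.diag(adj.sum(axis=1)) - adj
W, H = graph_regularized_nmf(X, L)
print(W.shape, H.shape)
```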
Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.
Availability and implementation: The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration.","authors":"Tien-Thanh Bui, Rui Xie, Wei Zhang","doi":"10.1093/bioinformatics/btag046","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag046","url":null,"abstract":"<p><strong>Motivation: </strong>The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.</p><p><strong>Results: </strong>For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.</p><p><strong>Availability and implementation: </strong>The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How Negative Sampling Shapes the Performance of Transcription Factor Binding Site Prediction Models
Pub Date: 2026-01-27 | DOI: 10.1093/bioinformatics/btag048
Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman
Motivation: Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.
Results: In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.
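The construction of "accessible but unbound" negatives can be illustrated with a toy example: intersect ATAC-seq peaks with ChIP-seq peaks and treat accessible regions without any ChIP overlap as true negatives. The intervals below are invented for illustration; real pipelines operate on BED files, typically with tools such as bedtools.

```python
# Hedged sketch: deriving accessible-but-unbound negatives from ChIP-seq and ATAC-seq peaks.
def overlaps(a, b):
    """True if intervals a=(chrom, start, end) and b overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

chip_peaks = [("chr1", 1000, 1200), ("chr1", 5000, 5300)]            # TF-bound regions
atac_peaks = [("chr1", 990, 1210), ("chr1", 3000, 3400),
              ("chr1", 5100, 5250), ("chr2", 700, 900)]               # accessible regions

positives = [p for p in atac_peaks if any(overlaps(p, c) for c in chip_peaks)]
negatives = [p for p in atac_peaks if not any(overlaps(p, c) for c in chip_peaks)]

print("positives:", positives)   # accessible and bound by the TF
print("negatives:", negatives)   # accessible but not bound by the TF
```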
Availability and implementation: The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"How Negative Sampling Shapes the Performance of Transcription Factor Binding Site Prediction Models.","authors":"Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman","doi":"10.1093/bioinformatics/btag048","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag048","url":null,"abstract":"<p><strong>Motivation: </strong>Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.</p><p><strong>Results: </strong>In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.</p><p><strong>Availability and implementation: </strong>The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One-Hot News: Drug Synergy Models Shortcut Molecular Features
Pub Date: 2026-01-24 | DOI: 10.1093/bioinformatics/btag040
Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan
Motivation: Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, taking as input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.
Results: In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.
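The one-hot baseline is simple to reproduce in principle: each drug and cell line is represented purely as an identifier, with no chemical or molecular features, and fed to any regressor. The snippet below is a schematic version with placeholder synergy values, not the specific models benchmarked in the study.

```python
# Hedged sketch of a one-hot identifier baseline for synergy prediction.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

# (drug A, drug B, cell line) triplets and placeholder synergy scores (invented).
triplets = [("drugA", "drugB", "cellX"),
            ("drugA", "drugC", "cellX"),
            ("drugB", "drugC", "cellY"),
            ("drugA", "drugB", "cellY")]
synergy = np.array([10.2, -3.1, 5.4, 8.7])

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(triplets)              # one indicator column per drug/cell identifier

model = Ridge(alpha=1.0).fit(X, synergy)
print(model.predict(enc.transform([("drugA", "drugC", "cellY")])))
```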
Implementation: The scripts to run the experiments are available at https://github.com/tastanlab/ohe.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"One-Hot News: Drug Synergy Models Shortcut Molecular Features.","authors":"Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan","doi":"10.1093/bioinformatics/btag040","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag040","url":null,"abstract":"<p><strong>Motivation: </strong>Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, where they take input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.</p><p><strong>Results: </strong>In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.</p><p><strong>Implementation: </strong>The scritps to run the experiments are available at: https://github.com/tastanlab/ohe.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146044327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GeoGAD: Geometry-Aware Antibody Design Framework for Complementarity-Determining Region Precision Engineering
Pub Date: 2026-01-24 | DOI: 10.1093/bioinformatics/btag042
Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan
Motivation: Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance of precise antibody design for diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity, factors that collectively degrade prediction accuracy.
Results: To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.
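As a loose, generic illustration of distance-modulated attention (not GeoGAD's edge-type-sensitive formulation), standard scaled dot-product attention logits can be biased with a Gaussian kernel of pairwise residue distances, which concentrates attention on spatially close residues while retaining the global context:

```python
# Hedged sketch: attention logits biased by a Gaussian kernel of residue distances.
import torch
import torch.nn.functional as F

def gaussian_attention(q, k, v, coords, sigma=8.0):
    """q, k, v: (n, d) per-residue features; coords: (n, 3) residue coordinates."""
    d = q.shape[-1]
    logits = (q @ k.T) / d ** 0.5                 # standard scaled dot-product logits
    dist2 = torch.cdist(coords, coords) ** 2      # pairwise squared distances
    logits = logits - dist2 / (2 * sigma ** 2)    # Gaussian spatial bias (assumed form)
    return F.softmax(logits, dim=-1) @ v          # attention-weighted values

n, d = 16, 32
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
coords = torch.randn(n, 3) * 10
print(gaussian_attention(q, k, v, coords).shape)
```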
Availability and implementation: The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.
{"title":"GeoGAD: Geometry-Aware Antibody Design Framework for Complementarity-Determining Region Precision Engineering.","authors":"Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan","doi":"10.1093/bioinformatics/btag042","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag042","url":null,"abstract":"<p><strong>Motivation: </strong>Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance for precise antibody design in diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity-factors that collectively degrade prediction accuracy.</p><p><strong>Results: </strong>To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.</p><p><strong>Availability and implementation: </strong>The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146044396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs
Pub Date: 2026-01-22 | DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri
Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.
Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.
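The kind of operation being offloaded can be illustrated with a toy example: forming a genetic relationship matrix (GRM) as a standardized-genotype cross-product on the GPU. The CuPy sketch below is purely illustrative; SAIGE-GPU's kernels and multi-GPU distribution are considerably more involved, and running it requires a CUDA-capable GPU with CuPy installed.

```python
# Hedged sketch: GRM = Z Z^T / M computed on the GPU with CuPy (toy dimensions).
import numpy as np
import cupy as cp

n_samples, n_variants = 2000, 10000
G = np.random.default_rng(0).integers(0, 3, size=(n_samples, n_variants)).astype(np.float32)

# Standardize genotypes per variant (mean 0, variance 1).
G -= G.mean(axis=0)
G /= G.std(axis=0) + 1e-8

G_gpu = cp.asarray(G)                       # move standardized genotypes to GPU memory
GRM = (G_gpu @ G_gpu.T) / n_variants        # relationship matrix computed on the GPU
print(cp.asnumpy(GRM).shape)
```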
Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is implemented in R/C++, runs on Linux systems, and is available in a containerized format for use across HPC and cloud environments.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.","authors":"Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri","doi":"10.1093/bioinformatics/btag032","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag032","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.</p><p><strong>Results: </strong>We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.</p><p><strong>Availability: </strong>Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: [10.5281/zenodo.17642591]). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments and is implemented in R/C ++ and runs on Linux systems.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset
Pub Date: 2026-01-22 | DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu
Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.
Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.
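To illustrate just the competing-risks component (not the full JM-NCC framework, which additionally couples a longitudinal biomarker submodel and NCC sampling weights), cause-specific hazards can be approximated by fitting one Cox model per event type while treating the other event as censoring. The data below are simulated and the use of lifelines is an assumption for illustration.

```python
# Hedged sketch of cause-specific hazard fits under competing risks (simulated data).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "time": rng.exponential(5, n),
    "event": rng.choice([0, 1, 2], n, p=[0.5, 0.3, 0.2]),  # 0 = censored, 1/2 = competing events
    "biomarker": rng.normal(size=n),
})

for cause in (1, 2):
    d = df.assign(status=(df["event"] == cause).astype(int))  # other cause treated as censored
    cph = CoxPHFitter().fit(d[["time", "status", "biomarker"]],
                            duration_col="time", event_col="status")
    print(f"cause {cause} hazard ratio for biomarker:",
          round(float(np.exp(cph.params_["biomarker"])), 3))
```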
Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.","authors":"Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu","doi":"10.1093/bioinformatics/btag038","DOIUrl":"10.1093/bioinformatics/btag038","url":null,"abstract":"<p><strong>Motivation: </strong>Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.</p><p><strong>Results: </strong>Motivated by the TEDDY study, we propose \"JM-NCC\", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed. fJM-NCC leverages NCC sub-cohort longitudinal biomarker data and full cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.</p><p><strong>Availability: </strong>Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
De novo protein ligand design including protein flexibility and conformational adaptation
Pub Date: 2026-01-22 | DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias
Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current approaches, including molecular docking, fragment-based build-up, and machine-learning-based generative drug design, employ a rigid protein target structure.
Results: Building on recent progress in predicting protein structures and protein-compound complexes, we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible, conformationally adaptable protein binding region. During a Monte Carlo (MC) simulation that randomly modifies the compound, the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Beyond compound flexibility, this allows the protein to adapt to the chemically changing compound. MC protocols based on atom/bond-type changes and on combining larger chemical fragments have been tested. Simulations on three test targets yielded potential ligands whose binding scores, under several different scoring schemes, were comparable to those of experimentally known binders. The MC-based compound design approach is complementary to existing approaches and could aid the rapid design of putative binders, including induced fit of the protein target.
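A schematic of the Monte Carlo loop described above is given below, with placeholder functions standing in for the compound-modification moves, the Chai-1 rebuild, and the scoring scheme; it illustrates only the Metropolis acceptance logic and is not the AI-MCLig implementation. The target identifier and starting compound are invented examples.

```python
# Hedged sketch of a Metropolis MC loop over compound modifications (placeholders throughout).
import math
import random

def propose_modification(smiles):
    """Placeholder: return a randomly modified compound (SMILES string)."""
    return smiles  # a real protocol would change atoms/bonds or swap chemical fragments

def predict_and_score(smiles, target):
    """Placeholder for rebuilding the protein-compound complex (e.g. with a structure
    predictor) and scoring it; lower (more negative) values mean better binding."""
    return random.uniform(-10, 0)

target = "P12345"                  # hypothetical protein target identifier
compound = "c1ccccc1O"             # hypothetical starting compound (phenol)
score = predict_and_score(compound, target)
temperature = 1.0

for step in range(100):
    candidate = propose_modification(compound)
    cand_score = predict_and_score(candidate, target)
    # Metropolis criterion: always accept improvements, sometimes accept worse moves.
    if cand_score < score or random.random() < math.exp((score - cand_score) / temperature):
        compound, score = candidate, cand_score

print("final score after 100 steps:", round(score, 2))
```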
Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https://github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.
{"title":"De novo protein ligand design including protein flexibility and conformational adaptation.","authors":"Jakob Agamia, Martin Zacharias","doi":"10.1093/bioinformatics/btag027","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag027","url":null,"abstract":"<p><strong>Motivation: </strong>The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current molecular docking but also fragment-based build-up or machine-learning based generative drug design approaches employ a rigid protein target structure.</p><p><strong>Results: </strong>Based on recent progress in predicting protein structures and complexes with chemical compounds we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC) type simulation to randomly change a chemical compound the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility it allows the protein to adapt to the chemically changing compound. MC-protocols based on atom/bond type changes or based on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands that show very good binding scores comparable to experimentally known binders using several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help for the rapid design of putative binders including induced fit of the protein target.</p><p><strong>Availability and implementation: </strong>Datasets, examples and source code are available on our public GitHub repository https:/github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}