Pub Date : 2026-01-27DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang
Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.
Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.
Availability and implementation: The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration.","authors":"Tien-Thanh Bui, Rui Xie, Wei Zhang","doi":"10.1093/bioinformatics/btag046","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag046","url":null,"abstract":"<p><strong>Motivation: </strong>The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.</p><p><strong>Results: </strong>For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.</p><p><strong>Availability and implementation: </strong>The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-27DOI: 10.1093/bioinformatics/btag048
Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman
Motivation: Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.
Results: In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.
Availability and implementation: The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"How Negative Sampling Shapes the Performance of Transcription Factor Binding Site Prediction Models.","authors":"Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman","doi":"10.1093/bioinformatics/btag048","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag048","url":null,"abstract":"<p><strong>Motivation: </strong>Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.</p><p><strong>Results: </strong>In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell line specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although still not reaching the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide shuffled negatives performed poorly, despite being a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.</p><p><strong>Availability and implementation: </strong>The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-24DOI: 10.1093/bioinformatics/btag040
Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan
Motivation: Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, where they take input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.
Results: In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.
Implementation: The scritps to run the experiments are available at: https://github.com/tastanlab/ohe.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"One-Hot News: Drug Synergy Models Shortcut Molecular Features.","authors":"Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan","doi":"10.1093/bioinformatics/btag040","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag040","url":null,"abstract":"<p><strong>Motivation: </strong>Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, where they take input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.</p><p><strong>Results: </strong>In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.</p><p><strong>Implementation: </strong>The scritps to run the experiments are available at: https://github.com/tastanlab/ohe.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146044327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-24DOI: 10.1093/bioinformatics/btag042
Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan
Motivation: Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance for precise antibody design in diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity-factors that collectively degrade prediction accuracy.
Results: To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.
Availability and implementation: The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.
{"title":"GeoGAD: Geometry-Aware Antibody Design Framework for Complementarity-Determining Region Precision Engineering.","authors":"Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan","doi":"10.1093/bioinformatics/btag042","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag042","url":null,"abstract":"<p><strong>Motivation: </strong>Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance for precise antibody design in diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity-factors that collectively degrade prediction accuracy.</p><p><strong>Results: </strong>To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.</p><p><strong>Availability and implementation: </strong>The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146044396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-22DOI: 10.1093/bioinformatics/btag035
David Kou, Trevor Manz, Tereza Clarence, Nils Gehlenborg
Summary: Uchimata is a toolkit for visualization of 3D structures of genomes. It consists of two packages: a Javascript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to be highly integratable with biological tooling available in Python.
Availability and implementation: Uchimata is released under the MIT License. The Javascript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is available publicly on Github (https://github.com/hms-dbmi/uchimata and https://github.com/hms-dbmi/uchimata-py) and Zenodo (https://doi.org/10.5281/zenodo.17831959 and https://doi.org/10.5281/zenodo.17832045). The documentation with examples is hosted at https://hms-dbmi.github.io/uchimata/.
{"title":"Uchimata: a toolkit for visualization of 3D genome structures on the web and in computational notebooks.","authors":"David Kou, Trevor Manz, Tereza Clarence, Nils Gehlenborg","doi":"10.1093/bioinformatics/btag035","DOIUrl":"10.1093/bioinformatics/btag035","url":null,"abstract":"<p><strong>Summary: </strong>Uchimata is a toolkit for visualization of 3D structures of genomes. It consists of two packages: a Javascript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to be highly integratable with biological tooling available in Python.</p><p><strong>Availability and implementation: </strong>Uchimata is released under the MIT License. The Javascript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is available publicly on Github (https://github.com/hms-dbmi/uchimata and https://github.com/hms-dbmi/uchimata-py) and Zenodo (https://doi.org/10.5281/zenodo.17831959 and https://doi.org/10.5281/zenodo.17832045). The documentation with examples is hosted at https://hms-dbmi.github.io/uchimata/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146020897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-22DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri
Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.
Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.
Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: [10.5281/zenodo.17642591]). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments and is implemented in R/C ++ and runs on Linux systems.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.","authors":"Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri","doi":"10.1093/bioinformatics/btag032","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag032","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.</p><p><strong>Results: </strong>We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.</p><p><strong>Availability: </strong>Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: [10.5281/zenodo.17642591]). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments and is implemented in R/C ++ and runs on Linux systems.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-22DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu
Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.
Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed. fJM-NCC leverages NCC sub-cohort longitudinal biomarker data and full cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.
Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.","authors":"Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu","doi":"10.1093/bioinformatics/btag038","DOIUrl":"10.1093/bioinformatics/btag038","url":null,"abstract":"<p><strong>Motivation: </strong>Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.</p><p><strong>Results: </strong>Motivated by the TEDDY study, we propose \"JM-NCC\", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed. fJM-NCC leverages NCC sub-cohort longitudinal biomarker data and full cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.</p><p><strong>Availability: </strong>Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-22DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias
Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current molecular docking but also fragment-based build-up or machine-learning based generative drug design approaches employ a rigid protein target structure.
Results: Based on recent progress in predicting protein structures and complexes with chemical compounds we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC) type simulation to randomly change a chemical compound the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility it allows the protein to adapt to the chemically changing compound. MC-protocols based on atom/bond type changes or based on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands that show very good binding scores comparable to experimentally known binders using several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help for the rapid design of putative binders including induced fit of the protein target.
Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https:/github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.
{"title":"De novo protein ligand design including protein flexibility and conformational adaptation.","authors":"Jakob Agamia, Martin Zacharias","doi":"10.1093/bioinformatics/btag027","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag027","url":null,"abstract":"<p><strong>Motivation: </strong>The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current molecular docking but also fragment-based build-up or machine-learning based generative drug design approaches employ a rigid protein target structure.</p><p><strong>Results: </strong>Based on recent progress in predicting protein structures and complexes with chemical compounds we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC) type simulation to randomly change a chemical compound the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility it allows the protein to adapt to the chemically changing compound. MC-protocols based on atom/bond type changes or based on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands that show very good binding scores comparable to experimentally known binders using several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help for the rapid design of putative binders including induced fit of the protein target.</p><p><strong>Availability and implementation: </strong>Datasets, examples and source code are available on our public GitHub repository https:/github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-22DOI: 10.1093/bioinformatics/btaf676
Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay
Motivation: Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice these rules are embedded within pipelines, which hinders transparency, audit, and reuse across tools. A unified, portable specification for QV criteria is needed.
Results: Our aim is to embed the concept of a "QV" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV based workflows match conventional methods while offering greater clarity and scalability.
Availability: The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.
{"title":"Application of qualifying variants for genomic analysis.","authors":"Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay","doi":"10.1093/bioinformatics/btaf676","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf676","url":null,"abstract":"<p><strong>Motivation: </strong>Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice these rules are embedded within pipelines, which hinders transparency, audit, and reuse across tools. A unified, portable specification for QV criteria is needed.</p><p><strong>Results: </strong>Our aim is to embed the concept of a \"QV\" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV based workflows match conventional methods while offering greater clarity and scalability.</p><p><strong>Availability: </strong>The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20DOI: 10.1093/bioinformatics/btag020
Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković
Motivation: Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.
Results: Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.
Availability and implementation: https://github.com/Aywells/3Dpfi or https://www3.nd.edu/ cone/3Dpfi.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Unavailability of experimental 3D structural data on protein folding dynamics and necessity for a new generation of structure prediction methods in this context.","authors":"Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković","doi":"10.1093/bioinformatics/btag020","DOIUrl":"10.1093/bioinformatics/btag020","url":null,"abstract":"<p><strong>Motivation: </strong>Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.</p><p><strong>Results: </strong>Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.</p><p><strong>Availability and implementation: </strong>https://github.com/Aywells/3Dpfi or https://www3.nd.edu/ cone/3Dpfi.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}