SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs
Pub Date : 2026-01-22 | DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri
Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.
Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.
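The gain in mixed-model fitting comes from moving genetic relationship matrix (GRM) arithmetic onto GPUs. As a rough illustration of that idea only (not the SAIGE-GPU implementation, which also distributes work across multiple GPUs with a communication layer), the sketch below accumulates a GRM from standardized genotype chunks on a single GPU with CuPy; the chunk size and standardization details are assumptions.

```python
# Minimal sketch (not the SAIGE-GPU code): accumulate a GRM on one GPU with CuPy.
import numpy as np
import cupy as cp

def grm_gpu(genotypes: np.ndarray, chunk: int = 10_000) -> np.ndarray:
    """GRM K = Z Z^T / M for genotypes standardized per variant (samples x variants)."""
    n, m = genotypes.shape
    K = cp.zeros((n, n), dtype=cp.float32)
    for start in range(0, m, chunk):
        g = cp.asarray(genotypes[:, start:start + chunk], dtype=cp.float32)
        p = g.mean(axis=0) / 2.0                                   # allele frequencies
        z = (g - 2.0 * p) / cp.sqrt(2.0 * p * (1.0 - p) + 1e-12)   # standardize each variant
        K += z @ z.T                                               # chunk's contribution
    return cp.asnumpy(K / m)

# Toy example: random 0/1/2 dosages for 1,000 samples and 50,000 variants.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(1_000, 50_000)).astype(np.float32)
K = grm_gpu(G)
```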
Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments, is implemented in R/C++, and runs on Linux systems.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.","authors":"Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri","doi":"10.1093/bioinformatics/btag032","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag032","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.</p><p><strong>Results: </strong>We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.</p><p><strong>Availability: </strong>Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: [10.5281/zenodo.17642591]). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments and is implemented in R/C ++ and runs on Linux systems.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset
Pub Date : 2026-01-22 | DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu
Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.
Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.
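As background for how an NCC sub-cohort can stand in for the full cohort, the sketch below computes Samuelsen-style inclusion probabilities, whose inverses are commonly used as weights in NCC analyses. It is a generic illustration under the stated assumptions (one case plus m controls per risk set), not the fJM-NCC or wJM-NCC estimators themselves.

```python
# Illustrative only (not the JM-NCC estimators): Samuelsen-style inclusion probabilities
# for a nested case-control design with m controls sampled per case.
import numpy as np

def ncc_inclusion_prob(time: np.ndarray, event: np.ndarray, m: int = 1) -> np.ndarray:
    """P(subject ever enters the NCC sub-cohort); cases are sampled with certainty."""
    n = len(time)
    prob = np.ones(n)
    case_times = time[event == 1]
    for i in range(n):
        if event[i] == 1:
            continue                                   # cases keep probability 1
        # case event times at which subject i was still at risk
        eligible = case_times[case_times <= time[i]]
        not_picked = 1.0
        for t in eligible:
            risk_set = int(np.sum(time >= t))          # subjects at risk at time t
            not_picked *= max(0.0, 1.0 - m / max(risk_set - 1, 1))
        prob[i] = 1.0 - not_picked
    return prob

# Inverse-probability weights of the kind used in weighted NCC analyses.
rng = np.random.default_rng(1)
T, E = rng.exponential(5.0, size=500), rng.binomial(1, 0.2, size=500)
weights = 1.0 / np.clip(ncc_inclusion_prob(T, E, m=2), 1e-8, None)
```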
Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.","authors":"Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu","doi":"10.1093/bioinformatics/btag038","DOIUrl":"10.1093/bioinformatics/btag038","url":null,"abstract":"<p><strong>Motivation: </strong>Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.</p><p><strong>Results: </strong>Motivated by the TEDDY study, we propose \"JM-NCC\", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed. fJM-NCC leverages NCC sub-cohort longitudinal biomarker data and full cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.</p><p><strong>Availability: </strong>Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146032072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
De novo protein ligand design including protein flexibility and conformational adaptation
Pub Date : 2026-01-22 | DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias
Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. However, most current approaches, whether molecular docking, fragment-based build-up, or machine-learning-based generative drug design, employ a rigid protein target structure.
Results: Building on recent progress in predicting protein structures and protein-compound complexes, we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC)-type simulation that randomly modifies the chemical compound, the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility, this allows the protein to adapt to the chemically changing compound. MC protocols based on atom/bond type changes or on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands that show very good binding scores, comparable to experimentally known binders, under several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could aid the rapid design of putative binders, including induced fit of the protein target.
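The protocol alternates random chemical moves with a full rebuild of the complex and a Metropolis accept/reject decision. The sketch below shows that loop in schematic form; mutate_compound, predict_complex, and score_binding are hypothetical stubs standing in for the chemical move set, the Chai-1 rebuild, and the scoring schemes, and the temperature and score convention (lower is better) are assumptions.

```python
# Schematic Metropolis Monte-Carlo design loop; the three helpers are hypothetical stubs
# standing in for the chemical move set, the Chai-1 complex rebuild, and the scoring step.
import math
import random

def mutate_compound(smiles: str) -> str:
    return smiles                       # stub: atom/bond change or fragment swap in practice

def predict_complex(smiles: str) -> str:
    return smiles                       # stub: full protein-compound complex prediction in practice

def score_binding(complex_model: str) -> float:
    return random.random()              # stub: docking/affinity score, lower = better (assumed)

def mc_design(seed_smiles: str, n_steps: int = 100, temperature: float = 1.0):
    current, current_score = seed_smiles, score_binding(predict_complex(seed_smiles))
    best, best_score = current, current_score
    for _ in range(n_steps):
        candidate = mutate_compound(current)                    # random chemical move
        cand_score = score_binding(predict_complex(candidate))  # rebuild complex, rescore
        delta = cand_score - current_score
        # Metropolis criterion: accept improvements, sometimes accept worse moves
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = candidate, cand_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

print(mc_design("CCO"))
```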
Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https://github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.
{"title":"De novo protein ligand design including protein flexibility and conformational adaptation.","authors":"Jakob Agamia, Martin Zacharias","doi":"10.1093/bioinformatics/btag027","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag027","url":null,"abstract":"<p><strong>Motivation: </strong>The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current molecular docking but also fragment-based build-up or machine-learning based generative drug design approaches employ a rigid protein target structure.</p><p><strong>Results: </strong>Based on recent progress in predicting protein structures and complexes with chemical compounds we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC) type simulation to randomly change a chemical compound the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility it allows the protein to adapt to the chemically changing compound. MC-protocols based on atom/bond type changes or based on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands that show very good binding scores comparable to experimentally known binders using several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help for the rapid design of putative binders including induced fit of the protein target.</p><p><strong>Availability and implementation: </strong>Datasets, examples and source code are available on our public GitHub repository https:/github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Application of qualifying variants for genomic analysis
Pub Date : 2026-01-22 | DOI: 10.1093/bioinformatics/btaf676
Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay
Motivation: Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice, these rules are embedded within pipelines, which hinders transparency, auditing, and reuse across tools. A unified, portable specification for QV criteria is needed.
Results: Our aim is to embed the concept of a "QV" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV-based workflows match conventional methods while offering greater clarity and scalability.
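A minimal sketch of the decoupling idea, assuming an invented criteria schema (the field names and thresholds below are illustrative, not the published QV specification): criteria live as data and a generic filter applies them, so the same rules can be audited and reused across pipelines.

```python
# Illustrative sketch: qualifying-variant (QV) criteria kept as data, applied by a
# generic filter. The schema and thresholds are hypothetical, not the published spec.
qv_criteria = {
    "max_gnomad_af": 0.001,                    # population allele-frequency ceiling
    "allowed_impacts": {"HIGH", "MODERATE"},   # predicted functional impact classes
    "min_depth": 10,                           # minimum read depth
}

def is_qualifying(variant: dict, criteria: dict) -> bool:
    return (
        variant["gnomad_af"] <= criteria["max_gnomad_af"]
        and variant["impact"] in criteria["allowed_impacts"]
        and variant["depth"] >= criteria["min_depth"]
    )

variants = [
    {"id": "1-12345-A-T", "gnomad_af": 0.0002, "impact": "HIGH", "depth": 34},
    {"id": "2-54321-G-C", "gnomad_af": 0.05,   "impact": "LOW",  "depth": 50},
]
qvs = [v for v in variants if is_qualifying(v, qv_criteria)]   # -> only the first record
```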
Availability: The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.
{"title":"Application of qualifying variants for genomic analysis.","authors":"Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay","doi":"10.1093/bioinformatics/btaf676","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf676","url":null,"abstract":"<p><strong>Motivation: </strong>Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice these rules are embedded within pipelines, which hinders transparency, audit, and reuse across tools. A unified, portable specification for QV criteria is needed.</p><p><strong>Results: </strong>Our aim is to embed the concept of a \"QV\" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV based workflows match conventional methods while offering greater clarity and scalability.</p><p><strong>Availability: </strong>The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unavailability of experimental 3D structural data on protein folding dynamics and necessity for a new generation of structure prediction methods in this context
Pub Date : 2026-01-20 | DOI: 10.1093/bioinformatics/btag020
Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković
Motivation: Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.
Results: Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.
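For readers assessing predictions against intermediate structures, a common first check is backbone agreement. The sketch below superposes Cα atoms of a predicted model onto an experimentally derived intermediate with Biopython and reports the RMSD; the file names and the naive one-to-one residue pairing are assumptions for illustration, not part of this study.

```python
# Sketch of the kind of assessment described: Calpha RMSD between a predicted model and
# an experimentally derived intermediate. File names are placeholders, not study data.
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
pred = parser.get_structure("pred", "predicted_model.pdb")         # e.g., an AlphaFold2 model
inter = parser.get_structure("inter", "folding_intermediate.pdb")  # e.g., an NMR-derived intermediate

pred_ca = [a for a in pred.get_atoms() if a.get_id() == "CA"]
inter_ca = [a for a in inter.get_atoms() if a.get_id() == "CA"]
n = min(len(pred_ca), len(inter_ca))            # naive 1:1 pairing for illustration only

sup = Superimposer()
sup.set_atoms(inter_ca[:n], pred_ca[:n])        # superpose predicted onto the intermediate
print(f"Calpha RMSD: {sup.rms:.2f} A")          # large values flag poor agreement
```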
Availability and implementation: https://github.com/Aywells/3Dpfi or https://www3.nd.edu/~cone/3Dpfi.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Unavailability of experimental 3D structural data on protein folding dynamics and necessity for a new generation of structure prediction methods in this context.","authors":"Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković","doi":"10.1093/bioinformatics/btag020","DOIUrl":"10.1093/bioinformatics/btag020","url":null,"abstract":"<p><strong>Motivation: </strong>Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.</p><p><strong>Results: </strong>Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.</p><p><strong>Availability and implementation: </strong>https://github.com/Aywells/3Dpfi or https://www3.nd.edu/ cone/3Dpfi.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aggregation of gene regulatory information and knowledge on FAIR principles enables discovery of pathogenic gene regulatory variants
Pub Date : 2026-01-20 | DOI: 10.1093/bioinformatics/btag013
Keyang Yu, Haoquan Zhao, Andrea Wilderman, Tierra Farris, Jessie Arce, David Chen, Andrew R Jackson, Yiran Guo, Qi Li, Bosko Jevtic, Dubravka Jevtic, Vuk Milinovic, Yuankun Zhu, Jeremy Costanza, Eric Wenger, Chris Nemarich, Lisa Anderson, Aleksandar Mihajlović, Kristin Ardlie, Shaine A Morris, Matthew Roth, Deanne M Taylor, Adam C Resnick, Lilei Zhang, Aleksandar Milosavljevic
Motivation: Methods for sharing gene regulatory information and knowledge on FAIR principles, particularly in the context of tissue-specific gene regulation, remain poorly defined and implemented, hampering discovery and clinical genetic diagnosis.
Results: We specified FAIR principles for tissue-specific gene regulatory information and knowledge; implemented them by developing a registry of regulatory elements and aggregating FAIR gene regulatory information from several major sources; developed computational tools that utilize these FAIR resources; and demonstrated their utility by associating gene regulatory variants with major subtypes of congenital heart disease.
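Purely as an illustration of what a FAIR-style registry entry and lookup might look like (the schema, identifiers, coordinates, and gene names below are invented, not the registry described here), consider a record carrying a stable ID, genomic interval, tissue context, and evidence, queried by variant position.

```python
# Hypothetical illustration only: an invented FAIR-style registry record for a regulatory
# element and a simple position/tissue query. Not the registry described in the paper.
registry = [
    {"id": "RE0000001", "assembly": "GRCh38", "chrom": "chr1",
     "start": 1_000_000, "end": 1_000_900, "tissue": "heart",
     "evidence": ["ATAC-seq", "H3K27ac"], "target_genes": ["EXAMPLE1"]},
]

def query_by_region(chrom: str, pos: int, tissue: str) -> list[dict]:
    """Return registry elements overlapping a variant position in a given tissue."""
    return [r for r in registry
            if r["chrom"] == chrom and r["start"] <= pos <= r["end"] and r["tissue"] == tissue]

hits = query_by_region("chr1", 1_000_500, "heart")   # -> elements overlapping the variant
```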
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Aggregation of gene regulatory information and knowledge on FAIR principles enables discovery of pathogenic gene regulatory variants.","authors":"Keyang Yu, Haoquan Zhao, Andrea Wilderman, Tierra Farris, Jessie Arce, David Chen, Andrew R Jackson, Yiran Guo, Qi Li, Bosko Jevtic, Dubravka Jevtic, Vuk Milinovic, Yuankun Zhu, Jeremy Costanza, Eric Wenger, Chris Nemarich, Lisa Anderson, Aleksandar Mihajlović, Kristin Ardlie, Shaine A Morris, Matthew Roth, Deanne M Taylor, Adam C Resnick, Lilei Zhang, Aleksandar Milosavljevic","doi":"10.1093/bioinformatics/btag013","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag013","url":null,"abstract":"<p><strong>Motivation: </strong>Methods for sharing gene regulatory information and knowledge on FAIR principles-particularly in the context of tissue-specific gene regulation-remain poorly defined and implemented, hampering discovery and clinical genetic diagnosis.</p><p><strong>Results: </strong>We specified FAIR principles for tissue-specific gene regulatory information and knowledge; implemented them by developing a registry of regulatory elements and aggregating FAIR gene regulatory information from several major sources; developed computational tools that utilize these FAIR resources; and demonstrated their utility by associating gene regulatory variants with major subtypes of congenital heart disease.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
textToKnowledgeGraph: Generation of Molecular Interaction Knowledge Graphs Using Large Language Models for Exploration in Cytoscape
Pub Date : 2026-01-19 | DOI: 10.1093/bioinformatics/btag031
Favour James, Dexter Pratt, Christopher Churas, Augustin Luna
Motivation: Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to represent data and improve queries across heterogeneous datasets. However, constructing KGs from unstructured literature remains challenging due to the cost and expertise required for manual curation. Prior work has explored text-mining techniques to automate this process, but these have limitations that impact their ability to fully capture complex relationships. Traditional text-mining methods struggle to understand context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of concepts indirectly described in the text. Large Language Models (LLMs) present an opportunity to overcome these challenges: LLMs are trained on diverse literature, equipping them with contextual knowledge that enables more accurate information extraction.
Results: We present textToKnowledgeGraph, an artificial intelligence tool that uses LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact, detailed representation of biological relationships, enabling structured, computationally accessible encoding. This work makes several contributions: (1) the open-source Python textToKnowledgeGraph package (pypi.org/project/texttoknowledgegraph) for BEL extraction from scientific articles, usable from the command line and within other projects; (2) an interactive application within Cytoscape Web to simplify extraction and exploration; and (3) a dataset of extractions that have been both computationally and manually reviewed to support future fine-tuning efforts.
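A schematic of the extraction pattern, not the textToKnowledgeGraph API: prompt an LLM for one BEL statement per line and collect simple subject-relation-object edges. The prompt wording and the call_llm stub are hypothetical placeholders, and real BEL terms can be more complex than the three-token statements parsed here.

```python
# Schematic of LLM-based BEL extraction (not the textToKnowledgeGraph package API).
PROMPT = """Extract molecular interactions from the passage below as one
Biological Expression Language (BEL) statement per line, e.g.
p(HGNC:TP53) increases p(HGNC:CDKN1A)

Passage:
{passage}
"""

def call_llm(prompt: str) -> str:
    # Hypothetical stub; in practice, replace with a real chat-completion client.
    return "p(HGNC:EGFR) increases p(HGNC:MAPK1)"

def extract_bel(passage: str) -> list[tuple[str, str, str]]:
    edges = []
    for line in call_llm(PROMPT.format(passage=passage)).splitlines():
        parts = line.split()
        if len(parts) == 3:                      # subject, relation, object (simplified)
            edges.append((parts[0], parts[1], parts[2]))
    return edges

print(extract_bel("EGFR signalling activates MAPK1 in these cells."))
```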
{"title":"textToKnowledgeGraph: Generation of Molecular Interaction Knowledge Graphs Using Large Language Models for Exploration in Cytoscape.","authors":"Favour James, Dexter Pratt, Christopher Churas, Augustin Luna","doi":"10.1093/bioinformatics/btag031","DOIUrl":"10.1093/bioinformatics/btag031","url":null,"abstract":"<p><strong>Motivation: </strong>Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to represent data and improve queries across heterogeneous datasets. However, constructing KGs from unstructured literature remains challenging due to the cost and expertise required for manual curation. Prior works have explored text-mining techniques to automate this process, but have limitations that impact their ability to capture complex relationships fully. Traditional text-mining methods struggle with understanding context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of concepts indirectly described in the text. Large Language Models (LLMs) present an opportunity to overcome these challenges. LLMs are trained on diverse literature, equipping them with contextual knowledge that enables more accurate information extraction.</p><p><strong>Results: </strong>We present textToKnowledgeGraph, an artificial intelligence tool using LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact, detailed representation of biological relationships, enabling structured, computationally accessible encoding. This work makes several contributions. 1. Development of the open-source Python textToKnowledgeGraph package (pypi.org/project/texttoknowledgegraph) for BEL extraction from scientific articles, usable from the command line and within other projects, 2. An interactive application within Cytoscape Web to simplify extraction and exploration, 3. A dataset of extractions that have been both computationally and manually reviewed to support future fine-tuning efforts.</p><p><strong>Availability: </strong>https://github.com/ndexbio/llm-text-to-knowledge-graph.</p><p><strong>Contact: </strong>augustin@nih.gov; favour.ujames196@gmail.com; depratt@health.ucsd.edu.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uniform Design-Embedded Predictions of (Tetra-)Peptide Physicochemical Properties
Pub Date : 2026-01-19 | DOI: 10.1093/bioinformatics/btag036
Zhihui Zhu, Huapeng Liu, Xuechen Li, Haojin Zhou, Jiaqi Wang
Motivation: Short peptides hold significant promise in drug discovery and materials science owing to their biocompatibility, multifunctionality, and ease of synthesis. However, accurately predicting their physicochemical properties, a prerequisite for application development, remains a grand challenge due to the sheer quantity of possible peptide sequences.
Results: This study presents an innovative approach that integrates uniform design (UD) for sampling the whole sequence space with artificial intelligence (AI) modeling of the sampled data to enhance prediction of key physicochemical properties, including aggregation propensity (AP), hydrophilicity (logP), and isoelectric point (pI), within the complete sequence space of tetrapeptides (160,000 sequences). Using UD, we generate 31 distinct peptide datasets, each with a consistent amino acid occupation fraction of 5% at every position, thereby providing unbiased, preference-free training data for the AI models. This work provides comprehensive datasets on the physicochemical properties of all tetrapeptides, develops robust AI-based predictive models, and quantitatively elucidates the relationships between key physicochemical attributes and the self-assembly behavior of short peptides through Shapley Additive Explanations (SHAP) analysis. By integrating strategic experimental design (i.e., UD), AI modeling, and peptide domain knowledge, our approach facilitates the discovery and optimization of functional peptides, offering new opportunities for peptide-based therapeutic applications.
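The 5% occupation fraction follows from the 20 standard amino acids each taking an equal share of every position. The sketch below reproduces that balance property with simple column-wise permutations; it is not the uniform-design construction used in the study, and the dataset size is an assumption.

```python
# Illustrative balanced sampling of tetrapeptides: every amino acid fills exactly 5%
# of every position. Not the paper's uniform-design tables.
import numpy as np
from collections import Counter

AAS = list("ACDEFGHIKLMNPQRSTVWY")            # 20 standard amino acids -> 1/20 = 5% each

def balanced_tetrapeptides(n_per_aa: int = 10, seed: int = 0) -> list[str]:
    """Return a tetrapeptide set with equal amino acid occupation at each position."""
    rng = np.random.default_rng(seed)
    columns = []
    for _ in range(4):                         # one column per peptide position
        col = np.repeat(AAS, n_per_aa)         # each amino acid appears n_per_aa times
        rng.shuffle(col)                       # random order, balance preserved
        columns.append(col)
    return ["".join(p) for p in zip(*columns)]

peptides = balanced_tetrapeptides()
print(Counter(p[0] for p in peptides))         # position 1: every amino acid counted 10 times
```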
Availability: The complete datasets, source code, and pre-trained models are made available at the Github repository (https://github.com/JiaqiBenWang/UD-AI-Peptide) and Zenodo (https://doi.org/10.5281/zenodo.17984124).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Uniform Design-Embedded Predictions of (Tetra-)Peptide Physicochemical Properties.","authors":"Zhihui Zhu, Huapeng Liu, Xuechen Li, Haojin Zhou, Jiaqi Wang","doi":"10.1093/bioinformatics/btag036","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag036","url":null,"abstract":"<p><strong>Motivation: </strong>Short peptides hold significant promise in drug discovery and materials science due to their biocompatibility, multifunctionality, ease of synthesis, etc. However, accurately predicting their physicochemical properties, a prerequisite for application development, remains a grand challenge due to the sheet quantity of peptides.</p><p><strong>Results: </strong>This study presents an innovative approach integrating uniform design (UD) on the sampling over the whole space with artificial intelligence (AI) on the sampled data to enhance prediction of key physicochemical properties, including aggregation propensity (AP), hydrophilicity (logP), and isoelectric point (pI), within the complete sequence space of tetrapeptides (160,000 sequences). Using UD, we generate 31 distinct peptide datasets, with a consistent amino acid occupation fraction of 5% at each position, thereby creating unbiased training data without any amino acid preferences for training AI models. This work provides comprehensive datasets on the physicochemical properties of all tetrapeptides, develops robust AI-based predictive models, and quantitatively elucidates the relationships between key physicochemical attributes and self-assembly behaviors of short peptides by Shapley Additive Explanations (SHAP) analysis. By integrating the strategic experimental design (i.e., UD), AI modeling, and peptide domain knowledge, our approach facilitates the discovery and optimization of functional peptides, offering new opportunities for peptide-based therapeutic applications.</p><p><strong>Availability: </strong>The complete datasets, source code, and pre-trained models are made available at the Github repository (https://github.com/JiaqiBenWang/UD-AI-Peptide) and Zenodo (https://doi.org/10.5281/zenodo.17984124).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iModMix: Integrative Module Analysis for Multi-omics Data
Pub Date : 2026-01-19 | DOI: 10.1093/bioinformatics/btag030
Isis Narváez-Bandera, Ashley Lui, Yonatan Ayalew Mekonnen, Vanessa Rubio, Augustine Takyi, Noah Sulman, Christopher Wilson, Hayley D Ackerman, Oscar E Ospina, Guillermo Gonzalez-Calderon, Elsa Flores, Qian Li, Ann Chen, Brooke Fridley, Paul Stewart
Summary: Integrative Module Analysis for Multi-omics Data (iModMix) is a biology-agnostic framework that enables the discovery of novel associations across any type of quantitative abundance data, including but not limited to transcriptomics, proteomics, and metabolomics. Instead of relying on pathway annotations or prior biological knowledge, iModMix constructs data-driven modules using graphical lasso to estimate sparse networks from omics features. These modules are summarized into eigenfeatures and correlated across datasets for horizontal integration, while preserving the distinct feature sets and interpretability of each omics type. iModMix operates directly on matrices containing expression or abundances for a wide range of features, including but not limited to genes, proteins, and metabolites. Because it does not rely on annotations (e.g., KEGG identifiers), it can seamlessly incorporate both identified and unidentified metabolites, addressing a key limitation of many existing metabolomics tools. iModMix is available as a user-friendly R Shiny application requiring no programming expertise (https://imodmix.moffitt.org), and as a Bioconductor R package for advanced users (https://bioconductor.org/packages/release/bioc/html/iModMix.html). The tool includes several public and in-house datasets to illustrate its utility in identifying novel multi-omics relationships in diverse biological contexts.
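A conceptual sketch of the workflow in Python (the package itself is an R/Bioconductor implementation): estimate a sparse network per omics layer with the graphical lasso, cut it into modules by hierarchical clustering, summarize each module by its first principal component, and correlate eigenfeatures across layers. The module count, penalty, and distance construction here are illustrative assumptions, not the iModMix defaults.

```python
# Conceptual sketch of an iModMix-style pipeline (not the R implementation):
# sparse network -> modules -> first-PC eigenfeatures -> cross-omics correlation.
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

def eigenfeatures(X: np.ndarray, n_modules: int = 5, alpha: float = 0.1) -> np.ndarray:
    """X: samples x features for one omics layer. Returns samples x modules eigenfeatures."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)                      # standardize features
    prec = GraphicalLasso(alpha=alpha).fit(X).precision_          # sparse partial correlations
    dist = 1.0 - np.abs(prec) / np.abs(prec).max()                # crude network distance
    Z = linkage(dist[np.triu_indices_from(dist, 1)], "average")   # hierarchical clustering
    labels = fcluster(Z, n_modules, criterion="maxclust")         # assign features to modules
    return np.column_stack([PCA(1).fit_transform(X[:, labels == k]).ravel()
                            for k in np.unique(labels)])          # first PC per module

rng = np.random.default_rng(0)
prot = rng.normal(size=(50, 30))      # toy proteomics layer (samples x features)
metab = rng.normal(size=(50, 20))     # toy metabolomics layer
E1, E2 = eigenfeatures(prot), eigenfeatures(metab)
cross = np.corrcoef(E1, E2, rowvar=False)[:E1.shape[1], E1.shape[1]:]  # module-module correlations
```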
Availability and implementation: iModMix is freely available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/iModMix.html), and the example dataset package (iModMixData) is also available from Bioconductor (https://bioconductor.org/packages/release/data/experiment/html/iModMixData.html). The R package source code and Docker are available from GitHub: https://github.com/biodatalab/iModMix. The Shiny application can be accessed at https://imodmix.moffitt.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"iModMix: Integrative Module Analysis for Multi-omics Data.","authors":"Isis Narváez-Bandera, Ashley Lui, Yonatan Ayalew Mekonnen, Vanessa Rubio, Augustine Takyi, Noah Sulman, Christopher Wilson, Hayley D Ackerman, Oscar E Ospina, Guillermo Gonzalez-Calderon, Elsa Flores, Qian Li, Ann Chen, Brooke Fridley, Paul Stewart","doi":"10.1093/bioinformatics/btag030","DOIUrl":"10.1093/bioinformatics/btag030","url":null,"abstract":"<p><strong>Summary: </strong>Integrative Module Analysis for Multi-omics Data (iModMix) is a biology-agnostic framework that enables the discovery of novel associations across any type of quantitative abundance data, including but not limited to transcriptomics, proteomics, and metabolomics. Instead of relying on pathway annotations or prior biological knowledge, iModMix constructs data-driven modules using graphical lasso to estimate sparse networks from omics features. These modules are summarized into eigenfeatures and correlated across datasets for horizontal integration, while preserving the distinct feature sets and interpretability of each omics type. iModMix operates directly on matrices containing expression or abundances for a wide range of features, including but not limited to genes, proteins, and metabolites. Because it does not rely on annotations (e.g., KEGG identifiers), it can seamlessly incorporate both identified and unidentified metabolites, addressing a key limitation of many existing metabolomics tools. iModMix is available as a user-friendly R Shiny application requiring no programming expertise (https://imodmix.moffitt.org), and as a Bioconductor R package for advanced users (https://bioconductor.org/packages/release/bioc/html/iModMix.html). The tool includes several public and in-house datasets to illustrate its utility in identifying novel multi-omics relationships in diverse biological contexts.</p><p><strong>Availability and implementation: </strong>iModMix is freely available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/iModMix.html) and the example dataset package (iModMixData) is also available from Bioconductor (https://bioconductor.org/packages/release/ data/experiment/html/iModMixData.html). The R package source code and Docker is available from GitHub: https://github.com/biodatalab/iModMix. Shiny application can be accessed at: https://imodmix.moffitt.org.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CADS: A Causal Inference Framework for Identifying Essential Genes to Enhance Drug Synergy Prediction
Pub Date : 2026-01-14 | DOI: 10.1093/bioinformatics/btag010
Huaiwu Zhang, Xinliang Sun, Jianxin Wang, Min Li, Jing Tang
Motivation: Drug synergy is crucial for developing effective combination therapies, but traditional screening methods suffer from inefficiency and high costs. While deep learning shows promise for predicting drug synergy, current approaches using Transformers and graph neural networks focus on combining drug and cell line features without modelling how genes causally influence drug responses.
Results: To address this limitation, we propose CADS (Causal Adjustment for Drug Synergy), a deep learning framework that integrates causal relationships between genes and drug responses. Leveraging multi-omics data, CADS uses a learnable mask mechanism to identify key causal genes while filtering out irrelevant genetic factors through backdoor adjustment. Our model achieves two key objectives simultaneously: accurate prediction of drug synergy and interpretable causal gene discovery. Experiments on multiple datasets show that CADS consistently outperforms state-of-the-art methods across multiple metrics. Case studies demonstrate that CADS can reduce unnecessary complexity while providing more biological insights through its gene importance scores, which help identify clinically validated cancer-related genes that mediate drug interactions.
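To make the learnable-mask idea concrete, the PyTorch sketch below gates gene features with a sigmoid mask trained under an L1 sparsity penalty so that the surviving genes can be read off as candidate drivers. It is an illustrative reduction under these assumptions, not the published CADS architecture or its backdoor-adjustment machinery.

```python
# Illustrative PyTorch sketch of a learnable gene mask (not the published CADS model):
# a sigmoid gate selects genes, with an L1 penalty encouraging a sparse subset.
import torch
import torch.nn as nn

class MaskedSynergyNet(nn.Module):
    def __init__(self, n_genes: int, drug_dim: int, hidden: int = 128):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(n_genes))   # learnable per-gene gate
        self.net = nn.Sequential(
            nn.Linear(n_genes + 2 * drug_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, gene_expr, drug_a, drug_b):
        gate = torch.sigmoid(self.mask_logits)                  # values in (0, 1) per gene
        x = torch.cat([gene_expr * gate, drug_a, drug_b], dim=-1)
        return self.net(x).squeeze(-1), gate

model = MaskedSynergyNet(n_genes=500, drug_dim=64)
genes, da, db = torch.randn(8, 500), torch.randn(8, 64), torch.randn(8, 64)
pred, gate = model(genes, da, db)
loss = nn.functional.mse_loss(pred, torch.randn(8)) + 1e-3 * gate.sum()  # sparsity penalty
loss.backward()                                                 # gate gradients rank gene importance
```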
Availability and implementation: Taken together, CADS advances combination therapy prediction by explicitly modelling the causal genes underlying drug synergy, offering enhanced interpretability for AI-based drug development. The source code can be found at https://github.com/HuaiwuZhang/causalDC.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"CADS: A Causal Inference Framework for Identifying Essential Genes to Enhance Drug Synergy Prediction.","authors":"Huaiwu Zhang, Xinliang Sun, Jianxin Wang, Min Li, Jing Tang","doi":"10.1093/bioinformatics/btag010","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag010","url":null,"abstract":"<p><strong>Motivation: </strong>Drug synergy is crucial for developing effective combination therapies, but traditional screening methods suffer from inefficiency and high costs. While deep learning shows promise for predicting drug synergy, current approaches using Transformers and graph neural networks focus on combining drug and cell line features without modelling how genes causally influence drug responses.</p><p><strong>Results: </strong>To address this limitation, we propose CADS (Causal Adjustment for Drug Synergy), a deep learning framework that integrates causal relationships between genes and drug responses. Leveraging multi-omics data, CADS uses a learnable mask mechanism to identify key causal genes while filtering out irrelevant genetic factors through backdoor adjustment. Our model achieves two key objectives simultaneously: accurate prediction of drug synergy and interpretable causal gene discovery. Experiments on multiple datasets show that CADS consistently outperforms state-of-the-art methods across multiple metrics. Case studies demonstrate that CADS can reduce unnecessary complexity while providing more biological insights through its gene importance scores, which help identify clinically validated cancer-related genes that mediate drug interactions.</p><p><strong>Availability and implementation: </strong>Taken together, CADS advances combination therapy prediction by explicitly modelling drug synergy causal genes, offering enhanced interpretability for AI-based drug development. The source code can be found at https://github.com/HuaiwuZhang/causalDC.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}