
Bioinformatics (Oxford, England): latest articles

SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri

Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.

Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.
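
As a rough illustration of why genetic relationship matrix (GRM) calculations lend themselves to distribution, the sketch below computes a GRM from standardized genotypes one row block at a time; each block is independent work that could, in principle, be handed to a separate device. This is plain Python and not the SAIGE-GPU implementation: the function names, the block size, and the toy genotype matrix are all assumptions made for illustration.

```python
from math import sqrt

def standardize(genotypes):
    """Standardize each variant column: (g - 2p) / sqrt(2p(1 - p))."""
    n = len(genotypes)
    m = len(genotypes[0])
    z = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2.0 * n)  # allele frequency
        sd = sqrt(2.0 * p * (1.0 - p))
        if sd == 0.0:
            continue  # monomorphic variant carries no information; leave zeros
        for i in range(n):
            z[i][j] = (genotypes[i][j] - 2.0 * p) / sd
    return z

def grm_block(z, rows):
    """Compute the requested rows of A = Z Z^T / M (independent of other blocks)."""
    m = len(z[0])
    return [[sum(a * b for a, b in zip(z[i], z[k])) / m for k in range(len(z))]
            for i in rows]

def grm_blockwise(genotypes, block_size=2):
    """Assemble the full GRM from row blocks (computed serially in this sketch)."""
    z = standardize(genotypes)
    n = len(z)
    grm = []
    for start in range(0, n, block_size):
        grm.extend(grm_block(z, range(start, min(start + block_size, n))))
    return grm

# Toy data (hypothetical): 4 individuals x 3 variants, coded as 0/1/2 allele counts.
toy = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
A = grm_blockwise(toy)
```

The result does not depend on the block size, which is what makes the decomposition a natural unit for parallel work; a real multi-GPU version would additionally need to move genotype blocks between devices and gather the partial results.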

Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is implemented in R/C++, runs on Linux systems, and is available in a containerized format for use across HPC and cloud environments.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu

Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.

Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.
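
To make the notion of a cause-specific hazard concrete, the sketch below computes a Nelson-Aalen estimate of the cumulative hazard for one event cause, treating events of other causes as removing the subject from the risk set. It is a far simpler, nonparametric calculation than the JM-NCC joint model and is not taken from the JMNCC software; the function name, the coding of causes (0 for censoring), and the toy data are assumptions for illustration.

```python
def cause_specific_cumulative_hazard(times, causes, cause):
    """Nelson-Aalen estimate of the cumulative hazard for one event cause.

    causes[i] is 0 for censoring or a positive integer event type.
    Events of other causes shrink the risk set without contributing
    to this cause's hazard. Returns (time, cumulative_hazard) pairs at
    each time where an event of the requested cause occurs.
    """
    curve = []
    cumulative = 0.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, c in zip(times, causes) if u == t and c == cause)
        if events:
            cumulative += events / at_risk
            curve.append((t, cumulative))
    return curve

# Toy follow-up data (hypothetical): two events of cause 1, one of cause 2, one censored.
times = [1.0, 2.0, 3.0, 4.0]
causes = [1, 2, 1, 0]
hazard_cause1 = cause_specific_cumulative_hazard(times, causes, 1)
hazard_cause2 = cause_specific_cumulative_hazard(times, causes, 2)
```

For cause 1 the increments are 1/4 at time 1 and 1/2 at time 3, because the competing event at time 2 has already removed one subject from the risk set.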

Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
De novo protein ligand design including protein flexibility and conformational adaptation.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias

Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current approaches, whether molecular docking, fragment-based build-up, or machine-learning-based generative drug design, employ a rigid protein target structure.

Results: Based on recent progress in predicting the structures of proteins and their complexes with chemical compounds, we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte Carlo (MC) type simulation that randomly changes the chemical compound, the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility, this allows the protein to adapt to the chemically changing compound. MC protocols based on atom/bond type changes or on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands with very good binding scores, comparable to those of experimentally known binders, under several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help in the rapid design of putative binders, including induced fit of the protein target.
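
The overall shape of such an MC search can be sketched as follows. This is a toy under stated assumptions, not the AI-MCLig code: the compound is a short string over a made-up atom alphabet, `toy_score` is a hypothetical stand-in for rebuilding the complex with a structure predictor and scoring it, and the Metropolis acceptance rule with a fixed temperature is an assumption, since the abstract only says "MC type simulation".

```python
import math
import random

ATOM_TYPES = "CNOS"  # toy alphabet standing in for atom-type changes

def propose(compound, rng):
    """Randomly change one position of the toy compound string."""
    pos = rng.randrange(len(compound))
    new_atom = rng.choice([a for a in ATOM_TYPES if a != compound[pos]])
    return compound[:pos] + new_atom + compound[pos + 1:]

def toy_score(compound):
    """Hypothetical stand-in for re-predicting and scoring the complex.

    Lower is better; this toy simply rewards nitrogen and oxygen atoms.
    """
    return -float(sum(compound.count(a) for a in "NO"))

def mc_optimize(start, score, steps=200, temperature=0.5, seed=0):
    """Metropolis-style search: accept improvements, occasionally accept worse moves."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)  # a real run would rebuild the complex here
        delta = candidate_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

best, best_score = mc_optimize("CCCCCC", toy_score)
```

Because the scoring function is passed in as an argument, the same loop could wrap any scorer; in the approach described above, that call is where the expensive structure prediction would sit, so the number of MC steps directly controls the cost.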

Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https://github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.

Citations: 0
Application of qualifying variants for genomic analysis.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btaf676
Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay

Motivation: Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice these rules are embedded within pipelines, which hinders transparency, audit, and reuse across tools. A unified, portable specification for QV criteria is needed.

Results: Our aim is to embed the concept of a "QV" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV-based workflows match conventional methods while offering greater clarity and scalability.
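
One way to picture "decoupling QV criteria from pipeline variables and code" is to hold the criteria as plain data and apply them with a small generic function. The sketch below does this in Python; it is not the QV framework's actual specification format, and the field names (`allele_frequency`, `impact`, `depth`), thresholds, and variant records are all hypothetical.

```python
import operator

# Hypothetical QV criteria, kept as data so they can be shared and audited
# independently of the pipeline code that applies them.
QV_CRITERIA = [
    {"field": "allele_frequency", "op": "<=", "value": 0.01},
    {"field": "impact", "op": "in", "value": ["HIGH", "MODERATE"]},
    {"field": "depth", "op": ">=", "value": 10},
]

OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "in": lambda observed, allowed: observed in allowed,
}

def is_qualifying(variant, criteria):
    """Return True if the variant satisfies every criterion in the list."""
    return all(
        OPERATORS[rule["op"]](variant[rule["field"]], rule["value"])
        for rule in criteria
    )

def select_qvs(variants, criteria):
    """Filter a list of variant records down to the qualifying variants."""
    return [v for v in variants if is_qualifying(v, criteria)]

# Made-up variant records for illustration.
variants = [
    {"id": "var1", "allele_frequency": 0.001, "impact": "HIGH", "depth": 35},
    {"id": "var2", "allele_frequency": 0.20, "impact": "HIGH", "depth": 40},
    {"id": "var3", "allele_frequency": 0.005, "impact": "LOW", "depth": 22},
    {"id": "var4", "allele_frequency": 0.004, "impact": "MODERATE", "depth": 8},
]
qualifying_ids = [v["id"] for v in select_qvs(variants, QV_CRITERIA)]
```

Here only `var1` passes all three rules. Because the criteria live in a list rather than in conditionals scattered through the code, the same list could be serialized, versioned, and reused by a different tool without touching `select_qvs`.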

Availability: The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.

Citations: 0
Unavailability of experimental 3D structural data on protein folding dynamics and necessity for a new generation of structure prediction methods in this context.
IF 5.4 Pub Date : 2026-01-20 DOI: 10.1093/bioinformatics/btag020
Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković

Motivation: Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.

Results: Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.

Availability and implementation: https://github.com/Aywells/3Dpfi or https://www3.nd.edu/ cone/3Dpfi.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Aggregation of gene regulatory information and knowledge on FAIR principles enables discovery of pathogenic gene regulatory variants.
IF 5.4 Pub Date : 2026-01-20 DOI: 10.1093/bioinformatics/btag013
Keyang Yu, Haoquan Zhao, Andrea Wilderman, Tierra Farris, Jessie Arce, David Chen, Andrew R Jackson, Yiran Guo, Qi Li, Bosko Jevtic, Dubravka Jevtic, Vuk Milinovic, Yuankun Zhu, Jeremy Costanza, Eric Wenger, Chris Nemarich, Lisa Anderson, Aleksandar Mihajlović, Kristin Ardlie, Shaine A Morris, Matthew Roth, Deanne M Taylor, Adam C Resnick, Lilei Zhang, Aleksandar Milosavljevic

Motivation: Methods for sharing gene regulatory information and knowledge on FAIR principles, particularly in the context of tissue-specific gene regulation, remain poorly defined and implemented, hampering discovery and clinical genetic diagnosis.

Results: We specified FAIR principles for tissue-specific gene regulatory information and knowledge; implemented them by developing a registry of regulatory elements and aggregating FAIR gene regulatory information from several major sources; developed computational tools that utilize these FAIR resources; and demonstrated their utility by associating gene regulatory variants with major subtypes of congenital heart disease.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
textToKnowledgeGraph: Generation of Molecular Interaction Knowledge Graphs Using Large Language Models for Exploration in Cytoscape.
IF 5.4 Pub Date : 2026-01-19 DOI: 10.1093/bioinformatics/btag031
Favour James, Dexter Pratt, Christopher Churas, Augustin Luna

Motivation: Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to represent data and improve queries across heterogeneous datasets. However, constructing KGs from unstructured literature remains challenging due to the cost and expertise required for manual curation. Prior works have explored text-mining techniques to automate this process, but have limitations that impact their ability to capture complex relationships fully. Traditional text-mining methods struggle with understanding context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of concepts indirectly described in the text. Large Language Models (LLMs) present an opportunity to overcome these challenges. LLMs are trained on diverse literature, equipping them with contextual knowledge that enables more accurate information extraction.

Results: We present textToKnowledgeGraph, an artificial intelligence tool using LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact, detailed representation of biological relationships, enabling structured, computationally accessible encoding. This work makes three contributions: (1) the open-source Python textToKnowledgeGraph package (pypi.org/project/texttoknowledgegraph) for BEL extraction from scientific articles, usable from the command line and within other projects; (2) an interactive application within Cytoscape Web to simplify extraction and exploration; and (3) a dataset of extractions that have been both computationally and manually reviewed to support future fine-tuning efforts.
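
To show why BEL output is convenient for graph construction, the sketch below turns simple "subject relation object" BEL statements into an adjacency map. It does not use or reproduce the textToKnowledgeGraph package's API; the parser, the function names, and the hand-written example statements are assumptions, and only a small subset of BEL relationship keywords is handled.

```python
from collections import defaultdict

# A subset of BEL relationship keywords, enough for this illustration.
RELATIONS = {
    "increases", "decreases", "directlyIncreases", "directlyDecreases",
    "regulates", "association",
}

def parse_statement(statement):
    """Split a simple 'subject relation object' BEL statement into a triple.

    Only handles statements whose terms contain no spaces; nested
    statements and annotations are out of scope for this sketch.
    """
    parts = statement.split()
    if len(parts) != 3 or parts[1] not in RELATIONS:
        raise ValueError(f"unsupported statement: {statement!r}")
    return tuple(parts)

def build_graph(statements):
    """Build an adjacency map: subject -> list of (relation, object)."""
    graph = defaultdict(list)
    for statement in statements:
        subject, relation, obj = parse_statement(statement)
        graph[subject].append((relation, obj))
    return dict(graph)

# Hand-written example statements, not output from textToKnowledgeGraph.
statements = [
    "p(HGNC:TP53) increases p(HGNC:CDKN1A)",
    "p(HGNC:MDM2) directlyDecreases p(HGNC:TP53)",
    "p(HGNC:TP53) increases bp(GO:apoptosis)",
]
graph = build_graph(statements)
```

Each statement becomes one directed, labeled edge, so the resulting map can be handed to any graph library or network viewer with little further processing.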

Availability: https://github.com/ndexbio/llm-text-to-knowledge-graph.

Contact: augustin@nih.gov; favour.ujames196@gmail.com; depratt@health.ucsd.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Uniform Design-Embedded Predictions of (Tetra-)Peptide Physicochemical Properties.
IF 5.4 Pub Date : 2026-01-19 DOI: 10.1093/bioinformatics/btag036
Zhihui Zhu, Huapeng Liu, Xuechen Li, Haojin Zhou, Jiaqi Wang

Motivation: Short peptides hold significant promise in drug discovery and materials science due to their biocompatibility, multifunctionality, and ease of synthesis. However, accurately predicting their physicochemical properties, a prerequisite for application development, remains a grand challenge due to the sheer number of possible peptides.

Results: This study presents an approach that combines uniform design (UD), used to sample the whole sequence space, with artificial intelligence (AI) models trained on the sampled data, to improve prediction of key physicochemical properties, including aggregation propensity (AP), hydrophilicity (logP), and isoelectric point (pI), within the complete sequence space of tetrapeptides (160,000 sequences). Using UD, we generate 31 distinct peptide datasets, each with a consistent amino acid occupation fraction of 5% at every position, thereby creating unbiased training data without any amino acid preference. This work provides comprehensive datasets on the physicochemical properties of all tetrapeptides, develops robust AI-based predictive models, and quantitatively elucidates the relationships between key physicochemical attributes and the self-assembly behavior of short peptides through Shapley Additive Explanations (SHAP) analysis. By integrating strategic experimental design (i.e., UD), AI modeling, and peptide domain knowledge, our approach facilitates the discovery and optimization of functional peptides, offering new opportunities for peptide-based therapeutic applications.
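
The two numbers in this abstract follow from simple counting: 20 amino acids at 4 positions give 20^4 = 160,000 tetrapeptides, and equal usage of 20 residues at a position means a fraction of 1/20 = 5% each. The sketch below illustrates that balance property with a Latin-hypercube-style sample built from independently shuffled columns. It is a stand-in, not the authors' UD construction (uniform design generally also aims for low discrepancy across the joint space); the names `balanced_sample` and `occupation_fractions` and the sample size are assumptions for illustration.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes
PEPTIDE_LENGTH = 4


def balanced_sample(copies_per_residue, seed=0):
    """Draw 20 * copies_per_residue tetrapeptides with equal residue usage at every position."""
    rng = random.Random(seed)
    columns = []
    for _ in range(PEPTIDE_LENGTH):
        # Each position uses every residue exactly `copies_per_residue` times.
        column = list(AMINO_ACIDS) * copies_per_residue
        rng.shuffle(column)  # shuffle each position independently
        columns.append(column)
    return ["".join(residues) for residues in zip(*columns)]


def occupation_fractions(peptides, position):
    """Fraction of peptides carrying each amino acid at the given position."""
    counts = Counter(peptide[position] for peptide in peptides)
    return {aa: counts[aa] / len(peptides) for aa in AMINO_ACIDS}


total_sequences = len(AMINO_ACIDS) ** PEPTIDE_LENGTH  # size of the full tetrapeptide space
sample = balanced_sample(copies_per_residue=5)        # 100 peptides (may contain repeats)
fractions_at_first_position = occupation_fractions(sample, position=0)
```

Because every column is a permutation of the same multiset, the per-position fractions are exactly 5% by construction, regardless of the seed; what this simple scheme does not control is how evenly the sample covers combinations of positions.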

Availability: The complete datasets, source code, and pre-trained models are available in the GitHub repository (https://github.com/JiaqiBenWang/UD-AI-Peptide) and on Zenodo (https://doi.org/10.5281/zenodo.17984124).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
iModMix: Integrative Module Analysis for Multi-omics Data.
IF 5.4 Pub Date : 2026-01-19 DOI: 10.1093/bioinformatics/btag030
Isis Narváez-Bandera, Ashley Lui, Yonatan Ayalew Mekonnen, Vanessa Rubio, Augustine Takyi, Noah Sulman, Christopher Wilson, Hayley D Ackerman, Oscar E Ospina, Guillermo Gonzalez-Calderon, Elsa Flores, Qian Li, Ann Chen, Brooke Fridley, Paul Stewart

Summary: Integrative Module Analysis for Multi-omics Data (iModMix) is a biology-agnostic framework that enables the discovery of novel associations across any type of quantitative abundance data, including but not limited to transcriptomics, proteomics, and metabolomics. Instead of relying on pathway annotations or prior biological knowledge, iModMix constructs data-driven modules using graphical lasso to estimate sparse networks from omics features. These modules are summarized into eigenfeatures and correlated across datasets for horizontal integration, while preserving the distinct feature sets and interpretability of each omics type. iModMix operates directly on matrices containing expression or abundances for a wide range of features, including but not limited to genes, proteins, and metabolites. Because it does not rely on annotations (e.g., KEGG identifiers), it can seamlessly incorporate both identified and unidentified metabolites, addressing a key limitation of many existing metabolomics tools. iModMix is available as a user-friendly R Shiny application requiring no programming expertise (https://imodmix.moffitt.org), and as a Bioconductor R package for advanced users (https://bioconductor.org/packages/release/bioc/html/iModMix.html). The tool includes several public and in-house datasets to illustrate its utility in identifying novel multi-omics relationships in diverse biological contexts.
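
The "summarize modules into eigenfeatures, then correlate across datasets" step can be sketched as follows. This is an illustration under stated assumptions, not iModMix code: the graphical-lasso module construction is skipped and module membership is given by hand, the eigenfeature is taken to be the first principal component of the standardized module features, and all names (`eigenfeature`, `transcript_module`, `metabolite_module`) and numbers are made up. Only the standard library is used so the sketch runs without extra packages.

```python
import math


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    return dot(standardize(x), standardize(y)) / (len(x) - 1)


def eigenfeature(module, iterations=200):
    """Per-sample scores on the first principal component of a module.

    `module` is a list of features, each a list with one value per sample.
    """
    z = [standardize(feature) for feature in module]
    n_samples = len(z[0])
    # Feature-by-feature correlation matrix of the module.
    corr = [[dot(a, b) / (n_samples - 1) for b in z] for a in z]
    # Power iteration for the leading eigenvector (the feature loadings).
    weights = [1.0 / (i + 1) for i in range(len(z))]
    for _ in range(iterations):
        product = [dot(row, weights) for row in corr]
        norm = math.sqrt(dot(product, product))
        weights = [p / norm for p in product]
    # Project each sample onto the loadings.
    return [dot(weights, [feature[s] for feature in z]) for s in range(n_samples)]


# Hypothetical data: 8 matched samples, both modules driven by one shared signal.
latent = [0.1, 0.9, 0.4, 1.5, -0.3, 0.7, -1.2, 0.2]
wiggle = [0.05, -0.05, 0.02, -0.02, 0.04, -0.04, 0.01, -0.01]
transcript_module = [
    [2.0 * v + w for v, w in zip(latent, wiggle)],
    [1.0 * v - w for v, w in zip(latent, wiggle)],
    [0.5 * v + w for v, w in zip(latent, wiggle)],
]
metabolite_module = [
    [3.0 * v + w for v, w in zip(latent, reversed(wiggle))],
    [1.5 * v - w for v, w in zip(latent, reversed(wiggle))],
]

module_correlation = pearson(eigenfeature(transcript_module), eigenfeature(metabolite_module))
```

Correlating one summary vector per module, rather than every feature pair, is what lets the two omics layers keep their own feature sets: a transcript module and a metabolite module only need to share samples, not identifiers.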

Availability and implementation: iModMix is freely available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/iModMix.html), as is the example dataset package iModMixData (https://bioconductor.org/packages/release/data/experiment/html/iModMixData.html). The R package source code and Docker are available from GitHub: https://github.com/biodatalab/iModMix. The Shiny application can be accessed at https://imodmix.moffitt.org.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
CADS: A Causal Inference Framework for Identifying Essential Genes to Enhance Drug Synergy Prediction.
IF 5.4 Pub Date : 2026-01-14 DOI: 10.1093/bioinformatics/btag010
Huaiwu Zhang, Xinliang Sun, Jianxin Wang, Min Li, Jing Tang

Motivation: Drug synergy is crucial for developing effective combination therapies, but traditional screening methods suffer from inefficiency and high costs. While deep learning shows promise for predicting drug synergy, current approaches using Transformers and graph neural networks focus on combining drug and cell line features without modelling how genes causally influence drug responses.

Results: To address this limitation, we propose CADS (Causal Adjustment for Drug Synergy), a deep learning framework that integrates causal relationships between genes and drug responses. Leveraging multi-omics data, CADS uses a learnable mask mechanism to identify key causal genes while filtering out irrelevant genetic factors through backdoor adjustment. Our model achieves two key objectives simultaneously: accurate prediction of drug synergy and interpretable causal gene discovery. Experiments on multiple datasets show that CADS consistently outperforms state-of-the-art methods across multiple metrics. Case studies demonstrate that CADS can reduce unnecessary complexity while providing more biological insights through its gene importance scores, which help identify clinically validated cancer-related genes that mediate drug interactions.
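
For readers unfamiliar with backdoor adjustment, the generic formula is P(Y | do(X = x)) = Σ_z P(Y | X = x, Z = z) P(Z = z), valid when Z blocks all backdoor paths between X and Y. The worked example below applies it to a tiny discrete table and contrasts it with the unadjusted conditional. It illustrates the textbook formula only, not the CADS network or its learnable mask; the variables, function names, and probabilities are all hypothetical.

```python
def backdoor_adjusted(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) = sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in p_z)


def naive_conditional(p_y_given_xz, p_z_given_x, x):
    """P(Y=1 | X=x) = sum over z of P(Y=1 | X=x, Z=z) * P(Z=z | X=x)."""
    return sum(p_y_given_xz[(x, z)] * p_z_given_x[x][z] for z in p_z_given_x[x])


# Hypothetical binary variables: X (feature of interest), Z (confounder), Y (outcome).
p_z = {0: 0.6, 1: 0.4}                      # marginal P(Z=z)
p_z_given_x = {1: {0: 0.9, 1: 0.1},         # P(Z=z | X=x); consistent with p_z when P(X=1)=0.5
               0: {0: 0.3, 1: 0.7}}
p_y_given_xz = {(1, 0): 0.8, (1, 1): 0.3,   # P(Y=1 | X=x, Z=z)
                (0, 0): 0.5, (0, 1): 0.2}

# Adjusted effect weights each stratum by P(Z=z): 0.60 - 0.38
adjusted_effect = backdoor_adjusted(p_y_given_xz, p_z, 1) - backdoor_adjusted(p_y_given_xz, p_z, 0)
# Naive effect weights each stratum by P(Z=z | X=x): 0.75 - 0.29
naive_effect = naive_conditional(p_y_given_xz, p_z_given_x, 1) - naive_conditional(p_y_given_xz, p_z_given_x, 0)
```

In this made-up table the naive difference (about 0.46) is roughly twice the adjusted one (about 0.22), because Z is unevenly distributed between X = 1 and X = 0; removing that kind of spurious contribution is the purpose of the adjustment.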

Availability and implementation: Taken together, CADS advances combination therapy prediction by explicitly modelling drug synergy causal genes, offering enhanced interpretability for AI-based drug development. The source code can be found at https://github.com/HuaiwuZhang/causalDC.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0