
Bioinformatics (Oxford, England): Latest Articles

X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration.
IF 5.4 Pub Date : 2026-01-27 DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang

Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.

Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.

Availability and implementation: The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).

Supplementary information: Supplementary data are available at Bioinformatics online.
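
The abstract describes X-intNMF only as a network-regularized NMF that uses cross-omics and intra-omics feature networks. As a rough illustration of that general idea, the sketch below implements a standard graph-regularized NMF with multiplicative updates. It is not the authors' algorithm: the objective, the update rules, the stacking of all omics features into one matrix `X`, and the single feature network `A` (intra-omics blocks on the diagonal, cross-omics blocks off the diagonal) are all assumptions made for this example.

```python
import numpy as np

def network_regularized_nmf(X, A, k, lam=0.1, n_iter=200, seed=0, eps=1e-10):
    """Factor X (features x samples) as W @ H with a feature-network penalty.

    Assumed objective: ||X - W H||_F^2 + lam * trace(W^T L W), with L = D - A.
    A is a symmetric, non-negative feature-feature adjacency matrix.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k)) + eps
    H = rng.random((k, n_samples)) + eps
    D = np.diag(A.sum(axis=1))  # degree matrix of the feature network
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
    return W, H

def objective(X, A, W, H, lam):
    """Reconstruction error plus the graph-Laplacian penalty on W."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)
```

The columns of `H` would then serve as the integrated low-dimensional sample representation; the penalty pulls connected features toward similar rows of `W`.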

Citations: 0
How Negative Sampling Shapes the Performance of Transcription Factor Binding Site Prediction Models.
IF 5.4 Pub Date : 2026-01-27 DOI: 10.1093/bioinformatics/btag048
Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman

Motivation: Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.

Results: In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell-line-specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although it still did not reach the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide-shuffled negatives performed poorly, even though this is common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.

Availability and implementation: The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).

Supplementary information: Supplementary data are available at Bioinformatics online.
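
Two of the ideas in the Results can be made concrete with a small sketch: treating accessible-but-unbound regions as true negatives, and generating negatives by shuffling a positive sequence. The code below is an illustration only. The tuple-based peak format, the function names, and the "any overlap" rule are assumptions for this example, not the pipeline from the paper, and the shuffle shown is a plain mononucleotide shuffle (it does not preserve dinucleotide counts).

```python
import random

def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def accessible_unbound(atac_peaks, chip_peaks):
    """Return ATAC-seq peaks that overlap no ChIP-seq peak (candidate true negatives)."""
    return [p for p in atac_peaks if not any(overlaps(p, c) for c in chip_peaks)]

def shuffled_negative(sequence, seed=0):
    """Return a random permutation of the sequence, preserving base composition."""
    bases = list(sequence)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

# Hypothetical toy peaks: one accessible region is bound, two are not.
atac = [("chr1", 100, 300), ("chr1", 500, 700), ("chr2", 100, 300)]
chip = [("chr1", 250, 400)]
negatives = accessible_unbound(atac, chip)
```

The quadratic scan is fine for a toy example; genome-scale peak sets would call for an interval index or a dedicated tool.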

Citations: 0
One-Hot News: Drug Synergy Models Shortcut Molecular Features.
IF 5.4 Pub Date : 2026-01-24 DOI: 10.1093/bioinformatics/btag040
Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan

Motivation: Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations; they take as input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.

Results: In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.

Implementation: The scripts to run the experiments are available at: https://github.com/tastanlab/ohe.

Supplementary information: Supplementary data are available at Bioinformatics online.
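
To make the one-hot substitution described in the Results concrete, here is a minimal sketch of replacing drug and cell line features with identity vectors for drug-drug-cell triplets. The identifiers, function names, and the concatenation order are hypothetical choices for this example; the authors' scripts at the repository above may encode inputs differently.

```python
import numpy as np

def one_hot_table(identifiers):
    """Map each unique identifier to a one-hot row vector."""
    unique = sorted(set(identifiers))
    eye = np.eye(len(unique))
    return {name: eye[i] for i, name in enumerate(unique)}

def encode_triplets(triplets, drug_table, cell_table):
    """Concatenate the drug A, drug B, and cell line vectors for each triplet."""
    return np.stack([
        np.concatenate([drug_table[a], drug_table[b], cell_table[c]])
        for a, b, c in triplets
    ])

# Hypothetical triplets: (drug A, drug B, cell line).
triplets = [("drugA", "drugB", "cell1"), ("drugA", "drugC", "cell2")]
drugs = one_hot_table([d for a, b, _ in triplets for d in (a, b)])
cells = one_hot_table([c for _, _, c in triplets])
X = encode_triplets(triplets, drugs, cells)
```

Such vectors carry no chemical or molecular information, so a model that performs equally well on `X` is using its inputs as identifiers, which is the behaviour the abstract reports.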

Citations: 0
GeoGAD: Geometry-Aware Antibody Design Framework for Complementarity-Determining Region Precision Engineering.
IF 5.4 Pub Date : 2026-01-24 DOI: 10.1093/bioinformatics/btag042
Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan

Motivation: Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance of precise antibody design in diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity, factors that collectively degrade prediction accuracy.

Results: To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.

Availability and implementation: The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.
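
For readers unfamiliar with the metrics named in the Results, the sketch below shows how amino acid recovery (AAR) and RMSD are commonly computed. It assumes the two sequences are already aligned position by position and that the two coordinate sets are already superimposed; the abstract does not state which atoms or alignment protocol the authors use, so treat this as a generic illustration rather than their evaluation code.

```python
import numpy as np

def amino_acid_recovery(designed, native):
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native) or len(native) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(d == n for d, n in zip(designed, native)) / len(native)

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates (no superposition)."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))
```

Higher AAR and lower RMSD indicate a design closer to the native CDR in sequence and structure, respectively.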

Citations: 0
Uchimata: a toolkit for visualization of 3D genome structures on the web and in computational notebooks.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag035
David Kou, Trevor Manz, Tereza Clarence, Nils Gehlenborg

Summary: Uchimata is a toolkit for visualization of 3D structures of genomes. It consists of two packages: a JavaScript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to integrate easily with biological tooling available in Python.

Availability and implementation: Uchimata is released under the MIT License. The JavaScript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is available publicly on GitHub (https://github.com/hms-dbmi/uchimata and https://github.com/hms-dbmi/uchimata-py) and Zenodo (https://doi.org/10.5281/zenodo.17831959 and https://doi.org/10.5281/zenodo.17832045). The documentation with examples is hosted at https://hms-dbmi.github.io/uchimata/.

Citations: 0
SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri

Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.

Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.

Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments. It is implemented in R/C++ and runs on Linux systems.

Supplementary information: Supplementary data are available at Bioinformatics online.
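
The Results mention distributing genetic relationship matrix (GRM) calculations across GPUs. The sketch below shows, on the CPU with NumPy, why that computation splits cleanly: the GRM is a sum of per-variant-block products, so each block can be processed independently and the partial results added. The standardization formula is the commonly used one and is an assumption here; this is not SAIGE-GPU's code, and the function name and `block_size` parameter are hypothetical.

```python
import numpy as np

def grm_blockwise(genotypes, block_size=2):
    """Accumulate a GRM from blocks of variants.

    genotypes: (n_samples, n_variants) array of allele counts in {0, 1, 2}.
    Each block's contribution Z @ Z.T is independent of the others, so in a
    multi-GPU setting each block could be handled by a different device.
    """
    G = np.asarray(genotypes, dtype=float)
    n_samples, n_variants = G.shape
    grm = np.zeros((n_samples, n_samples))
    used = 0
    for start in range(0, n_variants, block_size):
        block = G[:, start:start + block_size]
        freq = block.mean(axis=0) / 2.0          # allele frequency per variant
        keep = (freq > 0) & (freq < 1)           # drop monomorphic variants
        block, freq = block[:, keep], freq[keep]
        Z = (block - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
        grm += Z @ Z.T                           # partial sum for this block
        used += Z.shape[1]
    if used == 0:
        raise ValueError("no polymorphic variants to build the GRM from")
    return grm / used
```

Because the partial sums commute, the result does not depend on how variants are partitioned, which is what makes the work easy to spread across devices.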

Citations: 0
Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu

Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.

Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.

Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).

Supplementary information: Supplementary data are available at Bioinformatics online.
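
The cause-specific hazard component named in the abstract above can be illustrated with a small numerical sketch. This is not the JM-NCC implementation: the event names, coefficients, time-constant baseline hazards, and the choice to link each hazard to a single fixed biomarker value are all assumptions made for illustration, and the longitudinal mixed-effects submodel is omitted.

```python
import math
from typing import Dict


def cause_specific_hazards(
    baseline: Dict[str, float],
    gamma: Dict[str, float],
    alpha: Dict[str, float],
    covariate: float,
    biomarker: float,
) -> Dict[str, float]:
    """Hazard per competing event: h_k = h0_k * exp(gamma_k * w + alpha_k * m).

    All inputs are hypothetical; m is a biomarker value held fixed over time.
    """
    return {
        k: baseline[k] * math.exp(gamma[k] * covariate + alpha[k] * biomarker)
        for k in baseline
    }


def cumulative_incidence(hazards: Dict[str, float], t: float) -> Dict[str, float]:
    """Cumulative incidence by time t when every hazard is constant in time.

    With constant hazards, P(event k by t) = (h_k / H) * (1 - exp(-H * t)),
    where H is the sum of all cause-specific hazards.
    """
    total = sum(hazards.values())
    if total == 0:
        return {k: 0.0 for k in hazards}
    any_event = 1.0 - math.exp(-total * t)
    return {k: h / total * any_event for k, h in hazards.items()}


# Hypothetical example values for two competing events.
baseline = {"event_A": 0.10, "event_B": 0.05}
gamma = {"event_A": 0.5, "event_B": 0.0}
alpha = {"event_A": 0.8, "event_B": -0.2}

hz = cause_specific_hazards(baseline, gamma, alpha, covariate=1.0, biomarker=0.5)
cif = cumulative_incidence(hz, t=5.0)
print(hz)
print(cif)
```

The point of the sketch is only that each competing event gets its own hazard and its own biomarker association, so a biomarker can raise the risk of one event while lowering the risk of another.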

Citations: 0
De novo protein ligand design including protein flexibility and conformational adaptation.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias

Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current molecular docking approaches, as well as fragment-based build-up and machine-learning-based generative drug design approaches, employ a rigid protein target structure.

Results: Based on recent progress in predicting protein structures and complexes with chemical compounds, we have designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte-Carlo (MC) type simulation that randomly changes a chemical compound, the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility, this allows the protein to adapt to the chemically changing compound. MC protocols based on atom/bond type changes or on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands whose binding scores, under several different scoring schemes, are very good and comparable to those of experimentally known binders. The MC-based compound design approach is complementary to existing approaches and could help with the rapid design of putative binders, including induced fit of the protein target.

Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https://github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.
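
The MC procedure described in the abstract above (propose a change to the compound, rebuild the complex, score it, accept or reject) can be sketched as a generic loop. This is not the AI-MCLig code: the Metropolis acceptance rule is an assumption, since the abstract only says "Monte-Carlo type", and `propose` and `rebuild_and_score` are hypothetical callbacks. In a real setting the second callback would wrap a structure prediction run and a scoring scheme; here it is replaced by a toy string score so the sketch runs on its own.

```python
import math
import random
from typing import Callable, Tuple


def metropolis_design(
    start: str,
    propose: Callable[[str, random.Random], str],
    rebuild_and_score: Callable[[str], float],
    steps: int = 100,
    temperature: float = 1.0,
    seed: int = 0,
) -> Tuple[str, float]:
    """Return the best-scoring compound visited; lower scores count as better."""
    rng = random.Random(seed)
    current, current_score = start, rebuild_and_score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        # Hypothetical callback: rebuild the whole complex for the candidate
        # and return a binding score for it.
        candidate_score = rebuild_and_score(candidate)
        delta = candidate_score - current_score
        # Metropolis rule (an assumption): always accept improvements,
        # accept worse candidates with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_propose(compound: str, rng: random.Random) -> str:
    """Toy stand-in for an atom-type change: replace one random character."""
    i = rng.randrange(len(compound))
    return compound[:i] + rng.choice("CNO") + compound[i + 1:]


def toy_score(compound: str) -> float:
    """Toy stand-in for a binding score: count of characters that are not 'N'."""
    return float(sum(ch != "N" for ch in compound))


best, best_score = metropolis_design(
    "CCCCCC", toy_propose, toy_score, steps=200, temperature=0.5
)
print(best, best_score)
```

Because the scoring callback is invoked once per step, the cost of such a loop is dominated by how expensive each rebuild is, which is why the number of steps is kept as an explicit parameter.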

Citations: 0
Application of qualifying variants for genomic analysis.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btaf676
Dylan Lawless, Ali Saadat, Mariam Ait Oumelloul, Simon Boutry, Veronika Stadler, Sabine Österle, Jan Armida, David Haerry, D Sean Froese, Luregn J Schlapbach, Jacques Fellay

Motivation: Qualifying variants (QVs) are genomic alterations selected by defined criteria within analysis pipelines. Although crucial for both research and clinical diagnostics, QVs are often seen as simple filters rather than dynamic elements that influence the entire workflow. In practice these rules are embedded within pipelines, which hinders transparency, audit, and reuse across tools. A unified, portable specification for QV criteria is needed.

Results: Our aim is to embed the concept of a "QV" into the genomic analysis vernacular, moving beyond its treatment as a single filtering step. By decoupling QV criteria from pipeline variables and code, the framework enables clearer discussion, application, and reuse. It provides a flexible reference model for integrating QVs into analysis pipelines, improving reproducibility, interpretability, and interdisciplinary communication. Validation across diverse applications confirmed that QV-based workflows match conventional methods while offering greater clarity and scalability.

Availability: The source code and data are accessible at the Zenodo repository https://doi.org/10.5281/zenodo.17414191. Manuscript files are available at https://github.com/DylanLawless/qvApp2025lawless. The QV framework is available under the MIT licence, and the dataset will be maintained for at least two years following publication.
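
The idea of decoupling QV criteria from pipeline code, as described in the abstract above, can be illustrated with a minimal sketch in which the criteria are plain data and the pipeline only interprets them. This is not the QV framework's actual specification: the rule format, field names (`allele_frequency`, `impact`, `depth`), and thresholds are hypothetical and chosen only for illustration.

```python
import operator
from typing import Any, Callable, Dict, List

# Comparison operators that a rule is allowed to name.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "in": lambda value, allowed: value in allowed,
}

# Hypothetical criteria set, kept as data so it can be stored in a file,
# reviewed, versioned, and reused without touching pipeline code.
QV_CRITERIA: List[Dict[str, Any]] = [
    {"field": "allele_frequency", "op": "<", "value": 0.01},
    {"field": "impact", "op": "in", "value": ["HIGH", "MODERATE"]},
    {"field": "depth", "op": ">=", "value": 10},
]


def is_qualifying(variant: Dict[str, Any], criteria: List[Dict[str, Any]]) -> bool:
    """A variant qualifies only if it has every field and passes every rule."""
    return all(
        rule["field"] in variant
        and OPS[rule["op"]](variant[rule["field"]], rule["value"])
        for rule in criteria
    )


def select_qualifying(
    variants: List[Dict[str, Any]], criteria: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return the variants that satisfy the full criteria set."""
    return [v for v in variants if is_qualifying(v, criteria)]


# Hypothetical variant records.
variants = [
    {"id": "v1", "allele_frequency": 0.001, "impact": "HIGH", "depth": 35},
    {"id": "v2", "allele_frequency": 0.2, "impact": "HIGH", "depth": 40},
    {"id": "v3", "allele_frequency": 0.002, "impact": "LOW", "depth": 50},
    {"id": "v4", "allele_frequency": 0.005, "impact": "MODERATE", "depth": 5},
]

ids = [v["id"] for v in select_qualifying(variants, QV_CRITERIA)]
print(ids)
```

Keeping the rules as data means two pipelines can be compared by comparing their criteria sets directly, rather than by reading through each pipeline's filtering code.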

Citations: 0
Unavailability of experimental 3D structural data on protein folding dynamics and necessity for a new generation of structure prediction methods in this context.
IF 5.4 Pub Date : 2026-01-20 DOI: 10.1093/bioinformatics/btag020
Aydin Wells, Khalique Newaz, Jennifer Morones, Jianlin Cheng, Tijana Milenković

Motivation: Protein folding is a dynamic process during which a protein's amino acid sequence undergoes a series of 3-dimensional (3D) conformational changes en route to reaching a native 3D structure; these conformations are called folding intermediates. While data on native 3D structures are abundant, data on 3D structures of non-native intermediates remain sparse, due to limitations of current technologies for experimental determination of 3D structures. Yet, analyzing folding intermediates is crucial for understanding folding dynamics and misfolding-related diseases. Hence, we search the literature for available (experimentally and computationally obtained) 3D structural data on folding intermediates, organizing the data in a centralized resource. Also, we assess whether existing methods, designed for predicting native structures, can be utilized to predict structures of non-native intermediates.

Results: Our literature search reveals six studies that provide 3D structural data on folding intermediates (two for post-translational and four for co-translational folding), each focused on a single protein, with 2-4 intermediates. Our assessment shows that an established method for predicting native structures, AlphaFold2, does not perform well for non-native intermediates in the context of co-translational folding; a recent study on post-translational folding concluded the same for even more existing methods. Yet, we identify in the literature recent pioneering methods designed explicitly to predict 3D structures of folding intermediates by incorporating intrinsic biophysical characteristics of folding dynamics, which show promise. This study assesses the current landscape and future directions of the field of 3D structural analysis of protein folding dynamics.

Availability and implementation: https://github.com/Aywells/3Dpfi or https://www3.nd.edu/~cone/3Dpfi.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0