Bioinformatics (Oxford, England): Latest Articles

Using semantic search to find publicly available gene-expression datasets.
IF 5.4 Pub Date: 2026-02-02 DOI: 10.1093/bioinformatics/btag053
Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo

Motivation: Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, finding relevant datasets is a major challenge because of the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge, findability, is the first of the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing the results of such queries is time-consuming and tedious, and the queries often miss relevant datasets.

Results: We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
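
As a rough illustration of the retrieval step described in the Results, the sketch below ranks candidate datasets by cosine similarity between their embeddings and the centroid of the seed datasets' embeddings. The embeddings are toy vectors, the series identifiers are placeholders, and averaging the seeds into a centroid is an assumption made for illustration; the abstract does not say how similarities were aggregated.

```python
import numpy as np

def rank_by_similarity(seed_embeddings, candidate_embeddings, candidate_ids):
    """Rank candidates by cosine similarity to the centroid of the seed embeddings."""
    seeds = np.asarray(seed_embeddings, dtype=float)
    cands = np.asarray(candidate_embeddings, dtype=float)
    centroid = seeds.mean(axis=0)                  # assumed aggregation of the seeds
    centroid /= np.linalg.norm(centroid)
    cands_unit = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands_unit @ centroid                 # cosine similarity per candidate
    order = np.argsort(-scores)                    # most similar first
    return [(candidate_ids[i], float(scores[i])) for i in order]

# Toy 3-dimensional embeddings; real language models produce far larger vectors.
seeds = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
candidates = [[0.1, 0.9, 0.2], [0.85, 0.15, 0.05], [0.5, 0.5, 0.5]]
ids = ["series_A", "series_B", "series_C"]         # hypothetical identifiers
ranking = rank_by_similarity(seeds, candidates, ids)
```

In practice the vectors would come from whichever embedding model is being evaluated, and the top of the ranking would be reviewed alongside the results of a conventional keyword search.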

Availability: Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
A spectral dimension reduction technique that improves pattern detection in multivariate spatial data.
IF 5.4 Pub Date: 2026-01-31 DOI: 10.1093/bioinformatics/btag052
David Köhler, Niklas Kleinenkuhnen, Kiarash Rastegar, Till Baar, Chrysa Nikopoulou, Vangelis Kondylis, Vlada Milchevskaya, Matthias Schmid, Peter Tessarz, Achim Tresch

Motivation: We introduce a statistical approach for pattern recognition in multivariate spatial transcriptomics data.

Results: Our algorithm constructs a projection of the data onto a low-dimensional feature space which is optimal in maximising Moran's I, a measure of spatial dependency. This projection mitigates non-spatial variation and outperforms principal components analysis for pre-processing. Patterns of spatially variable genes are well represented in this feature space, and their projection can be shown to be a denoising operation. Our framework does not require any parameter tuning, and it furthermore gives rise to a calibrated, powerful test of spatial gene expression.
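
To make the idea of a projection that maximizes Moran's I concrete, here is a minimal sketch that is not taken from the SpaCo package. It assumes a data matrix `X` (spots by features) and a spatial weight matrix `W`, and it treats the problem as a generalized eigenproblem on the ratio of spatial to ordinary covariance. The function names and the whitening approach are illustrative choices; the published algorithm may differ in its details.

```python
import numpy as np

def morans_i(z, W):
    """Moran's I of a 1-D signal z under a spatial weight matrix W."""
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

def max_morans_projection(X, W, k=2):
    """Directions v (columns) that maximize Moran's I of the projection X @ v.

    X: (n_spots, n_features) data matrix; W: (n_spots, n_spots) spatial weights.
    Assumes n_spots > n_features and linearly independent features.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    Ws = 0.5 * (W + W.T)                     # symmetrizing leaves z'Wz unchanged
    A = Xc.T @ Ws @ Xc                       # spatial covariance (numerator)
    B = Xc.T @ Xc                            # ordinary covariance (denominator)
    # Solve the generalized eigenproblem A v = lambda B v by whitening with B.
    L_inv = np.linalg.inv(np.linalg.cholesky(B))
    vals, U = np.linalg.eigh(L_inv @ A @ L_inv.T)
    order = np.argsort(vals)[::-1][:k]       # largest eigenvalues first
    V = L_inv.T @ U[:, order]
    scores = vals[order] * X.shape[0] / W.sum()   # Moran's I of each projection
    return V, scores
```

Because each original feature is itself a candidate direction, the leading projection has a Moran's I at least as large as that of any single feature, which is one way to see why non-spatial variation is suppressed.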

Availability and implementation: The algorithm is implemented in the open source software R and is available at https://github.com/IMSBCompBio/SpaCo.

Citations: 0
ProteoGyver: A Fast, User-Friendly Tool for Routine QC and Analysis of MS-based Proteomics Data.
IF 5.4 Pub Date: 2026-01-30 DOI: 10.1093/bioinformatics/btag050
Kari Salokas, Salla Keskitalo, Markku Varjosalo

Mass spectrometry-based proteomics generates increasingly large datasets requiring rapid quality control (QC) and preliminary analysis. Current software solutions often require specialized knowledge, limiting their routine use. We developed ProteoGyver (PG), an accessible, lightweight software solution designed for rapid QC and preliminary proteomics data analysis. PG provides automated QC metrics, intuitive graphical reports, and streamlined workflows for whole-proteome and interactomics datasets, significantly lowering the barrier to regular QC practices. The platform includes additional tools such as MS Inspector for longitudinal chromatogram inspection and Colocalizer for microscopy data. PG is easily deployed as a Docker container or standalone Python installation. PG is open source; the image is freely available on Docker Hub and the source code on GitHub at github.com/varjolab/Proteogyver.

Availability: The PG image and source code are available on GitHub and Docker Hub under LGPL-2.1.

Citations: 0
X-intNMF: A Cross- and Intra-Omics Regularized NMF Framework for Multi-Omics Integration.
IF 5.4 Pub Date: 2026-01-27 DOI: 10.1093/bioinformatics/btag046
Tien-Thanh Bui, Rui Xie, Wei Zhang

Motivation: The rapid accumulation of multi-omics data presents a valuable opportunity to advance our understanding of complex diseases and biological systems, driving the development of integrative computational methods. However, the complexity of biological processes, spanning multiple molecular layers and involving intricate regulatory interactions, requires models that can capture both intra- and cross-omics relationships. Most existing integration methods primarily focus on sample-level similarities or intra-omics feature interactions, often neglecting the interactions across different omics layers. This limitation can result in the loss of critical biological information and suboptimal performance. To address this gap, we propose X-intNMF, a network-regularized non-negative matrix factorization (NMF) model that explicitly incorporates both cross-omics and intra-omics feature interaction networks during multi-omics integration. By modeling these multi-layered relationships, X-intNMF enhances the representation of biological interactions and improves integration quality and prediction accuracy.

Results: For evaluation, we applied X-intNMF to predict breast cancer phenotypes and classify clinical outcomes in lung and ovarian cancers using mRNA expression, microRNA expression, and DNA methylation data from TCGA. The results show that X-intNMF consistently outperforms state-of-the-art methods. Ablation studies confirm that incorporating both cross-omics and intra-omics interactions contributes significantly to the model's improved performance. Additionally, survival analysis on 25 TCGA cancer datasets demonstrates that the integrated multi-omics representation offers strong prognostic value for both overall survival and disease-free status. These findings highlight X-intNMF's ability to effectively model multi-layered molecular interactions while maintaining interpretability, robustness, and scalability within the NMF framework.
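
For readers unfamiliar with network-regularized NMF, the sketch below shows the generic form of the technique rather than the X-intNMF model itself. It assumes a single non-negative matrix `X` (features by samples) and one symmetric feature-interaction network `A`, and it minimizes the reconstruction error plus a Laplacian penalty on the feature factor using standard multiplicative updates. The function name, the single-network setup, and the parameter `lam` are illustrative assumptions.

```python
import numpy as np

def network_regularized_nmf(X, A, k, lam=0.1, n_iter=200, seed=0):
    """Factorize X ~ W @ H with a graph penalty lam * trace(W.T @ (D - A) @ W).

    X: (n_features, n_samples) non-negative data.
    A: (n_features, n_features) symmetric, non-negative feature network.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k))
    H = rng.random((k, n_samples))
    D = np.diag(A.sum(axis=1))               # degree matrix of the network
    laplacian = D - A
    eps = 1e-10                               # guards against division by zero
    history = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ H @ H.T + lam * (D @ W) + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        objective = (np.linalg.norm(X - W @ H) ** 2
                     + lam * np.trace(W.T @ laplacian @ W))
        history.append(float(objective))
    return W, H, history
```

The penalty pulls the factor rows of connected features toward each other; a model with both cross-omics and intra-omics networks would need additional terms that this sketch does not attempt to reproduce.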

Availability and implementation: The source code and datasets supporting this study are publicly available at GitHub (https://github.com/compbiolabucf/X-intNMF) and archived on Zenodo (https://doi.org/10.5281/zenodo.18238385).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
How Negative Sampling Shapes the Performance of Transcription Factor Binding Site Prediction Models.
IF 5.4 Pub Date: 2026-01-27 DOI: 10.1093/bioinformatics/btag048
Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman

Motivation: Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.

Results: In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. We then train models using various negative sampling techniques, including genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell-line-specific sampling, simulating cases where matching ATAC-seq data is not available. Our results show that, generally, metrics calculated on training datasets give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although it still did not reach the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide-shuffled negatives performed poorly, even though dinucleotide shuffling is a common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.
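
As a small illustration of one of the techniques listed above, the following sketch generates negatives by plain (mononucleotide) shuffling of positive sequences. The sequences are made up, and the helper names are hypothetical rather than taken from the study's repository. Dinucleotide shuffling, which additionally preserves dinucleotide counts, needs a more involved algorithm and is not shown.

```python
import random
from collections import Counter

def shuffled_negative(seq, rng):
    """Return a mononucleotide-shuffled copy of seq (same base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def dinucleotide_counts(seq):
    """Count overlapping dinucleotides, e.g. to inspect what shuffling changes."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

rng = random.Random(0)                                  # fixed seed for reproducibility
positives = ["ACGTGCGCGATATCGC", "TTGACGTCAAGGCGCG"]    # toy sequences, not real peaks
negatives = [shuffled_negative(s, rng) for s in positives]
```

Comparing `dinucleotide_counts` for a positive and its shuffled negative shows how much local sequence structure a plain shuffle can disturb, even though base composition is preserved exactly.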

Availability and implementation: The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
One-Hot News: Drug Synergy Models Shortcut Molecular Features.
IF 5.4 Pub Date: 2026-01-24 DOI: 10.1093/bioinformatics/btag040
Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercument Cicek, Oznur Tastan

Motivation: Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, taking as input structural and chemical information on drugs and molecular features of cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.

Results: In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.
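
The one-hot replacement described above can be pictured with a short sketch. The drug and cell-line names are placeholders, and concatenating the three one-hot vectors is an assumed layout for illustration; the published experiments may arrange their inputs differently.

```python
def build_index(items):
    """Map each distinct name to a fixed position, in sorted order."""
    return {name: i for i, name in enumerate(sorted(set(items)))}

def one_hot(index, name):
    """Vector with a single 1 at the position assigned to name."""
    vec = [0] * len(index)
    vec[index[name]] = 1
    return vec

def encode_triplet(drug_index, cell_index, drug_a, drug_b, cell_line):
    """Concatenate identity-only encodings for a drug-drug-cell triplet."""
    return (one_hot(drug_index, drug_a)
            + one_hot(drug_index, drug_b)
            + one_hot(cell_index, cell_line))

drug_index = build_index(["drugA", "drugB", "drugC"])   # hypothetical drug names
cell_index = build_index(["cell1", "cell2"])            # hypothetical cell lines
features = encode_triplet(drug_index, cell_index, "drugA", "drugC", "cell2")
```

Such a vector carries no chemical or molecular information at all, only identity, so a model that performs equally well on it cannot be relying on the intended features.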

Implementation: The scripts to run the experiments are available at: https://github.com/tastanlab/ohe.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
GeoGAD: Geometry-Aware Antibody Design Framework for Complementarity-Determining Region Precision Engineering.
IF 5.4 Pub Date: 2026-01-24 DOI: 10.1093/bioinformatics/btag042
Songjian Wei, Jinxiong Zhang, Yan Chen, Chunyan Tang, Jiayang Tan

Motivation: Antibodies, as pivotal effector molecules of the immune system, neutralize pathogens through specific binding to antigens mediated by complementarity-determining regions (CDRs), highlighting the critical importance of precise antibody design in diagnostics and therapeutics. Despite significant advances in CDR design, current methods remain limited by inadequate modeling of geometric constraints, omission of multi-scale spatial relationships, and insufficient conformational representation capacity; together, these factors degrade prediction accuracy.

Results: To overcome these limitations, we present GeoGAD, a geometry-aware antibody design framework with Gaussian attention mechanisms. Key innovations include: (1) the introduction of rotational positional encoding to enhance geometric sensitivity; (2) a geometry-aware module that integrates multi-scale spatial features through dynamic message passing, adaptive edge refinement, and multi-edge-type coordinate optimization; and (3) a Gaussian attention mechanism that employs an edge-type-sensitive spatial Gaussian kernel to model long-range sequence dependencies, enabling focused attention on local critical residues while preserving global contextual modeling. Experimental evaluations demonstrate that GeoGAD achieves superior or comparable performance to state-of-the-art models across antibody sequence-structure co-modeling, CDR design, and affinity optimization benchmarks, particularly excelling in amino acid recovery rates (AAR) and structural accuracy metrics (RMSD, TM-score). By enhancing the design precision of antibody CDR regions, GeoGAD offers a geometrically consistent framework for the computational design of therapeutic antibodies.
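
The general idea behind a distance-based Gaussian attention term can be sketched as follows. This is a generic illustration, not GeoGAD's implementation: it adds a Gaussian penalty on squared Euclidean distance between residue coordinates to ordinary scaled dot-product attention, and it ignores edge types entirely. The function name and the parameter `sigma` are assumptions.

```python
import numpy as np

def gaussian_biased_attention(Q, K, V, coords, sigma=4.0):
    """Scaled dot-product attention with a Gaussian spatial bias.

    Q, K: (n, d) queries and keys; V: (n, d_v) values;
    coords: (n, 3) positions, e.g. one representative atom per residue.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # Squared pairwise distances between all positions.
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    # Adding log of a Gaussian kernel down-weights spatially distant pairs.
    logits = logits - dist2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over keys
    return weights @ V, weights
```

A small `sigma` concentrates attention on spatial neighbours, while a large `sigma` approaches ordinary attention, which is one simple way to trade local focus against global context.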

Availability and implementation: The source code and implementation are available at https://github.com/WeiSongJian/GeoGAD, and the archival version for this manuscript is deposited at https://doi.org/10.5281/zenodo.18073443.
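
The Gaussian attention idea in item (3) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the single shared `sigma`, and the choice to add the log of the kernel to the raw scores before the softmax are all assumptions made for illustration. The abstract states only that a spatial Gaussian kernel, sensitive to edge type, shapes the attention.

```python
import math

def gaussian_attention_weights(scores, distances, sigma):
    """Softmax over raw scores biased by a spatial Gaussian kernel.

    scores[j]    : raw attention score from a query residue to residue j
    distances[j] : spatial distance between the query residue and residue j
    sigma        : kernel width (hypothetical; could be chosen per edge type)
    """
    # Adding log(exp(-d^2 / (2 sigma^2))) to each score down-weights distant residues.
    biased = [s - (d * d) / (2.0 * sigma * sigma) for s, d in zip(scores, distances)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Equal raw scores: the weights then depend only on distance.
weights = gaussian_attention_weights([0.0, 0.0, 0.0], [1.0, 4.0, 12.0], sigma=4.0)
```

With equal raw scores, nearer residues receive larger weights, while distant residues keep a small nonzero weight. This matches the stated aim of focusing on local residues without discarding global context.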

Citations: 0
SAIGE-GPU - Accelerating Genome- and Phenome-Wide Association Studies using GPUs.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, J Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri

Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.

Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2,068 phenotypes from 635,969 participants in the Million Veteran Program (MVP), including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.

Availability: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is available in a containerized format for use across HPC and cloud environments; it is implemented in R/C++ and runs on Linux systems.

Supplementary information: Supplementary data are available at Bioinformatics online.
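
To make the idea of distributing genetic relationship matrix (GRM) calculations concrete, the following is a minimal CPU-only sketch of one way such work can be split: each worker handles a chunk of variants and returns a partial GRM-times-vector product, and the partial results are summed. How SAIGE-GPU actually partitions the work is not stated in the abstract, so the variant-wise split, the function names, and the toy data are assumptions.

```python
import math

def standardize(genotypes):
    """Center and scale one variant's 0/1/2 genotypes; None if monomorphic."""
    p = sum(genotypes) / (2.0 * len(genotypes))  # allele frequency
    if p <= 0.0 or p >= 1.0:
        return None
    sd = math.sqrt(2.0 * p * (1.0 - p))
    return [(g - 2.0 * p) / sd for g in genotypes]

def partial_product(chunk, vec):
    """Sum of z * (z . vec) over the variants in one chunk (one worker's share)."""
    acc = [0.0] * len(vec)
    used = 0
    for genotypes in chunk:
        z = standardize(genotypes)
        if z is None:
            continue
        dot = sum(zi * vi for zi, vi in zip(z, vec))
        for i in range(len(vec)):
            acc[i] += z[i] * dot
        used += 1
    return acc, used

def grm_times_vector(chunks, vec):
    """Combine per-chunk partial sums into (1/M) * Z Z^T vec."""
    total = [0.0] * len(vec)
    m = 0
    for chunk in chunks:
        acc, used = partial_product(chunk, vec)
        total = [t + a for t, a in zip(total, acc)]
        m += used
    if m == 0:
        raise ValueError("no polymorphic variants")
    return [t / m for t in total]

# Toy data: 4 variants (rows) genotyped in 3 samples.
variants = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 1]]
vec = [1.0, -2.0, 0.5]
one_worker = grm_times_vector([variants], vec)
two_workers = grm_times_vector([variants[:2], variants[2:]], vec)
```

Because the product is a sum over variants, splitting the variants across workers gives the same result up to floating-point rounding. That property is what makes this kind of calculation straightforward to distribute.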

Citations: 0
Joint Modeling of Longitudinal Biomarker and Survival Outcomes with the Presence of Competing Risk in the Nested Case-Control Studies with Application to the TEDDY Microbiome Dataset.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu

Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.

Results: Motivated by the TEDDY study, we propose "JM-NCC", a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.

Availability: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).

Supplementary information: Supplementary data are available at Bioinformatics online.
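
For orientation, a generic shared-parameter joint model of the kind described above can be written as follows. This is a textbook-style sketch, not the JM-NCC specification: every symbol is an assumption, and linking the two submodels through the current linear predictor is only one common choice that the abstract does not confirm.

```latex
% Longitudinal submodel (GLMM) for subject i at time t, with link function g:
g\bigl(\mathbb{E}[Y_i(t) \mid b_i]\bigr) = \eta_i(t) = x_i(t)^{\top}\beta + z_i(t)^{\top} b_i,
\qquad b_i \sim \mathcal{N}(0, D)

% Cause-specific hazard for competing event k = 1, \dots, K:
\lambda_{ik}(t) = \lambda_{0k}(t)\,\exp\bigl(w_i^{\top}\gamma_k + \alpha_k\,\eta_i(t)\bigr)
```

Here $\beta$ and $b_i$ are fixed and random effects, $\lambda_{0k}$ is the baseline hazard for cause $k$, $\gamma_k$ are baseline covariate effects, and $\alpha_k$ measures how strongly the biomarker trajectory is associated with the hazard of cause $k$.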

Citations: 0
De novo protein ligand design including protein flexibility and conformational adaptation.
IF 5.4 Pub Date : 2026-01-22 DOI: 10.1093/bioinformatics/btag027
Jakob Agamia, Martin Zacharias

Motivation: The rational design of chemical compounds that bind to a desired protein target molecule is a major goal of drug discovery. Most current approaches, whether molecular docking, fragment-based build-up, or machine-learning-based generative drug design, employ a rigid protein target structure.

Results: Building on recent progress in predicting the structures of proteins and their complexes with chemical compounds, we designed an approach, AI-MCLig, to optimize a chemical compound bound to a fully flexible and conformationally adaptable protein binding region. During a Monte Carlo (MC) type simulation that randomly changes the chemical compound, the target protein-compound complex is completely rebuilt at every MC step using the Chai-1 protein structure prediction program. Besides compound flexibility, this allows the protein to adapt to the chemically changing compound. MC protocols based on atom/bond type changes or on combining larger chemical fragments have been tested. Simulations on three test targets resulted in potential ligands with very good binding scores, comparable to those of experimentally known binders, under several different scoring schemes. The MC-based compound design approach is complementary to existing approaches and could help in the rapid design of putative binders, including induced fit of the protein target.

Availability and implementation: Datasets, examples and source code are available on our public GitHub repository https://github.com/JakobAgamia/AI-MCLig and on Zenodo at https://doi.org/10.5281/zenodo.17800140.
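
The MC procedure described in the Results can be sketched as a generic propose-score-accept loop. The sketch below is illustrative only. The Metropolis acceptance rule is an assumption, since the abstract does not state the acceptance criterion. `propose` and `score` are hypothetical placeholders for the compound-modification step and for rebuilding and scoring the complex, and here they are replaced by a toy integer problem.

```python
import math
import random

def metropolis_search(start, propose, score, steps, temperature, seed=0):
    """Generic Metropolis loop; lower scores are treated as better."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)   # e.g. change an atom/bond type or swap a fragment
        candidate_score = score(candidate)  # e.g. rebuild the complex and score the binding
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy stand-ins: the "compound" is an integer and the "binding score" is its distance from 7.
def toy_propose(x, rng):
    return x + rng.choice([-1, 1])

def toy_score(x):
    return (x - 7) ** 2

best, best_score = metropolis_search(0, toy_propose, toy_score, steps=200, temperature=1.0)
```

In a real application the scoring call would dominate the cost, because each step would require a full structure prediction of the complex. The loop structure itself stays this simple.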

Citations: 0