Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data.

Communications Chemistry (IF 6.2; Zone 2, Chemistry; Q1 Chemistry, Multidisciplinary) Pub Date: 2025-02-08 DOI: 10.1038/s42004-025-01428-y

Ísak Valsson, Matthew T Warren, Charlotte M Deane, Aniket Magarkar, Garrett M Morris, Philip C Biggin

Communications Chemistry, volume 8, issue 1, article 41. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807228/pdf/

Citations: 0

Abstract

Machine learning offers great promise for fast and accurate binding affinity predictions. However, current models lack robust evaluation and fail on tasks encountered in (hit-to-)lead optimisation, such as ranking the binding affinities of a congeneric series of ligands, which limits their application in drug discovery. Here, we address these issues by first introducing a novel attention-based graph neural network model called AEV-PLIG (atomic environment vector-protein ligand interaction graph). Second, we introduce a new and more realistic out-of-distribution test set called the OOD Test. We benchmark our model on this set, on CASF-2016, and on a test set used for free energy perturbation (FEP) calculations. This benchmarking not only highlights the competitive performance of AEV-PLIG but also provides a realistic assessment of machine learning models against rigorous physics-based approaches. Moreover, we demonstrate how leveraging augmented data (generated using template-based modelling or molecular docking) can significantly improve binding affinity prediction correlation and ranking on the FEP benchmark: the weighted mean Pearson correlation coefficient (PCC) and Kendall's τ increase from 0.41 and 0.26 to 0.59 and 0.42, respectively. Together, these strategies are closing the performance gap with FEP calculations (FEP+ achieves a weighted mean PCC and Kendall's τ of 0.68 and 0.49, respectively, on the FEP benchmark) while being ~400,000 times faster.
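
The abstract reports weighted mean PCC and Kendall's τ over the FEP benchmark but does not say how the mean is weighted. The sketch below shows one plausible reading, in which each metric is computed per congeneric series and then averaged with weights equal to the number of ligands in the series. That weighting, the tau-a variant of Kendall's τ, the helper names, and the `gap_closed` ratio are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def weighted_mean_metric(series, metric):
    """Average a per-series metric, weighting by ligand count (assumed weighting).

    `series` is a list of (experimental, predicted) affinity pairs,
    one pair of sequences per congeneric series.
    """
    total = sum(len(experimental) for experimental, _ in series)
    return sum(
        len(experimental) * metric(experimental, predicted)
        for experimental, predicted in series
    ) / total


def gap_closed(before, after, reference):
    """Fraction of the gap to a reference score that an improvement covers."""
    return (after - before) / (reference - before)


if __name__ == "__main__":
    # Toy affinities invented for illustration; not data from the paper.
    toy_series = [
        ([6.1, 7.3, 8.0, 8.4], [6.5, 7.0, 7.9, 8.8]),
        ([5.2, 5.9, 6.7], [5.8, 5.5, 6.9]),
    ]
    print("weighted mean PCC:", weighted_mean_metric(toy_series, pearson))
    print("weighted mean tau:", weighted_mean_metric(toy_series, kendall_tau))
    # Applied to the scores quoted in the abstract (PCC: 0.41 -> 0.59, FEP+ 0.68).
    print("PCC gap closed:", gap_closed(0.41, 0.59, 0.68))
```

Computing the metrics per series before averaging matches the lead-optimisation task described in the abstract, where what matters is ranking ligands within one congeneric series rather than across unrelated targets.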

Source journal

Communications Chemistry (Chemistry: General Chemistry)

CiteScore: 7.70

Self-citation rate: 1.70%

Articles published: 146

Review time: 13 weeks

About the journal: Communications Chemistry is an open access journal from Nature Research publishing high-quality research, reviews and commentary in all areas of the chemical sciences. Research papers published by the journal represent significant advances bringing new chemical insight to a specialized area of research. We also aim to provide a community forum for issues of importance to all chemists, regardless of sub-discipline.