Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

arXiv - STAT - Computation | Pub Date: 2024-06-09 | DOI: arxiv-2406.05738

Thomas Le Menestrel, Manuel Rivas



Abstract

Docking is a crucial component of drug discovery that aims to predict the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remain scarce. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, yielding more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds, and enables researchers to benchmark all major approaches to ML-based docking, such as graph-, Transformer-, and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking and to advance scientific research in this field.
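The counts in the abstract are consistent with a full ligand-by-protein grid: 1.7 million ligands times 15 proteins gives 25.5 million pairs, in line with "more than 25 million" scores. The sketch below checks that arithmetic and illustrates one way a multi-task table could be organized, with one docking score per protein for each ligand. It assumes every ligand is docked against every protein. The function names, protein labels, and score values are hypothetical and are not taken from the released dataset or code.

```python
from collections import defaultdict

# Figures taken from the abstract.
N_LIGANDS = 1_700_000
N_PROTEINS = 15


def expected_pairs(n_ligands: int, n_proteins: int) -> int:
    """Number of scores if every ligand is docked against every protein (assumption)."""
    return n_ligands * n_proteins


def to_multitask(records):
    """Group long-format (smiles, protein, score) rows into one entry per ligand.

    Returns {smiles: {protein: score}}. A protein with no score for a given
    ligand simply has no entry, so each protein acts as a separate task.
    """
    table = defaultdict(dict)
    for smiles, protein, score in records:
        table[smiles][protein] = score
    return dict(table)


# Hypothetical rows: protein labels and scores are made up for illustration.
toy_records = [
    ("CCO", "protein_A", -2.1),
    ("CCO", "protein_B", -2.4),
    ("c1ccccc1", "protein_A", -3.0),
]

multitask = to_multitask(toy_records)
print(expected_pairs(N_LIGANDS, N_PROTEINS))
print(multitask["CCO"])
```

Keeping the scores keyed by protein lets a single model predict several targets from one SMILES string, which is one plausible reading of "multi-task" here; the actual file layout of Smiles2Dock may differ.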


