Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
Thomas Le Menestrel, Manuel Rivas
arXiv - STAT - Computation, 2024-06-09
DOI: arxiv-2406.05738
Abstract
Docking is a crucial component in drug discovery aimed at predicting the
binding conformation and affinity between small molecules and target proteins.
ML-based docking has recently emerged as a prominent approach, outpacing
traditional methods like DOCK and AutoDock Vina in handling the growing scale
and complexity of molecular libraries. However, the availability of
comprehensive and user-friendly datasets for training and benchmarking ML-based
docking algorithms remains limited. We introduce Smiles2Dock, an open
large-scale multi-task dataset for molecular docking. We created a framework
combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL
database against 15 AlphaFold proteins, giving us more than 25 million
protein-ligand binding scores. The dataset leverages a wide range of
high-accuracy AlphaFold protein models, encompasses a diverse set of
biologically relevant compounds and enables researchers to benchmark all major
approaches for ML-based docking, such as graph-, Transformer- and CNN-based
methods. We also introduce a novel Transformer-based architecture for docking
score prediction and set it as an initial benchmark for our dataset. Our
dataset and code are publicly available to support the development of novel
ML-based methods for molecular docking to advance scientific research in this
field.
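
To make the "multi-task" framing concrete, the following is a minimal sketch of how ligand-protein docking scores might be arranged so that each ligand carries one regression target per protein. The abstract does not describe the actual file layout of Smiles2Dock, so the record format, the protein identifiers, the score values and the helper `to_multitask` are all hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Toy records in long format: (ligand SMILES, protein ID, docking score).
# All identifiers and scores are made up for illustration; they are not
# taken from Smiles2Dock.
records = [
    ("CCO", "protein_A", -2.1),
    ("CCO", "protein_B", -1.8),
    ("c1ccccc1", "protein_A", -3.4),
]


def to_multitask(records, proteins):
    """Pivot long-format records into one target vector per ligand.

    Missing (ligand, protein) pairs are left as None so that a training
    loop could mask them out of the loss.
    """
    index = {p: i for i, p in enumerate(proteins)}
    table = defaultdict(lambda: [None] * len(proteins))
    for smiles, protein, score in records:
        table[smiles][index[protein]] = score
    return dict(table)


proteins = ["protein_A", "protein_B"]
targets = to_multitask(records, proteins)

# Upper bound on the number of scores if every ligand were docked against
# every protein, using the figures quoted in the abstract.
max_pairs = 1_700_000 * 15
```

With the figures from the abstract, `max_pairs` comes to 25.5 million, which agrees with the stated "more than 25 million" protein-ligand binding scores.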