Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

arXiv - STAT - Computation, Pub Date: 2024-06-09, DOI: arxiv-2406.05738

Thomas Le Menestrel, Manuel Rivas


Citation count: 0

Abstract

Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking scores prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.
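The dataset size quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes every ligand is docked against every protein, which the abstract does not state explicitly, and it uses the rounded figure of 1.7 million ligands, so the result is only an approximate upper bound. The names `N_LIGANDS`, `N_PROTEINS`, and `max_pairs` are illustrative and do not come from the Smiles2Dock code.

```python
# Back-of-the-envelope check of the dataset size quoted in the abstract.
# Assumption: each ligand is docked against each protein exactly once.

N_LIGANDS = 1_700_000  # "1.7 million ligands" (rounded figure from the abstract)
N_PROTEINS = 15        # "15 AlphaFold proteins"


def max_pairs(n_ligands: int, n_proteins: int) -> int:
    """Upper bound on the number of scores for an all-against-all docking run."""
    return n_ligands * n_proteins


total = max_pairs(N_LIGANDS, N_PROTEINS)
print(f"{total:,} possible protein-ligand pairs")
```

Under these assumptions the product is 25.5 million pairs, which is consistent with the reported "more than 25 million" binding scores.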
