Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

Thomas Le Menestrel, Manuel Rivas
arXiv - STAT - Computation, published 2024-06-09. DOI: arxiv-2406.05738 (https://doi.org/arxiv-2406.05738)
Citations: 0

Abstract

Docking is a crucial component of drug discovery that aims to predict the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remain scarce. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, yielding more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds, and enables researchers to benchmark all major approaches to ML-based docking, such as graph-, Transformer-, and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking and to advance scientific research in this field.
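To make the "multi-task" framing concrete, the sketch below treats each protein as one prediction task and groups per-pair docking scores into one target vector per ligand. It is only an illustration under stated assumptions: the record layout, the function names `pivot_multitask` and `expected_pairs`, the protein IDs, and the score values are all hypothetical placeholders, not the actual Smiles2Dock schema or data. The final call simply checks the abstract's arithmetic: roughly 1.7 million ligands times 15 proteins gives about 25.5 million pairs, consistent with "more than 25 million" scores.

```python
from collections import defaultdict

# Hypothetical long-format records: (ligand SMILES, protein ID, docking score).
# Protein IDs and scores are placeholders for illustration only; they are
# not taken from the Smiles2Dock dataset.
RECORDS = [
    ("CCO", "protein_a", -2.1),
    ("CCO", "protein_b", -1.8),
    ("c1ccccc1", "protein_a", -3.4),
]


def pivot_multitask(records, proteins):
    """Group scores by ligand into target vectors ordered like `proteins`.

    Ligand-protein pairs without a score stay None so that a training
    loop can mask them out of a multi-task loss.
    """
    index = {protein: i for i, protein in enumerate(proteins)}
    table = defaultdict(lambda: [None] * len(proteins))
    for smiles, protein, score in records:
        table[smiles][index[protein]] = score
    return dict(table)


def expected_pairs(n_ligands, n_proteins):
    """Number of scores if every ligand is docked against every protein."""
    return n_ligands * n_proteins


targets = pivot_multitask(RECORDS, ["protein_a", "protein_b"])
print(targets["c1ccccc1"])            # [-3.4, None]
print(expected_pairs(1_700_000, 15))  # 25500000
```

Keeping missing pairs as explicit `None` entries, rather than dropping the ligand, is one possible design choice; it lets a single model be trained on all proteins at once even when some ligand-protein pairs have no score.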