Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
Thomas Le Menestrel, Manuel Rivas
arXiv:2406.05738 (arXiv - STAT - Computation), published 2024-06-09
Abstract
Docking is a crucial component of drug discovery that aims to predict the
binding conformation and affinity between small molecules and target proteins.
ML-based docking has recently emerged as a prominent approach, outpacing
traditional methods like DOCK and AutoDock Vina in handling the growing scale
and complexity of molecular libraries. However, the availability of
comprehensive and user-friendly datasets for training and benchmarking ML-based
docking algorithms remains limited. We introduce Smiles2Dock, an open
large-scale multi-task dataset for molecular docking. We created a framework
combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL
database against 15 AlphaFold proteins, yielding more than 25 million
protein-ligand binding scores. The dataset leverages a wide range of
high-accuracy AlphaFold protein models, encompasses a diverse set of
biologically relevant compounds, and enables researchers to benchmark all major
approaches for ML-based docking, such as graph-, Transformer-, and CNN-based
methods. We also introduce a novel Transformer-based architecture for docking
score prediction and set it as an initial benchmark for our dataset. Our
dataset and code are publicly available to support the development of novel
ML-based methods for molecular docking to advance scientific research in this
field.
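
As a rough illustration of the scale and the multi-task framing described above, the sketch below checks the pair count and groups per-pair scores into one target vector per ligand. It assumes every ligand is docked against every protein and that 1.7 million is a rounded figure. The protein names, the row layout, and the score values are hypothetical placeholders, not the dataset's actual schema or results.

```python
from collections import defaultdict

N_LIGANDS = 1_700_000  # approximate ligand count stated in the abstract
N_PROTEINS = 15        # protein count stated in the abstract


def expected_pairs(n_ligands: int, n_proteins: int) -> int:
    """Number of ligand-protein pairs if every ligand is docked against every protein."""
    return n_ligands * n_proteins


def to_multitask(rows):
    """Group long-format (smiles, protein, score) rows into one score dict per ligand.

    Each protein acts as one prediction task; a model would map a SMILES
    string to the vector of scores across proteins.
    """
    table = defaultdict(dict)
    for smiles, protein, score in rows:
        table[smiles][protein] = score
    return dict(table)


# Placeholder rows: hypothetical protein names and made-up scores,
# used only to show the reshaping. They are not values from the dataset.
rows = [
    ("CCO", "protein_A", -3.1),
    ("CCO", "protein_B", -2.8),
    ("c1ccccc1", "protein_A", -4.0),
]

print(expected_pairs(N_LIGANDS, N_PROTEINS))
print(to_multitask(rows))
```

Under these assumptions the product is 25.5 million pairs, which is consistent with the "more than 25 million" scores reported in the abstract.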