Combining crystallographic and binding affinity data towards a novel dataset of small molecule overlays

IF 3.1 · Zone 3 (Biology) · Q3 BIOCHEMISTRY & MOLECULAR BIOLOGY · Journal of Computer-Aided Molecular Design · Pub Date: 2024-12-04 · DOI: 10.1007/s10822-024-00581-1

Sophia M. N. Hönig, Torben Gutermuth, Christiane Ehrt, Christian Lemmen, Matthias Rarey


Citations: 0

Abstract

Although small molecule superposition is a standard technique in drug discovery, rigorously assessing the performance of superposition methods remains challenging. Datasets in this field are sparse, small, tailored to specific applications, unavailable, or outdated. The newly developed LOBSTER set described herein offers a publicly available, method-independent dataset for benchmarking and method optimization. LOBSTER stands for “Ligand Overlays from Binding SiTe Ensemble Representatives”. All ligands were derived from the PDB in a fully automated workflow that includes a ligand efficiency filter. Ligand ensembles were assembled by aligning identical binding sites, so the ligands within each ensemble are superimposed according to their experimentally determined binding orientation and conformation. Overall, 671 representative ligand ensembles comprise 3583 ligands from 3521 proteins. The 72,734 ligand pairs derived from these ensembles were grouped into ten distinct subsets by volume overlap, which introduces different degrees of difficulty for evaluating superposition methods. Statistics on the physicochemical properties of the compounds indicate that the dataset represents drug-like compounds. Consensus Diversity Plots show predominantly high Bemis–Murcko scaffold diversity and low median MACCS fingerprint similarity for each ensemble. An analysis of the underlying protein classes further demonstrates the heterogeneity within the dataset. The LOBSTER set supports a variety of applications, such as benchmarking multiple as well as pairwise alignments, generating training and test sets (for example, based on time splits), and empirical software performance evaluation studies. The LOBSTER set is publicly available at https://doi.org/10.5281/zenodo.12658320, representing a stable and versioned data resource. The open-source Python scripts are available at https://github.com/rareylab/LOBSTER and allow updating or recreating superposition sets with different data sources.
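Two steps named in the abstract, the ligand efficiency filter and the grouping of ligand pairs into ten subsets by volume overlap, can be illustrated with a minimal Python sketch. It is not taken from the LOBSTER scripts. The function names, the temperature of 298.15 K, the cutoff of 0.3 kcal/mol per heavy atom, the equal-width bins, and the assumption that volume overlap is normalized to [0, 1] are all hypothetical choices made for illustration; the abstract does not state the actual cutoff or bin edges.

```python
import math
from collections import defaultdict

# Assumptions (not from the paper): gas constant in kcal/(mol*K),
# a temperature of 298.15 K, and a hypothetical cutoff of 0.3.
R_KCAL = 0.0019872
TEMPERATURE_K = 298.15
HYPOTHETICAL_LE_CUTOFF = 0.3


def ligand_efficiency(p_affinity: float, heavy_atoms: int) -> float:
    """Return -dG per heavy atom in kcal/mol, with dG = -RT ln(10) * pAffinity.

    p_affinity is a negative log10 affinity such as pKd or pKi.
    """
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    delta_g = -R_KCAL * TEMPERATURE_K * math.log(10) * p_affinity
    return -delta_g / heavy_atoms


def passes_le_filter(p_affinity: float, heavy_atoms: int,
                     cutoff: float = HYPOTHETICAL_LE_CUTOFF) -> bool:
    """Keep a ligand only if its ligand efficiency reaches the cutoff."""
    return ligand_efficiency(p_affinity, heavy_atoms) >= cutoff


def overlap_subset(overlap: float, n_subsets: int = 10) -> int:
    """Map a volume overlap in [0, 1] to an equal-width subset index.

    Equal-width bins are an assumption; the paper's bin edges may differ.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    # An overlap of exactly 1.0 is placed in the last subset.
    return min(int(overlap * n_subsets), n_subsets - 1)


def group_pairs(pairs):
    """Group (ligand_a, ligand_b, overlap) tuples by overlap subset index."""
    subsets = defaultdict(list)
    for ligand_a, ligand_b, overlap in pairs:
        subsets[overlap_subset(overlap)].append((ligand_a, ligand_b))
    return dict(subsets)


if __name__ == "__main__":
    # Made-up example values, not data from the LOBSTER set.
    print(round(ligand_efficiency(9.0, 30), 2))
    print(group_pairs([("A", "B", 0.05), ("A", "C", 0.55), ("B", "C", 1.0)]))
```

Under these assumptions, subsets with a low index hold pairs with little shared volume, which would typically be the harder cases for a superposition method.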

Simplified illustration of the LOBSTER dataset generation.




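Source journal

Journal of Computer-Aided Molecular Design (Biology; Computer Science, Interdisciplinary Applications)

CiteScore

8.00

Self-citation rate

8.60%

Number of articles published

Review time

3 months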

About the journal: The Journal of Computer-Aided Molecular Design provides a forum for disseminating information on both the theory and the application of computer-based methods in the analysis and design of molecules. The scope of the journal encompasses papers that report new and original research and applications in the following areas: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; chemical database development and usage.