Combining crystallographic and binding affinity data towards a novel dataset of small molecule overlays

Journal of Computer-Aided Molecular Design, Vol. 39, No. 1 · Pub Date: 2024-12-04 · DOI: 10.1007/s10822-024-00581-1 · IF 3.0 · Zone 3 (Biology) · JCR Q3 (Biochemistry & Molecular Biology)
Sophia M. N. Hönig, Torben Gutermuth, Christiane Ehrt, Christian Lemmen, Matthias Rarey

Abstract

Small molecule superposition is a standard technique in drug discovery, yet a rigorous performance assessment of superposition methods remains challenging: datasets in this field are sparse, small, tailored to specific applications, unavailable, or outdated. The newly developed LOBSTER set described herein offers a publicly available, method-independent dataset for benchmarking and method optimization. LOBSTER stands for “Ligand Overlays from Binding SiTe Ensemble Representatives”. All ligands were derived from the PDB in a fully automated workflow that includes a ligand efficiency filter. Ligand ensembles were assembled by aligning identical binding sites, so the ligands within each ensemble are superimposed according to their experimentally determined binding orientation and conformation. Overall, 671 representative ligand ensembles comprise 3583 ligands from 3521 proteins. From these ensembles, 72,734 ligand pairs were grouped into ten distinct subsets by volume overlap, which introduces different degrees of difficulty for evaluating superposition methods. Statistics on the physicochemical properties of the compounds indicate that the dataset represents drug-like compounds. Consensus Diversity Plots show predominantly high Bemis–Murcko scaffold diversity and low median MACCS fingerprint similarity for each ensemble. An analysis of the underlying protein classes further demonstrates the heterogeneity of the dataset. The LOBSTER set supports a variety of applications, such as benchmarking multiple and pairwise alignments, generating training and test sets (for example, based on time splits), and empirical software performance evaluation studies. The LOBSTER set is publicly available at https://doi.org/10.5281/zenodo.12658320 as a stable, versioned data resource. The open-source Python scripts are available at https://github.com/rareylab/LOBSTER and allow superposition sets to be updated or recreated with different data sources.
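The abstract mentions a ligand efficiency filter without giving its definition or cutoff. As a minimal sketch, the widely used definition LE = -ΔG / N_heavy can be computed from a binding constant as shown below. The function names and the 0.3 kcal/mol per heavy atom threshold are illustrative assumptions, not values taken from the LOBSTER workflow.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def ligand_efficiency(affinity_molar: float, heavy_atoms: int,
                      temperature_k: float = 298.15) -> float:
    """Return LE = -dG / N_heavy in kcal/mol per heavy atom.

    affinity_molar is a dissociation-type constant (e.g. Kd or Ki) in mol/L,
    so dG = R * T * ln(K) is negative for sub-molar binders.
    """
    if affinity_molar <= 0:
        raise ValueError("affinity must be a positive concentration in mol/L")
    if heavy_atoms <= 0:
        raise ValueError("heavy atom count must be positive")
    delta_g = R_KCAL * temperature_k * math.log(affinity_molar)
    return -delta_g / heavy_atoms


def passes_le_filter(affinity_molar: float, heavy_atoms: int,
                     min_le: float = 0.3) -> bool:
    """Keep a ligand only if its LE reaches min_le.

    min_le = 0.3 is a hypothetical cutoff chosen for illustration;
    the cutoff used for LOBSTER is not stated in the abstract.
    """
    return ligand_efficiency(affinity_molar, heavy_atoms) >= min_le
```

Under these assumptions, a 1 nM binder with 30 heavy atoms has an LE of roughly 0.41 and passes, while a 10 µM binder with 40 heavy atoms (LE of roughly 0.17) is rejected.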

Simplified illustration of the LOBSTER dataset generation.
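The abstract states that ligand pairs were grouped into ten subsets by volume overlap. The sketch below shows one way such a grouping could look. It assumes that the overlap is normalized to the range 0 to 1 and that the subsets are equal-width bins; neither assumption is confirmed by the abstract, and the ligand identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def overlap_bin(overlap: float, n_bins: int = 10) -> int:
    """Map a normalized volume overlap in [0, 1] to a bin index 0..n_bins-1.

    Equal-width bins are an assumption made for illustration only.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    # An overlap of exactly 1.0 is placed in the last bin.
    return min(int(overlap * n_bins), n_bins - 1)


def group_pairs(pairs: Iterable[Tuple[str, str, float]],
                n_bins: int = 10) -> Dict[int, List[Tuple[str, str]]]:
    """Group (ligand_a, ligand_b, overlap) triples by overlap bin."""
    subsets: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for lig_a, lig_b, overlap in pairs:
        subsets[overlap_bin(overlap, n_bins)].append((lig_a, lig_b))
    return dict(subsets)


# Hypothetical ligand identifiers and overlap values.
example_pairs = [
    ("lig_a", "lig_b", 0.05),
    ("lig_a", "lig_c", 0.42),
    ("lig_b", "lig_c", 1.0),
]
example_subsets = group_pairs(example_pairs)
```

With this binning, low-index subsets hold pairs with little shared volume, which would presumably be the harder cases for a superposition method, and high-index subsets hold pairs that overlap almost completely.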

Source journal: Journal of Computer-Aided Molecular Design (category: Biology / Computer Science, Interdisciplinary Applications)
CiteScore: 8.00
Self-citation rate: 8.60%
Articles published: 56
Review time: 3 months
About the journal: The Journal of Computer-Aided Molecular Design provides a forum for disseminating information on both the theory and the application of computer-based methods in the analysis and design of molecules. The scope of the journal encompasses papers that report new and original research and applications in the following areas: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; chemical database development and usage.