An Open-Source Implementation of the Scaffold Identification and Naming System (SCINS) and Example Applications

Journal of Chemical Information and Modeling (IF 5.3; Zone 2, Chemistry; JCR Q1, Chemistry, Medicinal). Pub Date: 2024-10-15. DOI: 10.1021/acs.jcim.4c01314

Kamen P. Petrov, Andreas Bender



Abstract

Organizing and partitioning sets of chemical structures is of considerable practical significance, e.g., in compound library analysis and the postprocessing of screening hit lists. Approaches such as unsupervised clustering are computationally demanding and dataset-dependent; on the other hand, rule-based methods, such as those based on Murcko scaffolds, have linear time complexity but are often too fine-grained, leading to a large number of singletons or sparsely populated classes. An alternative rule-based method that seeks to achieve an optimal balance when grouping compounds into sets is the ‘Scaffold Identification and Naming System’ (SCINS). To facilitate public use of this previously published method, here, we provide an open-source Python implementation of SCINS, dependent only on RDKit. We show that SCINS can be useful in identifying sparsely and densely populated regions in chemical space in large databases, here exemplified with Enamine REAL Diverse and ChEMBL. We find that Enamine REAL Diverse covers a much smaller SCINS space relative to ChEMBL, whereas the opposite is true when Murcko and generic Murcko scaffolds are considered. Additionally, we show that SCINS can result in chemically intuitive grouping of medium-sized sets of bioactive compounds, which can be useful in compound selection from virtual screening campaigns as well as postprocessing of experimental hit lists. Hence, in this work, we provide both an open-source implementation of SCINS and its characterization with relevant use cases.
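The trade-off described above, between single-pass rule-based grouping and the proliferation of singleton classes, can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' SCINS implementation. The compound IDs, the fine-grained keys (stand-ins for scaffold SMILES), and the coarse keys (plain ring-count labels, not real SCINS strings) are hypothetical toy data, and the helper names `partition_by_key` and `singleton_fraction` are introduced here for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def partition_by_key(
    compounds: Iterable[str], key_fn: Callable[[str], Hashable]
) -> Dict[Hashable, List[str]]:
    """Group compounds by a deterministic key in one pass (linear time)."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for compound in compounds:
        groups[key_fn(compound)].append(compound)
    return dict(groups)


def singleton_fraction(groups: Dict[Hashable, List[str]]) -> float:
    """Fraction of classes that contain exactly one compound."""
    if not groups:
        return 0.0
    singletons = sum(1 for members in groups.values() if len(members) == 1)
    return singletons / len(groups)


# Hypothetical toy data: compound ID -> (fine-grained key, coarse key).
# The fine key mimics a scaffold SMILES; the coarse key is an illustrative
# label only and does NOT follow the actual SCINS format.
TOY_KEYS = {
    "cpd1": ("c1ccccc1", "1 ring"),
    "cpd2": ("c1ccncc1", "1 ring"),
    "cpd3": ("C1CCCCC1", "1 ring"),
    "cpd4": ("c1ccc2ccccc2c1", "2 fused rings"),
    "cpd5": ("c1ccc2ncccc2c1", "2 fused rings"),
}

# Same compounds, two rule-based partitions of different granularity.
fine = partition_by_key(TOY_KEYS, lambda cid: TOY_KEYS[cid][0])
coarse = partition_by_key(TOY_KEYS, lambda cid: TOY_KEYS[cid][1])

print(len(fine), singleton_fraction(fine))
print(len(coarse), singleton_fraction(coarse))
```

In this toy set every fine-grained class is a singleton, while the coarser key yields two populated classes. In practice the key function would be computed from the structure itself, for example with a cheminformatics toolkit such as RDKit, rather than looked up from a table.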




Source journal

Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore

9.80

Self-citation rate

10.70%

Articles published

529

Review time

1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.