Journal of Cheminformatics: latest articles, page 3

cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-28 DOI: 10.1186/s13321-024-00929-7

Emna Harigua-Souiai, Ons Masmoudi, Samer Makni, Rafeh Oualha, Yosser Z. Abdelkrim, Sara Hamdi, Oussama Souiai, Ikram Guizani

Computer-aided drug discovery (CADD) is nurtured by recent advances in big data analytics and artificial intelligence (AI) towards enhanced drug discovery (DD) outcomes. In this context, reliable datasets are of utmost importance. We herein present CidalsDB, a novel web server for AI-assisted DD against infectious pathogens, namely Leishmania parasites and Coronaviruses. We performed a literature search on molecules with validated anti-pathogen effects. Then, we consolidated these data with bioassays from PubChem. Finally, we constructed a database to store these datasets and make them accessible and ready to use for the scientific community through CidalsDB, a web-based interface. In a second step, we implemented and optimized four machine learning (ML) and three deep learning (DL) algorithms to predict the biological activity of molecules. Random Forests (RF), Multi-Layer Perceptron (MLP) and ChemBERTa were the best classifiers of anti-Leishmania molecules, while Gradient Boosting (GB), Graph-Convolutional Network (GCN) and ChemBERTa achieved the best performance on the Coronaviruses dataset. All six models were optimized and deployed through CidalsDB as anti-pathogen activity prediction models.

Scientific contribution

CidalsDB is an open access web-based tool that allows browsing and access to ready-to-use datasets of anti-pathogen molecules, alongside best performing AI models for biological activity prediction. It offers a democratized no-code platform for AI-based CADD, which shall foster innovation and collaboration within the DD community. CidalsDB is accessible through https://cidalsdb.streamlit.app/.
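The consolidation step described in the abstract (literature-curated molecules merged with PubChem bioassay records) could look roughly like the sketch below. It is only an illustration, not the authors' pipeline: the record layout, the `consolidate` function, and the rule of dropping compounds with conflicting labels are all assumptions, and the toy SMILES strings stand in for properly canonicalized identifiers.

```python
from collections import defaultdict


def consolidate(literature, pubchem):
    """Merge two lists of (compound_id, is_active) records.

    Hypothetical rule: keep a compound only when every record
    for it agrees on the activity label.
    """
    labels = defaultdict(set)
    for compound_id, is_active in list(literature) + list(pubchem):
        labels[compound_id].add(bool(is_active))
    # A compound with exactly one distinct label is unambiguous.
    return {cid: next(iter(seen)) for cid, seen in labels.items() if len(seen) == 1}


# Toy records (assumed data, for illustration only).
literature = [("CCO", False), ("c1ccccc1O", True)]
pubchem = [("c1ccccc1O", True), ("CCO", True), ("CCN", False)]

merged = consolidate(literature, pubchem)
print(merged)
```

Dropping conflicting compounds is the most conservative choice; a real curation effort might instead resolve conflicts by assay quality or majority vote.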


Citations: 0

Group graph: a molecular graph representation with enhanced performance, efficiency and interpretability

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-28 DOI: 10.1186/s13321-024-00933-x

Piao-Yang Cao, Yang He, Ming-Yang Cui, Xiao-Min Zhang, Qingye Zhang, Hong-Yu Zhang

The exploration of chemical space holds promise for developing influential chemical entities. Molecular representations, which reflect features of molecular structure in silico, assist in navigating chemical space appropriately. Unlike atom-level molecular representations, such as SMILES and atom graphs, which can sometimes lead to confusing interpretations of chemical substructures, substructure-level molecular representations encode important substructures into molecular features; they not only provide more information for predicting molecular properties and drug‒drug interactions but also help to interpret the correlations between molecular properties and substructures. However, it remains challenging to represent the entire molecular structure both completely and simply with substructure-level molecular representations. In this study, we developed a novel substructure-level molecular representation and named it the group graph. The group graph offers three advantages: (a) the substructures of the group graph reflect the diversity and consistency of different molecular datasets; (b) the group graph retains molecular structural features with minimal information loss, as shown by the graph isomorphism network (GIN) of the group graph, which performs well in predicting molecular properties and drug‒drug interactions, with higher accuracy and efficiency than models based on other molecular graphs, even without any pretraining; and (c) the molecular property may change when a substructure is substituted with another of differing importance in the group graph, facilitating the detection of activity cliffs. In addition, we successfully predicted structural modifications to improve blood‒brain barrier permeability (BBBP) via the GIN of the group graph. Therefore, the group graph has the advantage of simultaneously representing local molecular characteristics and global features.

Scientific contribution The group graph, as a substructure-level molecular representation, has the ability to retain molecular structural features with minimal information loss. As a result, it shows superior performance in predicting molecular properties and drug‒drug interactions with enhanced efficiency and interpretability.
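To make the idea of a substructure-level graph concrete, the sketch below collapses an atom-level graph into a graph whose nodes are groups. It is a minimal illustration under stated assumptions: the abstract does not describe how the authors partition atoms into groups, so the `atom_to_group` mapping and the `build_group_graph` helper are hypothetical.

```python
def build_group_graph(bonds, atom_to_group):
    """Collapse an atom-level graph into a substructure-level graph.

    bonds: iterable of (atom_i, atom_j) index pairs.
    atom_to_group: dict mapping each atom index to a group label.
    Returns (nodes, edges); an edge links two groups joined by a bond.
    """
    nodes = sorted(set(atom_to_group.values()))
    edges = set()
    for i, j in bonds:
        gi, gj = atom_to_group[i], atom_to_group[j]
        if gi != gj:  # bonds inside a group are absorbed into the node
            edges.add(tuple(sorted((gi, gj))))
    return nodes, sorted(edges)


# Toy phenol: atoms 0-5 form the ring, atom 6 is the oxygen.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
atom_to_group = {i: "benzene" for i in range(6)}
atom_to_group[6] = "hydroxyl"

nodes, edges = build_group_graph(bonds, atom_to_group)
print(nodes, edges)
```

Seven atoms and seven bonds reduce to two nodes and one edge, which is the kind of compression that would make a graph network on this representation cheaper to run.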


Citations: 0

Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-26 DOI: 10.1186/s13321-024-00928-8

Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extracting high-quality chemical reaction data from patent documents. A comparison on the same set of patents used in an earlier study showed that the proposed automated approach can enhance the current datasets by adding 26% new reactions. Several challenges were identified during reaction mining, and alternative solutions were proposed for some of them. A detailed analysis was also performed, in which several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.

Scientific contribution

In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.
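A comparison between a previously curated reaction set and a newly extracted one can be sketched as a set difference over normalized reaction strings. The abstract does not say how the 26% figure was computed, so the `normalize` rule, the `new_reaction_gain` helper, and the toy reaction strings below are assumptions made purely for illustration.

```python
def normalize(reaction):
    """Order-insensitive key for a 'reactants>>products' string (toy rule)."""
    reactants, products = reaction.split(">>")
    return (frozenset(reactants.split(".")), frozenset(products.split(".")))


def new_reaction_gain(existing, extracted):
    """Return reactions found only in `extracted`, plus their count
    relative to the size of the existing set."""
    known = {normalize(r) for r in existing}
    novel = {normalize(r) for r in extracted} - known
    return novel, len(novel) / len(known)


# Toy reaction strings (reagents omitted; not real patent data).
existing = ["CCO.CC(=O)O>>CC(=O)OCC", "CC=O>>CCO", "C=C>>CC", "CCBr>>CCO"]
extracted = ["CC(=O)O.CCO>>CC(=O)OCC", "CC=O>>CCO", "CCCl>>CCO"]

novel, gain = new_reaction_gain(existing, extracted)
print(len(novel), gain)
```

In practice the molecules on each side would need canonical SMILES before comparison; plain string matching, as used here, would miss equivalent reactions written differently.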


Citations: 0

GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-26 DOI: 10.1186/s13321-024-00927-9

Haochen Chen, Tao Liang, Kai Tan, Anan Wu, Xin Lu

In this work, inspired by the graph transformer, we present an improved protocol, termed GT-NMR, which integrates a 2D molecular graph representation with the Transformer architecture for accurate yet efficient prediction of NMR chemical shifts. The effectiveness of GT-NMR was thoroughly examined with the standard nmrshiftdb2 dataset, 37 natural products, and the structural elucidation of 11 pairs of natural products. Systematic analysis affirms that GT-NMR outperforms traditional graph-based methods in all aspects, achieving state-of-the-art performance, with mean absolute errors of 0.158 and 1.189 ppm in predicting ¹H and ¹³C NMR chemical shifts, respectively, for the standard nmrshiftdb2 dataset. Further scrutiny of its practical applications indicates that GT-NMR's efficacy is closely tied to molecular complexity, as quantified by the size-normalized spatial score (nSPS). For relatively simple molecules (nSPS ≤ 27.71), GT-NMR performs comparably to the best density functional, while its effectiveness diminishes significantly for complex molecules with higher nSPS values (nSPS ≥ 38.42). This trend is consistent across other graph-based NMR chemical shift prediction methods as well. Therefore, when employing GT-NMR or other graph-based methods for the rapid and routine prediction of NMR chemical shifts, it is advisable to use nSPS to assess their suitability. The source code and trained model of GT-NMR are publicly available on GitHub.

Scientific contribution

GT-NMR, which combines the 2D molecular graph representation with the Transformer architecture, was implemented for the first time to predict atom-level NMR chemical shifts, achieving state-of-the-art performance. More importantly, the reliability of GT-NMR and other graph-based methods was assessed for the first time in terms of molecular complexity, as quantified by the size-normalized spatial score (nSPS). Systematic scrutiny demonstrated that GT-NMR offers a valuable tool for routine structural screening and elucidation of relatively simple molecules.
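The two quantitative ideas in this abstract, the mean absolute error used to report accuracy and the nSPS check recommended before trusting a prediction, can be sketched as follows. The thresholds 27.71 and 38.42 come from the abstract; the function names, the "intermediate" label for the uncharacterized band between them, and the toy shift values are assumptions.

```python
SIMPLE_MAX_NSPS = 27.71   # at or below: comparable to the best density functional
COMPLEX_MIN_NSPS = 38.42  # at or above: effectiveness diminishes significantly


def mean_absolute_error(predicted, observed):
    """Average absolute deviation between predicted and observed shifts (ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def suitability(nsps):
    """Classify a molecule by its nSPS value (hypothetical labels)."""
    if nsps <= SIMPLE_MAX_NSPS:
        return "suitable"
    if nsps >= COMPLEX_MIN_NSPS:
        return "unreliable"
    return "intermediate"  # the abstract does not characterize this band


# Toy 1H shifts in ppm (made-up values, for illustration only).
mae = mean_absolute_error([7.26, 2.10, 1.25], [7.30, 2.00, 1.20])
print(round(mae, 4), suitability(20.0), suitability(45.0))
```

Computing nSPS itself requires a cheminformatics toolkit and is left out of this sketch.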


Citations: 0

Molecular identification via molecular fingerprint extraction from atomic force microscopy images

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-25 DOI: 10.1186/s13321-024-00921-1

Manuel González Lastre, Pablo Pou, Miguel Wiche, Daniel Ebeling, Andre Schirmeisen, Rubén Pérez

Non-contact atomic force microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome their limitations by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024-bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), which were developed for substructure and similarity searching. ECFPs provide local structural information about the molecule, with each bit correlating with a particular substructure within it. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy of 95.4% on theoretical images. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification. By construction, the number of times a given substructure occurs in the molecule is lost during the hashing process that makes the fingerprints useful for machine learning applications. We show that it is possible to complement the fingerprint-based virtual screening with global information provided by another DL model that predicts the chemical formula from the same HR-AFM stacks, boosting the identification accuracy to 97.6%. Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.

Scientific contribution

Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub-optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via a virtual screening, beating previous state-of-the-art models.
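The screening step, ranking candidate molecules by the Tanimoto similarity between a predicted fingerprint and each library fingerprint, can be sketched with plain Python sets of on-bit indices. This is only an illustration: the `screen` helper, the library names, and the bit indices are hypothetical, and real ECFP4 vectors would be generated with a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def screen(predicted_bits, library, top_k=3):
    """Rank library molecules by similarity to the predicted fingerprint.

    Returns (score, name) pairs, best first; the score doubles as a
    confidence value for each candidate.
    """
    scored = [(tanimoto(predicted_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy fingerprints (made-up bit indices, for illustration only).
predicted = {1, 5, 9, 12}
library = {
    "mol_a": {1, 5, 9, 12, 40},
    "mol_b": {1, 5, 77},
    "mol_c": {3, 8},
}

ranked = screen(predicted, library)
print(ranked)
```

Because each fingerprint records only whether a bit is set, two molecules that differ just in how often a substructure repeats can receive identical scores, which is why a second signal such as the chemical formula helps break ties.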


Citations: 0

A systematic review of deep learning chemical language models in recent era

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-18 DOI: 10.1186/s13321-024-00916-y

Hector Flores-Hernandez, Emmanuel Martinez-Ledesma

Discovering new chemical compounds with specific properties can provide advantages for fields that rely on materials for their development, although this task comes at a high cost in terms of complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized the process of designing molecules by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches have been developed to date, using a variety of architectures and strategies, to explore the extensive and discontinuous chemical space and to generate compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies used to generate molecules through deep learning techniques, based on the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most frequently employed techniques for biased model generation and exploration of specific chemical space regions. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected for a statistical analysis using graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
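The validity-novelty trade-off mentioned above is usually measured with three ratios: validity, uniqueness, and novelty. The sketch below computes them in the spirit of the MOSES and GuacaMol benchmarks, but it is not either library's API. The `generation_metrics` function is an assumption, and `balanced_parentheses` is a deliberately crude stand-in for a real chemical validity check, which would normally parse each SMILES string with a cheminformatics toolkit.

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty for generated SMILES.

    validity:   valid strings / all generated strings
    uniqueness: distinct valid strings / valid strings
    novelty:    distinct valid strings absent from training / distinct valid strings
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def balanced_parentheses(smiles):
    """Toy validity check: only verifies that branch parentheses match."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Toy model output and training set (assumed data).
generated = ["CCO", "CCO", "CC(C", "CCN", "c1ccccc1"]
training = ["CCO"]

metrics = generation_metrics(generated, training, balanced_parentheses)
print(metrics)
```

A model that memorizes its training set scores high on validity and low on novelty, while an aggressive sampler tends to do the reverse, which is the trade-off the review analyzes.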


Citations: 0

QSPRpred: a Flexible Open-Source Quantitative Structure-Property Relationship Modelling Tool

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-14 DOI: 10.1186/s13321-024-00908-y

Helle W. van den Maagdenberg, Martin Šícho, David Alencar Araripe, Sohvi Luukkonen, Linde Schoenmaker, Michiel Jespers, Olivier J. M. Béquignon, Marina Gorostiola González, Remco L. van den Broek, Andrius Bernatavicius, J. G. Coen van Hasselt, Piet H. van der Graaf, Gerard J. P. van Westen

Building reliable and robust quantitative structure–property relationship (QSPR) models is a challenging task. First, the experimental data needs to be obtained, analyzed and curated. Second, the number of available methods is continuously growing and evaluating different algorithms and methodologies can be arduous. Finally, the last hurdle that researchers face is to ensure the reproducibility of their models and facilitate their transferability into practice. In this work, we introduce QSPRpred, a toolkit for analysis of bioactivity data sets and QSPR modelling, which attempts to address the aforementioned challenges. QSPRpred’s modular Python API enables users to intuitively describe different parts of a modelling workflow using a plethora of pre-implemented components, but also integrates customized implementations in a “plug-and-play” manner. QSPRpred data sets and models are directly serializable, which means they can be readily reproduced and put into operation after training as the models are saved with all required data pre-processing steps to make predictions on new compounds directly from SMILES strings. The general-purpose character of QSPRpred is also demonstrated by inclusion of support for multi-task and proteochemometric modelling. The package is extensively documented and comes with a large collection of tutorials to help new users. In this paper, we describe all of QSPRpred’s functionalities and also conduct a small benchmarking case study to illustrate how different components can be leveraged to compare a diverse set of models. QSPRpred is fully open-source and available at https://github.com/CDDLeiden/QSPRpred.

Scientific Contribution

QSPRpred aims to provide a complex, but comprehensive Python API to conduct all tasks encountered in QSPR modelling from data preparation and analysis to model creation and model deployment. In contrast to similar packages, QSPRpred offers a wider and more exhaustive range of capabilities and integrations with many popular packages that also go beyond QSPR modelling. A significant contribution of QSPRpred is also in its automated and highly standardized serialization scheme, which significantly improves reproducibility and transferability of models.


Citations: 0

Accelerated hit identification with target evaluation, deep learning and automated labs: prospective validation in IRAK1

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-14 DOI: 10.1186/s13321-024-00914-0

Gintautas Kamuntavičius, Alvaro Prat, Tanya Paquet, Orestis Bastas, Hisham Abdel Aty, Qing Sun, Carsten B. Andersen, John Harman, Marc E. Siladi, Daniel R. Rines, Sarah J. L. Flatters, Roy Tal, Povilas Norvaišas

Background

Target identification and hit identification can be transformed through the application of biomedical knowledge analysis, AI-driven virtual screening and robotic cloud lab systems. However, few prospective studies have evaluated the efficacy of such integrated approaches.

Results

We integrate our in-house target evaluation (SpectraView) and deep-learning-driven virtual screening (HydraScreen) tools with an automated robotic cloud lab designed explicitly for ultra-high-throughput screening, enabling us to validate these platforms experimentally. Having used our target evaluation tool to select IRAK1 as the focus of our investigation, we prospectively validate our structure-based deep learning model. We identify 23.8% of all IRAK1 hits within the top 1% of ranked compounds. The model outperforms traditional virtual screening techniques and offers advanced features such as ligand pose confidence scoring. In addition, we identify three potent (nanomolar) scaffolds from our compound library, two of which represent novel candidates for IRAK1 and hold promise for future development.

Conclusion

This study provides compelling evidence that SpectraView and HydraScreen can significantly accelerate target identification and hit discovery. By leveraging Ro5’s HydraScreen and Strateos’ automated labs in hit identification for IRAK1, we show how AI-driven virtual screening with HydraScreen could offer high hit discovery rates and reduce experimental costs.

Scientific contribution

We present an innovative platform that combines knowledge-graph-based biomedical data analytics and AI-driven virtual screening with robotic cloud labs. Through an unbiased, prospective evaluation, we show the reliability and robustness of HydraScreen in virtual and high-throughput screening for hit identification in IRAK1. Our platforms and innovative tools can expedite the early stages of drug discovery.
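The headline figure in this abstract, 23.8% of all hits found within the top 1% of ranked compounds, is a recall-at-fraction statistic. A random ranking would be expected to place about 1% of the hits in the top 1%, so the reported value corresponds to roughly a 23.8-fold enrichment over chance. The sketch below shows one way such a number can be computed from a ranked screen; it is not the authors' evaluation code. The function name, the rounding of the cutoff, and the convention that a higher score means a better rank are all assumptions made for illustration.

```python
import math
from typing import Sequence


def hit_recall_at_fraction(
    scores: Sequence[float],
    is_hit: Sequence[bool],
    fraction: float,
) -> float:
    """Share of all experimental hits found in the top `fraction` of the ranking.

    Assumptions: higher score means better rank, and the cutoff is rounded up
    to at least one compound.
    """
    if len(scores) != len(is_hit):
        raise ValueError("scores and is_hit must have the same length")
    total_hits = sum(is_hit)
    if total_hits == 0:
        return 0.0
    n_top = max(1, math.ceil(fraction * len(scores)))
    # Indices of compounds sorted from best to worst predicted score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits_in_top = sum(is_hit[i] for i in ranked[:n_top])
    return hits_in_top / total_hits


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [True, False, True, False, False, False, True, False, False, True]
    recall = hit_recall_at_fraction(scores, labels, 0.5)
    print(f"recall in top 50%: {recall:.2f}")
    print(f"enrichment over random: {recall / 0.5:.2f}x")
```

Dividing the recall by the screened fraction gives the enrichment factor, which makes results comparable across libraries of different sizes.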


Citations: 0

Comparative evaluation of methods for the prediction of protein–ligand binding sites

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-11 DOI: 10.1186/s13321-024-00923-z

Javier S. Utgés, Geoffrey J. Barton

The accurate identification of protein–ligand binding sites is of critical importance in understanding and modulating protein function. Accordingly, ligand binding site prediction has remained a research focus for over three decades, with over 50 methods developed and a change of paradigm from geometry-based approaches to machine learning. In this work, we collate 13 ligand binding site predictors, spanning 30 years, focusing on the latest machine learning-based methods such as VN-EGNN, IF-SitePred, GrASP, PUResNet, and DeepPocket, and compare them to the established P2Rank, PRANK and fpocket and to earlier methods like PocketFinder, Ligsite and Surfnet. We benchmark the methods against the human subset of our new curated reference dataset, LIGYSIS. LIGYSIS is a comprehensive protein–ligand complex dataset comprising 30,000 proteins with bound ligands, which aggregates biologically relevant unique protein–ligand interfaces across biological units of multiple structures from the same protein. LIGYSIS is an improvement for testing methods over earlier datasets like sc-PDB, PDBbind, Binding MOAD, COACH420 and HOLO4K, which either include 1:1 protein–ligand complexes or consider asymmetric units. Re-scoring of fpocket predictions by PRANK and DeepPocket displays the highest recall (60%), whilst IF-SitePred presents the lowest recall (39%). We demonstrate the detrimental effect that redundant prediction of binding sites has on performance as well as the beneficial impact of stronger pocket scoring schemes, with improvements of up to 14% in recall (IF-SitePred) and 30% in precision (Surfnet). Finally, we propose top-N+2 recall as the universal benchmark metric for ligand binding site prediction and urge authors to share not only the source code of their methods, but also of their benchmark.

Scientific contributions

This study conducts the largest benchmark of ligand binding site prediction methods to date, comparing 13 original methods and 15 variants using 10 informative metrics. The LIGYSIS dataset is introduced, which aggregates biologically relevant protein–ligand interfaces across multiple structures of the same protein. The study highlights the detrimental effect of redundant binding site prediction and demonstrates significant improvement in recall and precision through stronger scoring schemes. Finally, top-N+2 recall is proposed as a universal benchmark metric for ligand binding site prediction, with a recommendation for open-source sharing of both methods and benchmarks.
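The abstract does not spell out top-N+2 recall, so the following is a hedged sketch of one common reading: for a protein with N known binding sites, only the N+2 best-ranked predicted pockets are considered, and recall is the share of known sites matched by at least one of them. The matching rule used here (distance between site and pocket centres within a threshold) and the per-protein aggregation are assumptions, as are the names `top_n_plus_2_recall` and `max_distance`; the paper's own criterion and threshold may differ.

```python
import math
from typing import Sequence

Point = tuple[float, float, float]


def top_n_plus_2_recall(
    true_sites: Sequence[Point],
    ranked_predictions: Sequence[Point],
    max_distance: float,
) -> float:
    """Share of true sites matched by one of the N+2 best-ranked predictions.

    N is the number of true sites. `ranked_predictions` must be ordered from
    best to worst. A site counts as matched when a considered prediction lies
    within `max_distance` of it (hypothetical centre-to-centre criterion).
    """
    if not true_sites:
        return 0.0
    considered = ranked_predictions[: len(true_sites) + 2]
    matched = sum(
        any(math.dist(site, pred) <= max_distance for pred in considered)
        for site in true_sites
    )
    return matched / len(true_sites)


if __name__ == "__main__":
    sites = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    # The pocket near the second site is ranked fifth, outside the top N+2 = 4.
    pockets = [
        (50.0, 0.0, 0.0),
        (0.0, 1.0, 0.0),
        (60.0, 0.0, 0.0),
        (70.0, 0.0, 0.0),
        (10.0, 0.0, 1.0),
    ]
    print(top_n_plus_2_recall(sites, pockets, max_distance=4.0))
```

Tying the cutoff to N keeps methods that emit many redundant pockets from inflating their recall, which is consistent with the abstract's point about redundant predictions.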


Citations: 0

Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning

IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2024-11-06 DOI: 10.1186/s13321-024-00920-2

Jue Wang, Yufan Liu, Boxue Tian

Predicting protein-small molecule binding sites, the initial step in structure-guided drug design, remains challenging for proteins lacking experimentally derived ligand-bound structures. Here, we propose CLAPE-SMB, which integrates a pre-trained protein language model with contrastive learning to provide high-accuracy predictions of small molecule binding sites, including for proteins without a published crystal structure. We trained and tested CLAPE-SMB on the SJC dataset, a non-redundant dataset based on sc-PDB, JOINED, and COACH420, and achieved an MCC of 0.529. We also compiled the UniProtSMB dataset, which merges sites from similar proteins based on raw data from the UniProtKB database, and achieved an MCC of 0.699 on the test set. In addition, CLAPE-SMB achieved an MCC of 0.815 on our intrinsically disordered protein (IDP) dataset, which contains 336 non-redundant sequences. Case studies of DAPK1, RebH, and Nep1 support the potential of this binding site prediction tool to aid in drug design. The code and datasets are freely available at https://github.com/JueWangTHU/CLAPE-SMB.

Scientific contribution

CLAPE-SMB combines a pre-trained protein language model with contrastive learning to accurately predict protein-small molecule binding sites, especially for proteins without experimental structures, such as IDPs. Trained across various datasets, this model shows strong adaptability, making it a valuable tool for advancing drug design and understanding protein-small molecule interactions.
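The results above are reported as Matthews correlation coefficients (MCC). For a binary per-residue labelling (binding vs. non-binding), MCC is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from −1 to 1 and is informative when one class is rare, as binding residues typically are. The sketch below is a plain restatement of that formula, not code from the CLAPE-SMB repository; the function name and the choice to return 0.0 for a zero denominator are conventions assumed here.

```python
import math


def mcc_from_counts(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention, assumed
    here), e.g. when every residue is predicted as non-binding.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Toy counts: 13 true binding residues among 100, 8 of them recovered.
    print(round(mcc_from_counts(tp=8, fp=2, tn=85, fn=5), 3))
```

Unlike accuracy, this score stays near zero for a predictor that labels every residue as non-binding, even though such a predictor would be right for most residues.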


Citations: 0