Journal of Cheminformatics: Latest Articles

CSearch: chemical space search via virtual synthesis and global optimization
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-12-05; DOI: 10.1186/s13321-024-00936-8
Hakjean Kim, Seongok Ryu, Nuri Jung, Jinsol Yang, Chaok Seok

The two key components of computational molecular design are virtually generating molecules and predicting the properties of these generated molecules. This study focuses on an effective method for molecular generation through virtual synthesis and global optimization of a given objective function. Using a pre-trained graph neural network (GNN) objective function to approximate the docking energies of compounds for four target receptors, we generated highly optimized compounds with 300–400 times less computational effort compared to virtual compound library screening. These optimized compounds exhibit similar synthesizability and diversity to known binders with high potency and are notably novel compared to library chemicals or known ligands. This method, called CSearch, can be effectively utilized to generate chemicals optimized for a given objective function. With the GNN function approximating docking energies, CSearch generated molecules with predicted binding poses to the target receptors similar to known inhibitors, demonstrating its effectiveness in producing drug-like binders.

Scientific contribution

We have developed a method for effectively exploring the chemical space of drug-like molecules using a global optimization algorithm with fragment-based virtual synthesis. The compounds generated using this method optimize the given objective function efficiently and are synthesizable like commercial library compounds. Furthermore, they are diverse, novel drug-like molecules with properties similar to known inhibitors for target receptors.

Citations: 0
Deepmol: an automated machine and deep learning framework for computational chemistry
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-12-05; DOI: 10.1186/s13321-024-00937-7
João Correia, João Capela, Miguel Rocha

The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite its potential to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, DeepMol stands out as an Automated ML (AutoML) tool by automating critical steps of the ML pipeline. DeepMol rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, DeepMol obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, DeepMol stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as the pioneering state-of-the-art tool in the field.

Scientific contribution

DeepMol aims to provide an integrated framework of AutoML for computational chemistry. DeepMol provides a more robust alternative to other tools with its integrated pipeline serialization, enabling seamless deployment using the fit, transform, and predict paradigms. It uniquely supports both conventional and deep learning models for regression, classification and multi-task, offering unmatched flexibility compared to other AutoML tools. DeepMol's predefined configurations and customizable objective functions make it accessible to users at all skill levels while enabling efficient and reproducible workflows. Benchmarking on diverse datasets demonstrated its ability to deliver optimized pipelines and superior performance across various molecular machine-learning tasks.
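
The fit, transform, and predict paradigm with pipeline serialization mentioned above can be illustrated with a minimal, library-free sketch. It does not use DeepMol's actual API: the classes `MeanCenter`, `ThresholdModel`, and `Pipeline`, the `dumps`/`loads` methods, and the toy data are all hypothetical and chosen only to show how a fitted pipeline might be saved and restored for deployment.

```python
import json

class MeanCenter:
    """Toy pre-processing step: subtracts the per-column training mean."""
    def fit(self, rows):
        self.means = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[x - m for x, m in zip(row, self.means)] for row in rows]

class ThresholdModel:
    """Toy classifier: thresholds the feature sum midway between the class means."""
    def fit(self, rows, labels):
        pos = [sum(r) for r, y in zip(rows, labels) if y == 1]
        neg = [sum(r) for r, y in zip(rows, labels) if y == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, rows):
        return [1 if sum(r) > self.threshold else 0 for r in rows]

class Pipeline:
    """Hypothetical pipeline chaining one transformer and one model."""
    def __init__(self):
        self.scaler, self.model = MeanCenter(), ThresholdModel()

    def fit(self, rows, labels):
        self.model.fit(self.scaler.fit(rows).transform(rows), labels)
        return self

    def predict(self, rows):
        return self.model.predict(self.scaler.transform(rows))

    def dumps(self):
        # Serialize only the fitted state, so the pipeline can be rebuilt elsewhere.
        return json.dumps({"means": self.scaler.means,
                           "threshold": self.model.threshold})

    @classmethod
    def loads(cls, text):
        state = json.loads(text)
        pipe = cls()
        pipe.scaler.means = state["means"]
        pipe.model.threshold = state["threshold"]
        return pipe

pipe = Pipeline().fit([[0.0, 0.0], [2.0, 2.0]], [0, 1])
restored = Pipeline.loads(pipe.dumps())
preds = restored.predict([[3.0, 3.0], [0.5, 0.5]])
print(preds)
```

Serializing the fitted state as a whole, rather than the individual steps, is what lets the restored object apply the same pre-processing at prediction time as it did at training time.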

Citations: 0
Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-12-03; DOI: 10.1186/s13321-024-00932-y
Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris

Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.


Scientific contribution

A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
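
The Sort & Slice procedure described in the abstract can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes each molecule has already been reduced to a set of integer ECFP substructure identifiers (in practice produced by a cheminformatics toolkit), the function names and toy identifiers are hypothetical, and breaking prevalence ties by identifier is an assumption of this sketch. A simple modulo-based fold is included to show the bit collisions that Sort & Slice avoids.

```python
from collections import Counter

def fit_sort_and_slice(training_mols, length):
    """Map the `length` most prevalent training substructures to bit positions."""
    # Count in how many training molecules each substructure occurs.
    counts = Counter(sub for mol in training_mols for sub in set(mol))
    # Sort by decreasing prevalence; ties broken by identifier (assumption).
    ranked = sorted(counts, key=lambda sub: (-counts[sub], sub))
    # Slice away everything but the top `length` substructures.
    return {sub: pos for pos, sub in enumerate(ranked[:length])}

def transform(mol, vocab, length):
    """Binary fingerprint: one dedicated bit per retained substructure."""
    bits = [0] * length
    for sub in mol:
        pos = vocab.get(sub)
        if pos is not None:  # substructures outside the vocabulary are ignored
            bits[pos] = 1
    return bits

def hash_fold(mol, length):
    """Hash-based folding for comparison: identifiers are reduced modulo `length`."""
    bits = [0] * length
    for sub in mol:
        bits[sub % length] = 1  # distinct substructures may share a bit
    return bits

# Toy training set: three molecules as sets of substructure identifiers.
train = [{11, 27, 35}, {11, 27, 48}, {11, 59, 63}]
vocab = fit_sort_and_slice(train, length=4)

fp = transform({11, 35, 99}, vocab, 4)
folded = hash_fold({11, 27, 35}, 4)
print(vocab, fp, folded)
```

In this toy case the three identifiers 11, 27, and 35 all leave remainder 3 when divided by 4, so folding collapses them into a single bit, whereas Sort & Slice assigns each retained substructure its own position.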

Citations: 0
cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-11-28; DOI: 10.1186/s13321-024-00929-7
Emna Harigua-Souiai, Ons Masmoudi, Samer Makni, Rafeh Oualha, Yosser Z. Abdelkrim, Sara Hamdi, Oussama Souiai, Ikram Guizani

Computer-aided drug discovery (CADD) is nurtured by recent advances in big data analytics and Artificial Intelligence (AI) towards enhanced drug discovery (DD) outcomes. In this context, reliable datasets are of utmost importance. We herein present CidalsDB, a novel web server for AI-assisted DD against infectious pathogens, namely Leishmania parasites and Coronaviruses. We performed a literature search on molecules with validated anti-pathogen effects. Then, we consolidated these data with bioassays from PubChem. Finally, we constructed a database to store these datasets and make them accessible and ready-to-use for the scientific community through CidalsDB, a web-based interface. In a second step, we implemented and optimized four machine learning (ML) and three deep learning (DL) algorithms that optimally predicted the biological activity of molecules. Random Forests (RF), Multi-Layer Perceptron (MLP) and ChemBERTa were the best classifiers of anti-Leishmania molecules, while Gradient Boosting (GB), Graph-Convolutional Network (GCN) and ChemBERTa achieved the best performances on the Coronaviruses dataset. All six models were optimized and deployed through CidalsDB as anti-pathogen activity prediction models.

Scientific contribution

CidalsDB is an open access web-based tool that allows browsing and access to ready-to-use datasets of anti-pathogen molecules, alongside best performing AI models for biological activity prediction. It offers a democratized no-code platform for AI-based CADD, which shall foster innovation and collaboration within the DD community. CidalsDB is accessible through https://cidalsdb.streamlit.app/.

Citations: 0
Group graph: a molecular graph representation with enhanced performance, efficiency and interpretability
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-11-28; DOI: 10.1186/s13321-024-00933-x
Piao-Yang Cao, Yang He, Ming-Yang Cui, Xiao-Min Zhang, Qingye Zhang, Hong-Yu Zhang

The exploration of chemical space holds promise for developing influential chemical entities. Molecular representations, which reflect features of molecular structure in silico, assist in navigating chemical space appropriately. Unlike atom-level molecular representations, such as SMILES and atom graph, which can sometimes lead to confusing interpretations about chemical substructures, substructure-level molecular representations encode important substructures into molecular features; they not only provide more information for predicting molecular properties and drug-drug interactions but also help to interpret the correlations between molecular properties and substructures. However, it remains challenging to represent the entire molecular structure both completely and simply with substructure-level molecular representations. In this study, we developed a novel substructure-level molecular representation and named it a group graph. The group graph offers three advantages: (a) the substructure of the group graph reflects the diversity and consistency of different molecular datasets; (b) the group graph retains molecular structural features with minimal information loss because the graph isomorphism network (GIN) of the group graph performs well in predicting molecular properties and drug-drug interactions, showing higher accuracy and efficiency than the models of other molecular graphs, even without any pretraining; and (c) the molecular property may change when a substructure is substituted with another of differing importance in the group graph, facilitating the detection of activity cliffs. In addition, we successfully predicted structural modifications to improve blood-brain barrier permeability (BBBP) via the GIN of the group graph. Therefore, the group graph has the advantage of simultaneously representing molecular local characteristics and global features.

Scientific contribution

The group graph, as a substructure-level molecular representation, has the ability to retain molecular structural features with minimal information loss. As a result, it shows superior performance in predicting molecular properties and drug-drug interactions with enhanced efficiency and interpretability.
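
The general idea of a substructure-level graph can be illustrated with a toy sketch that collapses an atom graph into a graph of groups. The abstract does not specify how the paper defines or detects its groups, so the atom-to-group assignment below is supplied by hand and is purely an assumption; the function name `build_group_graph` and the atom numbering are hypothetical.

```python
def build_group_graph(bonds, atom_to_group):
    """Collapse an atom-level graph into a substructure-level graph.

    bonds: iterable of (atom_index, atom_index) pairs.
    atom_to_group: dict mapping each atom index to a group label.
    Returns (sorted group labels, sorted list of group-group edges).
    """
    nodes = sorted(set(atom_to_group.values()))
    edges = set()
    for a, b in bonds:
        ga, gb = atom_to_group[a], atom_to_group[b]
        if ga != gb:  # bonds inside one group disappear from the group graph
            edges.add(tuple(sorted((ga, gb))))
    return nodes, sorted(edges)

# 4-aminobenzoic acid, heavy atoms only (hand-numbered for this sketch):
# atoms 0-5 form the ring, 6-8 the carboxyl group, 9 the amine nitrogen.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),  # ring
         (0, 6), (6, 7), (6, 8),                          # carboxyl
         (3, 9)]                                          # amine
atom_to_group = {i: "phenyl" for i in range(6)}
atom_to_group.update({6: "carboxyl", 7: "carboxyl", 8: "carboxyl", 9: "amine"})

nodes, edges = build_group_graph(bonds, atom_to_group)
print(nodes, edges)
```

Here ten atoms and ten bonds reduce to three nodes and two edges, which gives a rough sense of why a model operating on such a graph could be cheaper to run and easier to interpret at the substructure level.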

Citations: 0
Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
IF 7.1; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2024-11-26; DOI: 10.1186/s13321-024-00928-8
Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by the addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several incorrect entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.

Scientific contribution

In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.

随着人工智能(AI)时代的到来,人们现在有可能从以前未曾探索过的化学空间中设计出多种多样的新型分子。然而,化学家面临的一个挑战是如何合成这些分子。最近,有人尝试开发用于逆合成预测的人工智能模型,但这有赖于高质量训练数据集的可用性。在这项工作中,我们探索了大语言模型(LLM)从专利文件中提取高质量化学反应数据的适用性。对早期研究中的同一组专利进行的比较研究表明,所提出的自动化方法可以增加 26% 的新反应,从而增强当前数据集的能力。在反应挖掘过程中发现了一些挑战,并针对其中一些挑战提出了替代解决方案。此外,还进行了详细的分析,发现了之前数据集中的几个错误条目。在更大的专利数据集上使用所提出的管道提取反应,可以提高未来合成预测模型的准确性和效率。科学贡献 在这项工作中,我们评估了大语言模型从专利文献中挖掘高质量化学反应数据集的适用性。结果表明,所提出的方法可以通过识别更多的化学反应来显著提高反应数据库的数量,并通过纠正以前的错误/假阳性反应来提高反应数据库的质量。
GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-26 DOI: 10.1186/s13321-024-00927-9
Haochen Chen, Tao Liang, Kai Tan, Anan Wu, Xin Lu

In this work, inspired by the graph transformer, we present an improved protocol, termed GT-NMR, which integrates a 2D molecular graph representation with the Transformer architecture for accurate yet efficient prediction of NMR chemical shifts. The effectiveness of GT-NMR was thoroughly examined with the standard nmrshiftdb2 dataset, 37 natural products, and the structural elucidation of 11 pairs of natural products. Systematic analysis affirms that GT-NMR outperforms traditional graph-based methods in all aspects, achieving state-of-the-art performance, with mean absolute errors of 0.158 and 1.189 ppm in predicting ¹H and ¹³C NMR chemical shifts, respectively, for the standard nmrshiftdb2 dataset. Further scrutiny of its practical applications indicates that GT-NMR's efficacy is closely tied to molecular complexity, as quantified by the size-normalized spatial score (nSPS). For relatively simple molecules (nSPS ≤ 27.71), GT-NMR performs comparably to the best density functional, while its effectiveness diminishes significantly for complex molecules characterized by higher nSPS values (nSPS ≥ 38.42). This trend is consistent across other graph-based NMR chemical shift prediction methods as well. Therefore, when employing GT-NMR or other graph-based methods for the rapid and routine prediction of NMR chemical shifts, it is advisable to use nSPS to assess their suitability. The source code and trained model of GT-NMR are publicly available on GitHub.

Scientific contribution

GT-NMR, which combines the 2D molecular graph representation with the Transformer architecture, was implemented for the first time to predict atom-level NMR chemical shifts, achieving state-of-the-art performance. More importantly, the reliability of GT-NMR and other graph-based methods was assessed for the first time in terms of molecular complexity, as quantified by the size-normalized spatial score (nSPS). Systematic scrutiny demonstrated that GT-NMR offers a valuable route for routine application in structural screening and elucidation of relatively simple molecules.
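The two quantities used in this abstract, the mean absolute error (MAE) of predicted shifts and the nSPS-based suitability check, can be sketched as follows. This is a minimal illustration, not the GT-NMR code: the function names are hypothetical, the nSPS value is assumed to be computed elsewhere, and the label "intermediate" for values between the two reported thresholds is an assumption, since the abstract makes no statement about that range.

```python
def mean_absolute_error(predicted, reference) -> float:
    """MAE between predicted and reference chemical shifts (same units, e.g. ppm)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Thresholds quoted in the abstract.
SIMPLE_MAX = 27.71   # nSPS <= 27.71: comparable to the best density functional
COMPLEX_MIN = 38.42  # nSPS >= 38.42: effectiveness diminishes significantly


def nsps_regime(nsps: float) -> str:
    """Classify a molecule by its precomputed nSPS value."""
    if nsps <= SIMPLE_MAX:
        return "simple"
    if nsps >= COMPLEX_MIN:
        return "complex"
    return "intermediate"  # range not characterized in the abstract


# Toy shifts in ppm (hypothetical values).
print(mean_absolute_error([7.30, 2.10, 1.00], [7.20, 2.30, 1.00]))
print(nsps_regime(20.0), nsps_regime(45.0))
```

A workflow could call `nsps_regime` before trusting a graph-based prediction and fall back to a more expensive method for molecules flagged as complex.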

Molecular identification via molecular fingerprint extraction from atomic force microscopy images
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-25 DOI: 10.1186/s13321-024-00921-1
Manuel González Lastre, Pablo Pou, Miguel Wiche, Daniel Ebeling, Andre Schirmeisen, Rubén Pérez

Non-contact atomic force microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome the limitations of those models by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024-bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), which were developed for substructure and similarity searching. ECFPs provide local structural information about the molecule, with each bit correlating with a particular substructure within the molecule. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy of 95.4% on theoretical images. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification. By construction, the number of times a certain substructure occurs in the molecule is lost during the hashing process that makes the fingerprints useful for machine learning applications. We show that it is possible to complement the fingerprint-based virtual screening with global information provided by another DL model that predicts the chemical formula from the same HR-AFM stacks, boosting the identification accuracy up to 97.6%. Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.

Scientific contribution

Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub-optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via virtual screening, beating previous state-of-the-art models.
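The screening step described above, ranking library molecules by Tanimoto similarity to a predicted fingerprint, can be sketched in plain Python. This is an illustration under assumptions rather than the authors' pipeline: fingerprints are represented as sets of on-bit indices, the library entries and all function names are hypothetical, and `fold_to_bits` uses a simple modulo folding to show how occurrence counts (and distinct identifiers that collide) are lost when substructure identifiers are mapped to a fixed-length bit vector.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two bit sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def screen(predicted_bits: frozenset, library: dict, top_k: int = 3):
    """Rank library molecules by similarity to the predicted fingerprint."""
    scored = [(tanimoto(predicted_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties broken by name
    return [(name, score) for score, name in scored[:top_k]]


def fold_to_bits(substructure_ids, n_bits: int = 1024) -> frozenset:
    """Map substructure identifiers to on-bits; multiplicity is discarded."""
    return frozenset(i % n_bits for i in substructure_ids)


# Toy library with hypothetical on-bit sets.
library = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 5}),
    "C": frozenset({7, 8}),
}
print(screen(frozenset({1, 2, 3}), library, top_k=2))
print(fold_to_bits([5, 5, 5, 1029]))
```

Each returned score doubles as the confidence value mentioned in the abstract. Because folding drops counts, two molecules that differ only in how often a substructure occurs can share a fingerprint, which is where a complementary prediction such as the chemical formula helps.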

A systematic review of deep learning chemical language models in recent era
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-18 DOI: 10.1186/s13321-024-00916-y
Hector Flores-Hernandez, Emmanuel Martinez-Ledesma

Discovering new chemical compounds with specific properties can provide advantages for fields that rely on materials for their development, although this task comes at a high cost in terms of complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized the process of designing molecules by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches have been developed to date, using a variety of architectures and strategies, in order to explore the extensive and discontinuous chemical space, providing benefits for generating compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies used to generate molecules through deep learning techniques, using the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are considered the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most employed techniques for biased model generation and exploration of specific chemical space regions. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected to conduct a statistical analysis using graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
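The validity-novelty trade-off mentioned above is usually discussed through three ratios computed on a batch of generated molecules. The sketch below follows the commonly used definitions (validity over all generated strings, uniqueness over valid ones, novelty over unique valid ones not seen in training). It is an approximation for illustration, not the MOSES or GuacaMol implementation: the function names are hypothetical, strings are assumed to be canonical SMILES so that set comparison is meaningful, and `toy_is_valid` is a stand-in for a real SMILES parser.

```python
def generation_metrics(generated, training, is_valid) -> dict:
    """Validity, uniqueness and novelty of a batch of generated SMILES strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Stand-in check: non-empty with balanced parentheses. Not a real SMILES parser."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


# Toy batch (hypothetical): one duplicate, one malformed string, one training molecule.
generated = ["CCO", "CCO", "CC(=O)O", "CC(C", "c1ccccc1"]
training = ["CCO"]
print(generation_metrics(generated, training, toy_is_valid))
```

A model that simply memorizes its training set scores high on validity and low on novelty, which is the tension the review examines.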

QSPRpred: a Flexible Open-Source Quantitative Structure-Property Relationship Modelling Tool
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-14 DOI: 10.1186/s13321-024-00908-y
Helle W. van den Maagdenberg, Martin Šícho, David Alencar Araripe, Sohvi Luukkonen, Linde Schoenmaker, Michiel Jespers, Olivier J. M. Béquignon, Marina Gorostiola González, Remco L. van den Broek, Andrius Bernatavicius, J. G. Coen van Hasselt, Piet H. van der Graaf, Gerard J. P. van Westen

Building reliable and robust quantitative structure–property relationship (QSPR) models is a challenging task. First, the experimental data needs to be obtained, analyzed and curated. Second, the number of available methods is continuously growing and evaluating different algorithms and methodologies can be arduous. Finally, the last hurdle that researchers face is to ensure the reproducibility of their models and facilitate their transferability into practice. In this work, we introduce QSPRpred, a toolkit for analysis of bioactivity data sets and QSPR modelling, which attempts to address the aforementioned challenges. QSPRpred’s modular Python API enables users to intuitively describe different parts of a modelling workflow using a plethora of pre-implemented components, but also integrates customized implementations in a “plug-and-play” manner. QSPRpred data sets and models are directly serializable, which means they can be readily reproduced and put into operation after training as the models are saved with all required data pre-processing steps to make predictions on new compounds directly from SMILES strings. The general-purpose character of QSPRpred is also demonstrated by inclusion of support for multi-task and proteochemometric modelling. The package is extensively documented and comes with a large collection of tutorials to help new users. In this paper, we describe all of QSPRpred’s functionalities and also conduct a small benchmarking case study to illustrate how different components can be leveraged to compare a diverse set of models. QSPRpred is fully open-source and available at https://github.com/CDDLeiden/QSPRpred.


Scientific Contribution

QSPRpred aims to provide a complex, but comprehensive Python API to conduct all tasks encountered in QSPR modelling from data preparation and analysis to model creation and model deployment. In contrast to similar packages, QSPRpred offers a wider and more exhaustive range of capabilities and integrations with many popular packages that also go beyond QSPR modelling. A significant contribution of QSPRpred is also in its automated and highly standardized serialization scheme, which significantly improves reproducibility and transferability of models.
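The idea of saving a model together with its pre-processing, so that a reloaded model can predict directly from SMILES strings, can be illustrated with a toy example. The code below does not use or imitate the QSPRpred API: the class `SerializableModel`, its symbol-counting featurizer, and the JSON layout are all hypothetical and chosen only to show why bundling the two improves reproducibility.

```python
import json


class SerializableModel:
    """Toy linear model that stores its featurization settings alongside its weights."""

    def __init__(self, symbols, weights, bias=0.0):
        self.symbols = list(symbols)   # pre-processing settings: which symbols to count
        self.weights = list(weights)   # one weight per counted symbol
        self.bias = bias

    def featurize(self, smiles: str):
        # Crude descriptor: number of occurrences of each symbol in the SMILES string.
        return [smiles.count(s) for s in self.symbols]

    def predict(self, smiles: str) -> float:
        features = self.featurize(smiles)
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def to_json(self) -> str:
        # Pre-processing settings and parameters are saved in the same document.
        return json.dumps({"symbols": self.symbols, "weights": self.weights, "bias": self.bias})

    @classmethod
    def from_json(cls, text: str) -> "SerializableModel":
        data = json.loads(text)
        return cls(data["symbols"], data["weights"], data["bias"])


model = SerializableModel(["O", "N"], [0.5, -1.0], bias=1.0)
restored = SerializableModel.from_json(model.to_json())
print(restored.predict("CCO"), restored.predict("NCCO"))
```

Because the featurization settings travel with the weights, the restored object reproduces the original predictions without any separately maintained pre-processing script.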
