
Journal of Cheminformatics: Latest Articles

Identifying uncertainty in physical–chemical property estimation with IFSQSAR
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-30 | DOI: 10.1186/s13321-024-00853-w
Trevor N. Brown, Alessandro Sangion, Jon A. Arnot

This study describes the development and evaluation of six new models for predicting physical–chemical (PC) properties that are highly relevant for chemical hazard, exposure, and risk estimation: solubility in water (SW) and in octanol (SO), vapor pressure (VP), and the octanol–water (KOW), octanol–air (KOA), and air–water (KAW) partition ratios. The models are implemented in the Iterative Fragment Selection Quantitative Structure–Activity Relationship (IFSQSAR) Python package, version 1.1.0, as Poly-Parameter Linear Free Energy Relationship (PPLFER) equations that combine experimentally calibrated system parameters with solute descriptors predicted by QSPRs. Two ancillary models have also been developed and implemented: a QSPR for molar volume (MV) and a classifier for the physical state of chemicals at room temperature. The IFSQSAR methods for characterizing the applicability domain (AD) and calculating uncertainty estimates, expressed as 95% prediction intervals (PIs) for predicted properties, are described and tested on 9,000 measured partition ratios and 4,000 VP and SW values. The measured data are external to the IFSQSAR training and validation datasets and are used to assess the predictivity of the models for “novel chemicals” in an unbiased manner. The 95% PIs calculated from validation datasets for partition ratios needed to be scaled by a factor of 1.25 to capture 95% of the external data. Predictions for VP and SW are more uncertain, primarily because of the difficulty of determining the physical state of chemicals (i.e., liquid or solid) at room temperature. The prediction accuracy of the models for log KOW, log KAW and log KOA of novel, data-poor chemicals is estimated to be in the range of 0.7 to 1.4 root mean squared error of prediction (RMSEP), with RMSEP in the range 1.7–1.8 for log VP and log SW.

Scientific contribution

New partitioning models integrate empirical PPLFER equations and QSARs, allowing for seamless integration of experimental data and model predictions. This work tests the real predictivity of the models for novel chemicals which are not in the model training or external validation datasets.
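The abstract reports two quantities that are easy to illustrate: the RMSEP on external data, and the factor by which 95% prediction intervals must be widened to actually capture 95% of that data. The following is a minimal sketch of both calculations, not the IFSQSAR implementation; the function names and the representation of each interval as a symmetric half-width around the prediction are assumptions made for illustration.

```python
import math


def rmsep(observed, predicted):
    """Root mean squared error of prediction over paired values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def coverage(observed, predicted, half_widths, scale=1.0):
    """Fraction of observations inside predicted +/- scale * half_width."""
    inside = sum(
        abs(o - p) <= scale * h
        for o, p, h in zip(observed, predicted, half_widths)
    )
    return inside / len(observed)


def min_scale_for_coverage(observed, predicted, half_widths, target=0.95):
    """Smallest scale factor whose intervals cover at least `target` of the data.

    Assumption: intervals are symmetric, so each point is covered once
    scale >= |error| / half_width.
    """
    ratios = sorted(
        abs(o - p) / h for o, p, h in zip(observed, predicted, half_widths)
    )
    k = math.ceil(target * len(ratios))  # number of points that must be covered
    return ratios[k - 1]
```

Under these assumptions, a returned factor above 1 would indicate that intervals derived from validation data are too narrow for the external set, which is the situation the abstract describes for partition ratios.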

Citations: 0
MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-30 | DOI: 10.1186/s13321-024-00861-w
Morgan Thomas, Noel M. O’Boyle, Andreas Bender, Chris De Graaf

Generative models are undergoing rapid research and application to de novo drug design. To facilitate their application and evaluation, we present MolScore. MolScore already contains many drug-design-relevant scoring functions commonly used in benchmarks, such as molecular similarity, molecular docking, predictive models, synthesizability, and more. In addition, it provides performance metrics to evaluate generative models based on the chemistry generated. With this unification of functionality, MolScore re-implements commonly used benchmarks in the field (such as GuacaMol, MOSES, and MolOpt). Moreover, new benchmarks can be created trivially. We demonstrate this by testing a chemical language model with reinforcement learning on three new tasks of increasing complexity related to the design of 5-HT2a ligands that utilise either molecular descriptors, 266 pre-trained QSAR models, or dual molecular docking. Lastly, MolScore can be integrated into an existing Python script with just three lines of code. This framework is a step towards unifying generative model application and evaluation as applied to drug design for both practitioners and researchers. The framework can be found on GitHub and downloaded directly from the Python Package Index.

Scientific Contribution

MolScore is an open-source platform to facilitate generative molecular design and evaluation thereof for application in drug design. This platform takes important steps towards unifying existing benchmarks, providing a platform to share new benchmarks, and improving customisation, flexibility and usability for practitioners over existing solutions.

Citations: 0
Consensus holistic virtual screening for drug discovery: a novel machine learning model approach
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-28 | DOI: 10.1186/s13321-024-00855-8
Said Moshawih, Zhen Hui Bu, Hui Poh Goh, Nurolaini Kifli, Lam Hong Lee, Khang Wen Goh, Long Chiau Ming

In drug discovery, virtual screening is crucial for identifying potential hit compounds. This study presents a novel pipeline that employs machine learning models to amalgamate various conventional screening methods. A diverse array of protein targets was selected, and their corresponding datasets were subjected to active/decoy distribution analysis prior to scoring using four distinct methods: QSAR, pharmacophore, docking, and 2D shape similarity, which were ultimately integrated into a single consensus score. The fine-tuned machine learning models were ranked using the novel formula “w_new”, consensus scores were calculated, and an enrichment study was performed for each target. Consensus scoring outperformed other methods for specific protein targets such as PPARG and DPP4, achieving AUC values of 0.90 and 0.84, respectively. This approach also consistently prioritized compounds with higher experimental pIC50 values compared to all other screening methodologies. Moreover, the models demonstrated moderate to high performance in terms of R² values during external validation. In conclusion, this novel workflow consistently delivered superior results, emphasizing the significance of a holistic approach in drug discovery, where both quantitative metrics and active enrichment play pivotal roles in identifying the best virtual screening methodology.

Scientific contribution

We presented a novel consensus scoring workflow in virtual screening, merging diverse methods for enhanced compound selection. We also introduced ‘w_new’, a groundbreaking metric that intricately refines machine learning model rankings by weighing various model-specific parameters, revolutionizing their efficacy in drug discovery in addition to other domains.
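The idea of merging several screening scores into one consensus score and then checking how well it separates actives from decoys can be sketched as follows. This is an illustration only: the abstract does not give the “w_new” formula or the integration scheme, so the z-score normalisation, the equal default weights, and the convention that higher scores are better are all assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise one method's scores so methods on different scales are comparable."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def consensus_scores(method_scores, weights=None):
    """Weighted mean of z-scored method scores, one value per compound.

    method_scores maps a method name to its list of per-compound scores.
    Assumption: higher is better for every method (flip signs beforehand if not).
    """
    names = list(method_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights, an assumption
    total = sum(weights[name] for name in names)
    normalized = {name: zscores(method_scores[name]) for name in names}
    n = len(method_scores[names[0]])
    return [
        sum(weights[m] * normalized[m][i] for m in names) / total
        for i in range(n)
    ]


def roc_auc(scores, labels):
    """Probability that a random active outranks a random decoy (ties count half)."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported values of 0.90 and 0.84.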

Citations: 0
TransExION: a transformer based explainable similarity metric for comparing IONS in tandem mass spectrometry
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-28 | DOI: 10.1186/s13321-024-00858-5
Danh Bui-Thi, Youzhong Liu, Jennifer L. Lippens, Kris Laukens, Thomas De Vijlder

Small molecule identification is a crucial task in analytical chemistry and life sciences. One of the most commonly used technologies to elucidate small molecule structures is mass spectrometry. Spectral library search of product ion spectra (MS/MS) is a popular strategy to identify or find structural analogues. This approach relies on the assumption that spectral similarity and structural similarity are correlated. However, popular spectral similarity measures, usually calculated based on identical fragment matches between the MS/MS spectra, do not always accurately reflect the structural similarity. In this study, we propose TransExION, a Transformer based Explainable similarity metric for IONS. TransExION detects related fragments between MS/MS spectra through their mass difference and uses these to estimate spectral similarity. These related fragments can be nearly identical, but can also share a substructure. TransExION also provides a post-hoc explanation of its estimation, which can be used to support scientists in evaluating the spectral library search results and thus in structure elucidation of unknown molecules. Our model has a Transformer based architecture and it is trained on the data derived from GNPS MS/MS libraries. The experimental results show that it improves existing spectral similarity measures in searching and interpreting structural analogues as well as in molecular networking.

Scientific contribution

We propose a Transformer-based spectral similarity metric that improves the comparison of small-molecule tandem mass spectra. We provide a post-hoc explanation that can serve as a good starting point for annotating unknown spectra based on database spectra.
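The limitation described in the abstract, that similarity measures built on identical fragment matches can miss structural analogues, can be seen in a small sketch of such a baseline. This is not TransExION and not any particular library's implementation; the greedy one-to-one peak matching, the `tol` parameter, and the function name are assumptions chosen for illustration.

```python
import math


def matched_cosine(spec_a, spec_b, tol=0.01):
    """Cosine similarity over peaks whose m/z values agree within `tol`.

    Each spectrum is a list of (mz, intensity) pairs. Peaks are paired
    greedily, largest intensity product first, each peak used at most once.
    """
    candidates = [
        (ia * ib, i, j)
        for i, (mza, ia) in enumerate(spec_a)
        for j, (mzb, ib) in enumerate(spec_b)
        if abs(mza - mzb) <= tol
    ]
    candidates.sort(reverse=True)
    used_a, used_b, dot = set(), set(), 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = math.sqrt(sum(inten * inten for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten * inten for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this baseline, a spectrum compared with a copy of itself in which every peak is shifted by a constant mass (for example an added CH2 group) scores zero, even though the two molecules may share most of their structure. Relating fragments through their mass differences, as the abstract describes, is aimed at this case.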
Citations: 0
Solvent flashcards: a visualisation tool for sustainable chemistry
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-28 | DOI: 10.1186/s13321-024-00854-9
Joseph Heeley, Samuel Boobier, Jonathan D. Hirst

Selecting greener solvents during experiment design is imperative for greener chemistry. While many solvent selection guides are currently used in the pharmaceutical industry, these are often paper-based guides which can make it difficult to identify and compare specific solvents. This work presents a stand-alone version of the solvent flashcards that were developed as part of the AI4Green electronic laboratory notebook. The functionality is an intuitive and interactive interface for the visualisation of data from CHEM21, a pharmaceutical solvent selection guide that categorises solvents according to “greenness”. This open-source software is written in Python, JavaScript, HTML and CSS and allows users to directly contrast and compare specific solvents by generating colour-coded flashcards. It can be installed locally using pip, or alternatively the source code is available on GitHub: https://github.com/AI4Green/solvent_flashcards. The documentation can also be found on GitHub or on the corresponding Python Package Index webpage: https://pypi.org/project/solvent-guide/.

Scientific contribution

This simple, easy-to-use digital tool visualises solvent greenness data through a novel, intuitive interface, encouraging green chemistry. It offers numerous advantages over traditional solvent selection guides, allowing users to directly customise the solvent list and compare only the most important solvents side by side. Release as a stand-alone package will maximise the benefit of the software.
Citations: 0
Development of scoring-assisted generative exploration (SAGE) and its application to dual inhibitor design for acetylcholinesterase and monoamine oxidase B
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-24 | DOI: 10.1186/s13321-024-00845-w
Hocheol Lim

De novo molecular design is the process of searching chemical space for drug-like molecules with desired properties, and deep learning has been recognized as a promising solution. In this study, I developed an effective computational method called Scoring-Assisted Generative Exploration (SAGE) to enhance chemical diversity and property optimization through virtual synthesis simulation, the generation of bridged bicyclic rings, and multiple scoring models for drug-likeness. For six protein targets, SAGE generated molecules with high scores within reasonable numbers of steps by optimizing target specificity, both without constraints and with multiple constraints such as synthetic accessibility, solubility, and metabolic stability. Furthermore, I proposed a top-ranked molecule generated with SAGE as a dual inhibitor of acetylcholinesterase and monoamine oxidase B through optimization of multiple desired properties. Therefore, SAGE can generate molecules with desired properties by optimizing multiple properties simultaneously, indicating the importance of de novo design strategies in the future of drug discovery and development.

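Scientific contribution

The scientific contribution of this study is the development of the Scoring-Assisted Generative Exploration (SAGE) method, a novel computational approach that significantly improves de novo molecular design. SAGE uniquely integrates virtual synthesis simulation, the generation of complex bridged bicyclic rings, and multiple scoring models to comprehensively optimise drug-like properties. By efficiently generating, within a reasonable number of steps, molecules that meet a broad range of pharmacological criteria (including target specificity, synthetic accessibility, solubility, and metabolic stability), SAGE represents a substantial advance over conventional methods. Moreover, applying SAGE to discover dual inhibitors of acetylcholinesterase and monoamine oxidase B demonstrates its potential to streamline and improve the drug development process and highlights its capacity to create more effective, more precisely targeted therapies. This study underscores the critical and evolving role of de novo design strategies in reshaping future drug discovery and development, offering a promising route to the discovery of innovative therapies.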
Citations: 0
CineMol: a programmatically accessible direct-to-SVG 3D small molecule drawer
IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-05-23 | DOI: 10.1186/s13321-024-00851-y
David Meijer, Marnix H. Medema, Justin J. J. van der Hooft

Effective visualization of small molecules is paramount in conveying concepts and results in cheminformatics. Scalable vector graphics (SVG) are preferred for creating such visualizations, as SVGs can be easily altered in post-production and exported to other formats. A wide spectrum of software applications already exist that can visualize molecules, and customize these visualizations, in many ways. However, software packages that can output projected 3D models onto a 2D canvas directly as SVG, while being programmatically accessible from Python, are lacking. Here, we introduce CineMol, which can draw vectorized approximations of three-dimensional small molecule models in seconds, without triangulation or ray tracing, resulting in files of around 50–300 kilobytes per molecule model for compounds with up to 45 heavy atoms. The SVGs outputted by CineMol can be readily modified in popular vector graphics editing software applications. CineMol is written in Python and can be incorporated into any existing Python cheminformatics workflow, as it only depends on native Python libraries. CineMol also provides programmatic access to all its internal states, allowing for per-atom and per-bond-based customization. CineMol’s capacity to programmatically create molecular visualizations suitable for post-production offers researchers and scientists a powerful tool for enhancing the clarity and visual impact of their scientific presentations and publications in cheminformatics, metabolomics, and related scientific disciplines.

Scientific contribution

We introduce CineMol, a Python-based tool that provides a valuable solution for cheminformatics researchers by enabling the direct generation of high-quality approximations of two-dimensional SVG visualizations from three-dimensional small molecule models, all within a programmable Python framework. CineMol offers a unique combination of speed, efficiency, and accessibility, making it an indispensable tool for researchers in cheminformatics, especially when working with SVG visualizations.
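
The general idea of drawing a 3D molecule model as a 2D vector image without ray tracing can be illustrated with a small standard-library sketch. This is not CineMol's algorithm or API: the function name `render_svg`, the tuple layout, and the coordinates are all hypothetical, and the sketch simply assumes an orthographic projection with back-to-front (painter's algorithm) ordering of filled circles.

```python
from xml.sax.saxutils import quoteattr


def render_svg(atoms, size=200, scale=40.0):
    """Project (x, y, z, radius, color) spheres orthographically onto an SVG canvas.

    Hypothetical helper for illustration only; not part of CineMol.
    """
    half = size / 2
    # Painter's algorithm: draw the farthest atoms (smallest z) first,
    # so nearer atoms are painted over them.
    ordered = sorted(atoms, key=lambda atom: atom[2])
    shapes = []
    for x, y, z, radius, color in ordered:
        cx = half + x * scale
        cy = half - y * scale  # the SVG y axis points down
        shapes.append(
            f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="{radius * scale:.1f}" '
            f'fill={quoteattr(color)} stroke="black"/>'
        )
    body = "\n  ".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" '
        f'height="{size}" viewBox="0 0 {size} {size}">\n  {body}\n</svg>'
    )


# Made-up coordinates for a bent three-atom molecule (assumption, not real data).
water = [
    (0.00, 0.00, 0.0, 0.60, "red"),        # central atom
    (0.76, -0.59, 0.3, 0.40, "white"),     # atom in front of the plane
    (-0.76, -0.59, -0.3, 0.40, "white"),   # atom behind the plane
]
svg = render_svg(water)
```

Because the output is plain SVG markup, each circle remains an individually editable element in a vector graphics editor, which is the property the abstract highlights for post-production.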

Citations: 0
AiZynthFinder 4.0: developments based on learnings from 3 years of industrial application
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-05-23 DOI: 10.1186/s13321-024-00860-x
Lakshidaa Saigiridharan, Alan Kai Hassen, Helen Lai, Paula Torren-Peraire, Ola Engkvist, Samuel Genheden

We present an updated overview of the AiZynthFinder package for retrosynthesis planning. Since the first version was released in 2020, we have added a substantial number of new features based on user feedback. Feature enhancements include policies for filter reactions, support for any one-step retrosynthesis model, a scoring framework and several additional search algorithms. To exemplify the typical use-cases of the software and highlight some learnings, we perform a large-scale analysis on several hundred thousand target molecules from diverse sources. This analysis looks at for instance route shape, stock usage and exploitation of reaction space, and points out strengths and weaknesses of our retrosynthesis approach. The software is released as open-source for educational purposes as well as to provide a reference implementation of the core algorithms for synthesis prediction. We hope that releasing the software as open-source will further facilitate innovation in developing novel methods for synthetic route prediction. AiZynthFinder is a fast, robust and extensible open-source software and can be downloaded from https://github.com/MolecularAI/aizynthfinder.

Citations: 0
Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-05-22 DOI: 10.1186/s13321-024-00852-x
Hengwei Chen, Jürgen Bajorath

Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provoking prediction task. Therefore, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated “biochemical” language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications.

Citations: 0
MolPROP: Molecular Property prediction with multimodal language and graph fusion
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-05-22 DOI: 10.1186/s13321-024-00846-9
Zachary A. Rollins, Alan C. Cheng, Essam Metwally

Pretrained deep learning models self-supervised on large datasets of language, image, and graph representations are often fine-tuned on downstream tasks and have demonstrated remarkable adaptability in a variety of applications including chatbots, autonomous driving, and protein folding. Additional research aims to improve performance on downstream tasks by fusing high dimensional data representations across multiple modalities. In this work, we explore a novel fusion of a pretrained language model, ChemBERTa-2, with graph neural networks for the task of molecular property prediction. We benchmark the MolPROP suite of models on seven scaffold split MoleculeNet datasets and compare with state-of-the-art architectures. We find that (1) multimodal property prediction for small molecules can match or significantly outperform modern architectures on hydration free energy (FreeSolv), experimental water solubility (ESOL), lipophilicity (Lipo), and clinical toxicity tasks (ClinTox), (2) the MolPROP multimodal fusion is predominantly beneficial on regression tasks, (3) the ChemBERTa-2 masked language model pretraining task (MLM) outperformed multitask regression pretraining task (MTR) when fused with graph neural networks for multimodal property prediction, and (4) despite improvements from multimodal fusion on regression tasks MolPROP significantly underperforms on some classification tasks. MolPROP has been made available at https://github.com/merck/MolPROP.

Scientific contribution

This study explores a novel multimodal fusion of learned language and graph representations of small molecules for the supervised task of molecular property prediction. The MolPROP suite of models demonstrates that language and graph fusion can significantly outperform modern architectures on several regression prediction tasks, and it also provides an opportunity to explore alternative fusion strategies on classification tasks for multimodal molecular property prediction.
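
As a rough illustration of what fusing two modalities for a regression task can mean, the sketch below concatenates a language-model embedding with a graph embedding and applies a linear head. It is an assumption-laden toy, not the MolPROP architecture: the function `fuse_and_predict`, the vector sizes, and all numbers are hypothetical, and concatenation is only one of several possible fusion strategies.

```python
def fuse_and_predict(lang_emb, graph_emb, weights, bias):
    """Concatenate two embeddings and apply a linear regression head.

    Hypothetical helper for illustration only; not taken from MolPROP.
    """
    # Fusion by concatenation: the head sees both modalities side by side.
    fused = list(lang_emb) + list(graph_emb)
    if len(fused) != len(weights):
        raise ValueError("weight vector must match the fused embedding length")
    return sum(w * v for w, v in zip(weights, fused)) + bias


# Made-up embeddings and head parameters (assumptions, not trained values).
lang_emb = [0.5, -1.0]      # e.g. pooled output of a chemical language model
graph_emb = [2.0]           # e.g. pooled output of a graph neural network
weights = [1.0, 0.5, 0.25]  # one weight per fused feature
prediction = fuse_and_predict(lang_emb, graph_emb, weights, bias=0.1)
```

In a real model the head would be trained jointly with, or on top of, the two encoders; the sketch only shows how the fused feature vector is formed and consumed.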
Citations: 0