Solubility in a given organic solvent is a key parameter in the synthesis, analysis and chemical processing of an active pharmaceutical ingredient. In this work, we introduce a new tool for organic solvent recommendation that ranks candidate solvents and requires only the SMILES representations of the solvents and solute involved. We report on three additional innovations. First, a differential/relative approach to solubility prediction is employed, in which solubility is modeled using pairs of measurements with the same solute but different solvents. We show that framing solubility prediction as a relative task of ranking solvents improves over a corresponding absolute solubility model across a diverse set of selected features. Second, a novel semiempirical featurization based on extended tight-binding (xtb) is applied to both the solvent and the solute, thereby providing physically meaningful representations of the problem at hand. Third, we provide an open-source implementation of this practical and convenient tool for organic solvent recommendation. Taken together, this work could benefit those working in diverse areas, such as chemical engineering, materials science, or synthesis planning.
Solvmate – a hybrid physical/ML approach to solvent recommendation leveraging a rank-based problem framework. Jan Wollschläger and Floriane Montanari. Digital Discovery, volume 9, pages 1749-1760, published 2024-07-30. DOI: 10.1039/D4DD00138A. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00138a?page=search
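The relative framing described in the abstract can be illustrated with a minimal sketch. It assumes that measurements arrive as (solute, solvent, log-solubility) tuples and that a trained pairwise model is available as a predicate; the function names `make_pairs` and `rank_solvents` are hypothetical and are not taken from the Solvmate code base.

```python
from itertools import combinations


def make_pairs(measurements):
    """Turn absolute measurements into same-solute solvent pairs.

    measurements: list of (solute, solvent, log_solubility) tuples.
    Returns (solute, solvent_a, solvent_b, label) tuples, where label is 1
    if solvent_a dissolves the solute better than solvent_b, else 0.
    """
    by_solute = {}
    for solute, solvent, log_s in measurements:
        by_solute.setdefault(solute, []).append((solvent, log_s))

    pairs = []
    for solute, entries in by_solute.items():
        for (solv_a, s_a), (solv_b, s_b) in combinations(entries, 2):
            if s_a == s_b:
                continue  # ties carry no ranking information
            pairs.append((solute, solv_a, solv_b, 1 if s_a > s_b else 0))
    return pairs


def rank_solvents(solute, solvents, prefers):
    """Rank solvents by their number of pairwise wins.

    prefers(solute, a, b) stands in for a trained pairwise model and
    returns True when solvent a is predicted to be better than solvent b.
    """
    wins = {s: 0 for s in solvents}
    for a, b in combinations(solvents, 2):
        if prefers(solute, a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(solvents, key=lambda s: wins[s], reverse=True)
```

Counting pairwise wins is only one simple way to turn pairwise predictions into a ranking; the published tool may aggregate them differently.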
Kesler Isoko, Joan L. Cordiner, Zoltan Kis and Peyman Z. Moghadam
In the dynamic landscape of industrial evolution, Industry 4.0 (I4.0) presents opportunities to revolutionise products, processes, and production. It is now clear that enabling technologies of this paradigm, such as the industrial internet of things (IIoT), artificial intelligence (AI), and Digital Twins (DTs), have reached an adequate level of technical maturity in the decade that followed the inception of I4.0. These technologies enable more agile, modular, and efficient operations, which are desirable business outcomes, particularly for biomanufacturing companies seeking to deliver on a heterogeneous pipeline of treatments and drug product portfolios. Despite the widespread interest in the field, adoption of I4.0 technologies in the biomanufacturing industry remains low and is often limited to the big pharmaceutical manufacturers that can invest the capital in experimenting with new operating models, even though by now AI and IIoT have been democratised. The shift towards digitalisation is hampered by the lack of common standards and know-how describing how I4.0 technologies should come together. As such, for the first time, this work provides a pragmatic review of the field, key patterns, trends, and potential standard operating models for smart biopharmaceutical manufacturing. This analysis aims to describe how the Quality by Design framework can evolve to become more profitable under I4.0, the recent advancements in digital twin development, and how the expansion of the Process Analytical Technology (PAT) toolbox could lead to smart manufacturing. Ultimately, we aim to summarise guiding principles for executing a digital transformation strategy and outline operating models to encourage future adoption of I4.0 technologies in the biopharmaceutical industry.
Bioprocessing 4.0: a pragmatic review and future perspectives. Kesler Isoko, Joan L. Cordiner, Zoltan Kis and Peyman Z. Moghadam. Digital Discovery, volume 9, pages 1662-1681, published 2024-07-30. DOI: 10.1039/D4DD00127C. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00127c?page=search
Gökçe Geylan, Leonardo De Maria, Ola Engkvist, Florian David and Ulf Norinder
Being able to predict the cell permeability of cyclic peptides is essential for unlocking their potential as a drug modality for intracellular targets. With a wide range of studies of cell permeability but a limited number of data points, the reliability of machine learning (ML) models in predicting previously unexplored chemical spaces becomes a challenge. In this work, we systematically investigate the predictive capability of ML models from the perspective of their extrapolation to never-before-seen applicability domains, with a particular focus on the permeability task. Four predictive algorithms, namely Support-Vector Machine, Random Forest, LightGBM and XGBoost, were employed jointly with a conformal prediction framework to characterize and evaluate the applicability through uncertainty quantification. The efficiency and validity of the models' predictions with multiple calibration strategies were assessed in a set of experiments on several external datasets from different parts of chemical space. The experiments showed that predictors that generalize well to the applicability domain defined by the training data can fail to achieve similar performance on other parts of chemical space. Our study proposes an approach to overcome such limitations by improving the efficiency of models without sacrificing their validity. The trade-off between reliability and informativeness was balanced when the models were calibrated with a subset of the data from the new targeted domain. This study outlines an approach to enable the extrapolation of predictive power and restore the models' reliability via a recalibration strategy without the need for retraining the underlying model.
A methodology to correctly assess the applicability domain of cell membrane permeability predictors for cyclic peptides. Gökçe Geylan, Leonardo De Maria, Ola Engkvist, Florian David and Ulf Norinder. Digital Discovery, volume 9, pages 1761-1775, published 2024-07-30. DOI: 10.1039/D4DD00056K. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00056k?page=search
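The recalibration idea in the abstract can be sketched with a generic inductive conformal classifier. This is a minimal illustration under stated assumptions, not the authors' pipeline: the underlying model is represented only by its predicted class probabilities, the nonconformity score is assumed to be one minus the probability of the label, and all function names are hypothetical.

```python
def calibrate(cal_probs, cal_labels):
    """Nonconformity scores of a calibration set: 1 - p(true label)."""
    return [1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels)]


def p_value(cal_scores, score):
    """Share of calibration scores at least as nonconforming as `score`."""
    at_least = sum(1 for s in cal_scores if s >= score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(probs, cal_scores, epsilon):
    """Labels whose conformal p-value exceeds the significance level."""
    return {
        label
        for label, p in probs.items()
        if p_value(cal_scores, 1.0 - p) > epsilon
    }


def validity_and_efficiency(test_probs, test_labels, cal_scores, epsilon):
    """Validity: fraction of sets containing the true label.
    Efficiency: fraction of single-label (informative) sets."""
    sets = [prediction_set(p, cal_scores, epsilon) for p in test_probs]
    validity = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    efficiency = sum(len(s) == 1 for s in sets) / len(sets)
    return validity, efficiency
```

In this sketch, recalibrating for a new chemical domain means calling `calibrate` again on labelled examples from that domain and reusing the same frozen model probabilities, so no retraining is involved.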
The identification of protein-reactive electrophilic compounds is critical to the design of new covalent modifier drugs, screening for toxic compounds, and the exclusion of reactive compounds from high throughput screening. In this work, we employ traditional and graph machine learning (ML) algorithms to classify molecules as reactive towards proteins or nonreactive. For training data, we built a new dataset, ProteinReactiveDB, composed primarily of covalent and noncovalent inhibitors from the DrugBank, BindingDB, and CovalentInDB databases. To assess the transferability of the trained models, we created a custom set of covalent and noncovalent inhibitors constructed from the recent literature. Baseline models were developed using Morgan fingerprints as training inputs, but they performed poorly when applied to compounds outside the training set. We then trained various Graph Neural Networks (GNNs), with the best GNN model achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.80, precision of 0.89, and recall of 0.72. We also explore the interpretability of these GNNs using Gradient Activation Mapping (GradCAM), which shows the regions of the molecules that the GNNs deem most relevant when making a prediction. These maps indicated that our trained models can identify electrophilic functional groups in a molecule and classify molecules as protein-reactive based on their presence. We demonstrate the use of these models by comparing their performance against common chemical filters, identifying covalent modifiers in the ChEMBL database and generating a putative covalent inhibitor based on an established noncovalent inhibitor.
Graph neural networks for identifying protein-reactive compounds. Victor Hugo Cano Gil and Christopher N. Rowley. Digital Discovery, volume 9, pages 1776-1792, published 2024-07-25. DOI: 10.1039/D4DD00038B. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00038b?page=search
Claudio Avila, Adam West, Anna C. Vicini, William Waddington, Christopher Brearley, James Clarke and Andrew M. Derrick
Across the chemical sciences, synthesis planning is key to defining synthesis routes, starting from idea generation, combining literature searches and laboratory experimentation, and including scale-up considerations for large scale manufacturing. This iterative process, which relies heavily on information sharing, is crucial in pharmaceutical development, where drug candidates are transformed into commercially viable Active Pharmaceutical Ingredients (APIs), impacting the access to medicines for billions of people. In this work, we demonstrate that by capturing chemical pathway ideas digitally, at the point of conception, we can systematically merge these ideas with synthetic knowledge derived from predictive algorithms. This serves as a preliminary step for further route evaluation. To achieve this, we introduce a new method for storing, analysing, and displaying chemical information using graph databases and graph representations, illustrated with the commercial synthesis planning of the GLP-1 inhibitor Lotiglipron. Compared to traditional methods, graph databases naturally fit the substrate-arrow-product model traditionally used by chemists, offering a modern alternative to store and access chemical knowledge. This framework facilitates a universal chemistry approach, allowing data from many different sources and organisations to be shared and combined, and enabling new ways to optimise the complete route selection process.
Chemistry in a graph: modern insights into commercial organic synthesis planning. Claudio Avila, Adam West, Anna C. Vicini, William Waddington, Christopher Brearley, James Clarke and Andrew M. Derrick. Digital Discovery, volume 9, pages 1682-1694, published 2024-07-24. DOI: 10.1039/D4DD00120F. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00120f?page=search
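The substrate-arrow-product model mentioned in the abstract maps directly onto a directed graph, with compounds as nodes and reactions as edges. The sketch below uses a plain in-memory dictionary rather than a graph database, and the compound labels, edge properties and function names are hypothetical placeholders chosen for illustration.

```python
def add_reaction(graph, substrate, product, **properties):
    """Store one substrate -> product arrow with optional reaction properties."""
    graph.setdefault(substrate, []).append((product, properties))
    graph.setdefault(product, [])  # make sure the product exists as a node


def find_routes(graph, start, target, path=None):
    """Enumerate all cycle-free routes from a starting material to a target."""
    path = (path or []) + [start]
    if start == target:
        return [path]
    routes = []
    for product, _properties in graph.get(start, []):
        if product not in path:  # skip compounds already visited on this route
            routes.extend(find_routes(graph, product, target, path))
    return routes


# Hypothetical example: two alternative two-step routes to the same target.
graph = {}
add_reaction(graph, "starting material", "intermediate 1", source="chemist idea")
add_reaction(graph, "intermediate 1", "target", source="chemist idea")
add_reaction(graph, "starting material", "intermediate 2", source="predicted")
add_reaction(graph, "intermediate 2", "target", source="predicted")
routes = find_routes(graph, "starting material", "target")
```

Because ideas from chemists and predictions from algorithms are stored as edges in the same graph, routes that mix both sources are found by the same traversal.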
François Cornet, Bardi Benediktsson, Bjarke Hastrup, Mikkel N. Schmidt and Arghya Bhowmik
Organometallic complexes are ubiquitous in numerous technological applications, and in particular in homogeneous catalysis. Optimization of such complexes for specific applications is challenging due to the large variety of possible metal–ligand combinations and ligand–ligand interactions. Here we present OM-Diff, an inverse-design framework based on a diffusion generative model for in silico design of such complexes. Due to the importance of the spatial structure of a catalyst, the model operates on all-atom (including H) representations in 3D space. To handle the symmetries inherent to that data representation, OM-Diff combines an equivariant diffusion model with an equivariant property predictor. The diffusion model generates ligands conditioned on a specified metal-center, while the property predictor guides the generation towards novel complexes with desired properties. We demonstrate the potential of OM-Diff by designing optimized catalysts for a family of cross-coupling reactions, and validating a selection of novel proposed compounds with DFT calculations.
OM-Diff: inverse-design of organometallic catalysts with guided equivariant denoising diffusion. François Cornet, Bardi Benediktsson, Bjarke Hastrup, Mikkel N. Schmidt and Arghya Bhowmik. Digital Discovery, volume 9, pages 1793-1811, published 2024-07-23. DOI: 10.1039/D4DD00099D. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00099d?page=search
Stefan Hödl, Tal Kachman, Yoram Bachrach, Wilhelm T. S. Huck and William E. Robinson
Language models trained on molecular string representations have shown strong performance in predictive and generative tasks. However, practical applications require not only making accurate predictions, but also explainability – the ability to explain the reasons and rationale behind the predictions. In this work, we explore explainability for a chemical language model by adapting a transformer-specific and a model-agnostic input attribution technique. We fine-tune a pretrained model to predict aqueous solubility, compare training and architecture variants, and evaluate visualizations of attributed relevance. The model-agnostic SHAP technique provides sensible attributions, highlighting the positive influence of individual electronegative atoms, but does not explain the model in terms of functional groups or explain how the model represents molecular strings internally to make predictions. In contrast, the adapted transformer-specific explainability technique produces sparse attributions, which cannot be directly attributed to functional groups relevant to solubility. Instead, the attributions are more characteristic of how the model maps molecular strings to its latent space, which seems to represent features relevant to molecular similarity rather than functional groups. These findings provide insight into the representations underpinning chemical language models, which we propose may be leveraged for the design of informative chemical spaces for training more accurate, advanced and explainable models.
What can attribution methods show us about chemical language models? Stefan Hödl, Tal Kachman, Yoram Bachrach, Wilhelm T. S. Huck and William E. Robinson. Digital Discovery, volume 9, pages 1738-1748, published 2024-07-18. DOI: 10.1039/D4DD00084F. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00084f?page=search
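Model-agnostic attribution of the kind the abstract refers to rests on Shapley values, which can be computed exactly for very short token sequences. The sketch below is not the SHAP library and not the authors' model: `toy_model` is a hypothetical stand-in that simply counts O and N tokens, and a token is "removed" by dropping it from the input.

```python
from itertools import combinations
from math import factorial


def shapley_values(tokens, model):
    """Exact Shapley value of each token for a model that scores token lists.

    The cost grows exponentially with the number of tokens, so this is only
    practical for short inputs; SHAP-style methods approximate the same sum.
    """
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = model([tokens[j] for j in sorted(subset + (i,))])
                without_i = model([tokens[j] for j in subset])
                values[i] += weight * (with_i - without_i)
    return values


def toy_model(kept_tokens):
    """Hypothetical stand-in for a solubility model: counts O and N tokens."""
    return float(sum(1 for t in kept_tokens if t in ("O", "N")))


attributions = shapley_values(["C", "C", "O"], toy_model)
```

For this additive toy model, all of the attribution lands on the oxygen token, and the attributions sum to the difference between the full-input and empty-input scores.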
Accurate prediction of diverse chemical properties is crucial for advancing molecular design and materials discovery. Here we present a versatile approach that uses the intermediate information of a universal neural network potential as a general-purpose descriptor for chemical property prediction. Our method is based on the insight that a sophisticated neural network architecture trained as a universal force field learns transferable representations of atomic environments. We show that transfer learning with graph neural network potentials such as M3GNet and MACE achieves accuracy comparable to state-of-the-art methods for predicting NMR chemical shifts, using quantum machine learning as well as a standard classical regression model, despite the compactness of their descriptors. In particular, the MACE descriptor demonstrates the highest accuracy to date on the ¹³C NMR chemical shift benchmarks for drug molecules. This work provides an efficient way to accurately predict properties, potentially accelerating the discovery of new molecules and materials.
Universal neural network potentials as descriptors: towards scalable chemical property prediction using quantum and classical computers. Tomoya Shiota, Kenji Ishihara and Wataru Mizukami. Digital Discovery, volume 9, pages 1714-1728, published 2024-07-16. DOI: 10.1039/D4DD00098F. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00098f?page=search
Boris N. Slautin, Utkarsh Pratiush, Ilia N. Ivanov, Yongtao Liu, Rohit Pant, Xiaohang Zhang, Ichiro Takeuchi, Maxim A. Ziatdinov and Sergei V. Kalinin
The rapid growth of automated and autonomous instrumentation brings forth opportunities for the co-orchestration of multimodal tools that are equipped with multiple sequential detection methods or several characterization techniques to explore identical samples. This is exemplified by combinatorial libraries that can be explored in multiple locations via multiple tools simultaneously or downstream characterization in automated synthesis systems. In co-orchestration approaches, information gained in one modality should accelerate the discovery of other modalities. Correspondingly, an orchestrating agent should select the measurement modality based on the anticipated knowledge gain and measurement cost. Herein, we propose and implement a co-orchestration approach for conducting measurements with complex observables, such as spectra or images. The method relies on combining dimensionality reduction by variational autoencoders with representation learning for control over the latent space structure and integration into an iterative workflow via multi-task Gaussian Processes (GPs). This approach further allows for the native incorporation of the system's physics via a probabilistic model as a mean function of the GPs. We illustrate this method for different modes of piezoresponse force microscopy and micro-Raman spectroscopy on a combinatorial Sm-BiFeO₃ library. However, the proposed framework is general and can be extended to multiple measurement modalities and arbitrary dimensionality of the measured signals.
{"title":"Co-orchestration of multiple instruments to uncover structure–property relationships in combinatorial libraries†","authors":"Boris N. Slautin, Utkarsh Pratiush, Ilia N. Ivanov, Yongtao Liu, Rohit Pant, Xiaohang Zhang, Ichiro Takeuchi, Maxim A. Ziatdinov and Sergei V. Kalinin","doi":"10.1039/D4DD00109E","DOIUrl":"10.1039/D4DD00109E","url":null,"PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"8","pages":"1602-1611"},"PeriodicalIF":6.2,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00109e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuri Cho, Ruben Laplaza, Sergi Vela and Clémence Corminboeuf
Exploiting crystallographic data repositories for large-scale quantum chemical computations requires the rapid and accurate extraction of the molecular structure, charge and spin from the crystallographic information file. Here, we develop a general approach to assign the ground state spin of transition metal complexes, in complement to our previous efforts on determining metal oxidation states and bond order within the cell2mol software. Starting from a database of 31k transition metal complexes extracted from the Cambridge Structural Database with cell2mol, we construct the TM-GSspin dataset, which contains 2063 mononuclear first row transition metal complexes and their computed ground state spins. TM-GSspin is highly diverse in terms of metals, metal oxidation states, coordination geometries, and coordination sphere compositions. Based on TM-GSspin, we identify correlations between structural and electronic features of the complexes and their ground state spins to develop a rule-based spin state assignment model. Leveraging this knowledge, we construct interpretable descriptors and build a statistical model achieving 98% cross-validated accuracy in predicting the ground state spin across the board. Our approach provides a practical way to determine the ground state spin of transition metal complexes directly from crystal structures without additional computations, thus enabling the automated use of crystallographic data for large-scale computations involving transition metal complexes.
{"title":"Automated prediction of ground state spin for transition metal complexes†","authors":"Yuri Cho, Ruben Laplaza, Sergi Vela and Clémence Corminboeuf","doi":"10.1039/D4DD00093E","DOIUrl":"10.1039/D4DD00093E","url":null,"PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"8","pages":"1638-1647"},"PeriodicalIF":6.2,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00093e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141613716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}