Ahmet Buğra Ortaakarsu, Michel Hosny, Mansour Sobeh and Mohamed A. O. Abdelfattah
Non-alcoholic fatty liver disease (NAFLD) is a prevalent metabolic disorder with limited therapeutic options. Thyroid receptor β (THR-β) agonists have shown promise for controlling NAFLD via improving hepatic lipid metabolism. This study utilized different in silico tools to screen 47 199 natural compounds from the ZINC15 database to identify potential THR-β agonists. Molecular docking, molecular dynamics simulations, and advanced analyses such as PCA, TICA-FES, and MSM revealed that 4-O-caffeoylquinic acid (compound 2) and dihydroxydehydrodiconiferyl alcohol (compound 18) are the most promising hits. Both demonstrated high binding affinity and stable agonist interactions with key THR-β residues such as Arg316 and Arg320, which stabilize the ligand binding pocket and support the agonist potential, comparable to the reference agonists resmetirom and {3,5-dichloro-4-[4-hydroxy-3-(propan-2-yl)phenoxy]phenyl}acetic acid. Long-term MD simulations confirmed their stability, and MM/GBSA calculations supported robust thermodynamic profiles. Moreover, the two hits displayed superior selectivity for THR-β over THR-α and favorable pharmacokinetic profiles with minimal toxicity alerts. These findings support compounds 2 and 18 as strong candidates for NAFLD therapy, warranting further experimental validation.
"Database mining of ZINC15 natural compounds reveals potential thyroid receptor β agonists for NAFLD management: an in silico study", Digital Discovery, 2025, issue 12, pp. 3635–3651. DOI: 10.1039/D5DD00146C (published 2025-11-03).
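The MSM step mentioned in the abstract above reduces, at its core, to counting transitions between discretized conformational states at a chosen lag time. A minimal pure-Python sketch (state labels and lag are illustrative, not taken from the paper):

```python
def transition_matrix(dtraj, n_states, lag=1):
    """Estimate a row-stochastic transition probability matrix from a
    discretized trajectory (list of integer state labels) at lag `lag`."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a][b] += 1
    T = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all-zero.
        T.append([c / total if total else 0.0 for c in row])
    return T
```

For example, `transition_matrix([0, 0, 1, 1, 0, 1, 1, 1, 0, 0], 2)` yields row-normalized probabilities such as `T[0] == [0.5, 0.5]`. Production MSM tools additionally symmetrize counts and analyze eigenvalues for implied timescales.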
Panagiotis Krokidas, Vassilis Gkatsis, John Theocharis and George Giannakopoulos
Machine learning (ML) has the potential to accelerate the discovery of high-performance materials by learning complex structure–property relationships and prioritizing candidates for costly experiments or simulations. However, ML efficiency is often offset by the need for large, high-quality training datasets, motivating strategies that intelligently select the most informative samples. Here, we formulate the search for top-performing functionalized nanoporous materials (metal–organic and covalent–organic frameworks) as a global optimization problem and apply Bayesian Optimization (BO) to identify regions of interest and rank candidates with minimal evaluations. We highlight the importance of a proper and efficient initialization scheme for the BO process, and we demonstrate how BO-acquired samples can also be used to train an XGBoost regression model that further enriches the efficient mapping of the high-performing region of the design space. Across multiple literature-derived adsorption and diffusion datasets containing thousands of structures, our BO framework identifies 2× to 3× more materials within a top-100 or top-10 ranking list than random-sampling-based ML pipelines, and it achieves significantly higher ranking quality. Moreover, the surrogate enrichment strategy further boosts top-N recovery while maintaining high ranking fidelity. By shifting the evaluation focus from average predictive metrics (e.g., R2, MSE) to task-specific criteria (e.g., recall@N and nDCG), our approach offers a practical, data-efficient, and computationally accessible route to guide experimental and computational campaigns toward the most promising materials.
"Navigating materials design spaces with efficient Bayesian optimization: a case study in functionalized nanoporous materials", Digital Discovery, 2025, issue 12, pp. 3753–3763. DOI: 10.1039/D5DD00237K (published 2025-11-03).
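The task-specific criteria the authors advocate, recall@N and nDCG, take only a few lines to compute; a self-contained sketch (identifiers are illustrative):

```python
import math

def recall_at_n(ranked_ids, true_top_ids, n):
    """Fraction of the true top set recovered within the first n ranked items."""
    return len(set(ranked_ids[:n]) & set(true_top_ids)) / len(true_top_ids)

def ndcg_at_n(ranked_relevance, n):
    """Normalized discounted cumulative gain over the first n positions;
    `ranked_relevance` holds graded relevance in predicted rank order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:n]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:n]))
    return dcg / idcg if idcg else 0.0
```

A perfect ranking gives nDCG of 1.0; recall@N directly measures how many of the true top-N materials the pipeline surfaces, which is what matters when only N candidates can be evaluated downstream.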
Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker and Christoph Stampfer
The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often struggle to identify low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator that produces realistic microscopy images from unlabeled data. This results in a model that can quickly adapt to new materials with as few as 5 to 10 images. An uncertainty estimation model then classifies the predictions based on optical contrast. We evaluate our method on eight datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.
"MaskTerial: a foundation model for automated 2D material flake detection", Digital Discovery, 2025, issue 12, pp. 3744–3752. DOI: 10.1039/D5DD00156K (published 2025-11-03).
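The optical-contrast classification step can be illustrated with a toy Weber-contrast calculation and nearest-reference assignment. The reference values below are invented for illustration; MaskTerial's actual classifier additionally models uncertainty:

```python
def optical_contrast(flake_rgb, substrate_rgb):
    """Per-channel Weber contrast of a flake region against bare substrate:
    (I_flake - I_substrate) / I_substrate."""
    return tuple((f - s) / s for f, s in zip(flake_rgb, substrate_rgb))

def classify_by_contrast(contrast, references):
    """Assign the flake to the reference class whose contrast vector is
    nearest in Euclidean distance; `references` maps class -> contrast tuple."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(references, key=lambda cls: dist(contrast, references[cls]))
```

For instance, a flake measured at RGB (110, 120, 130) on a (100, 100, 100) substrate has contrast (0.1, 0.2, 0.3) and would be assigned to whichever layer class sits closest in contrast space.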
Alice Gauthier, Laure Vancauwenberghe, Jean-Charles Cousty, Cyril Matthey-Doret, Robin Franken, Sabine Maennel, Pascal Miéville and Oksana Riba Grognuz
The growing demand for reproducible, high-throughput chemical experimentation calls for scalable digital infrastructures that support automation, traceability, and AI-readiness. A dedicated research data infrastructure (RDI) developed within Swiss Cat+ is presented, integrating automated synthesis, multi-stage analytics, and semantic modeling. It captures each experimental step in a structured, machine-interpretable format, forming a scalable and interoperable data backbone. By systematically recording both successful and failed experiments, the RDI ensures data completeness, strengthens traceability, and enables the creation of bias-resilient datasets essential for robust AI model development. Built on Kubernetes and Argo Workflows and aligned with FAIR principles, the RDI transforms experimental metadata into validated Resource Description Framework (RDF) graphs using an ontology-driven semantic model. These graphs are accessible through a web interface and SPARQL endpoint, facilitating integration with downstream AI and analysis pipelines. Key features include a modular RDF converter and ‘Matryoshka files’, which encapsulate complete experiments with raw data and metadata in a portable, standardized ZIP format. This approach supports scalable querying and sets the stage for standardized data sharing and autonomous experimentation.
"A FAIR research data infrastructure for high-throughput digital chemistry", Digital Discovery, 2025, issue 12, pp. 3502–3514. DOI: 10.1039/D5DD00297D (published 2025-11-03).
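The 'Matryoshka file' idea, raw data plus metadata in one portable ZIP, can be sketched with the standard library. The internal layout below ("metadata.json", "raw/") is an assumption for illustration, not Swiss Cat+'s actual schema:

```python
import io
import json
import zipfile

def pack_experiment(metadata: dict, raw_files: dict) -> bytes:
    """Bundle experiment metadata (as JSON) and raw data files into a
    single portable ZIP archive, returned as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
        for name, data in raw_files.items():
            zf.writestr(f"raw/{name}", data)  # data: bytes or str
    return buf.getvalue()

def unpack_metadata(blob: bytes) -> dict:
    """Read the metadata record back out of a packed experiment."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return json.loads(zf.read("metadata.json"))
```

Because failed runs are recorded the same way as successes, a status field in the metadata is enough to keep both in the dataset without special-casing.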
Correction for “Beyond training data: how elemental features enhance ML-based formation energy predictions” by Hamed Mahdavi et al., Digital Discovery, 2025, 4, 2972–2982, https://doi.org/10.1039/D5DD00182J.
"Correction: Beyond training data: how elemental features enhance ML-based formation energy predictions", Hamed Mahdavi, Vasant Honavar and Dane Morgan, Digital Discovery, 2025, issue 12, p. 3828. DOI: 10.1039/D5DD90047F (published 2025-10-31).
Metal–organic frameworks (MOFs) exhibit immense structural diversity and hold promise for applications ranging from gas storage and separation to energy storage and conversion. However, structural flexibility makes accurate and scalable property prediction difficult. While machine learning potentials (MLPs) offer a compelling balance between accuracy and efficiency, most existing models are system-specific and lack transferability across different MOFs. In this work, we introduce FFLAME – Fragment-to-Framework Learning Approach for MOF Potentials, a fragment-centric strategy for training transferable MLPs. By decomposing MOFs into their constituent metal clusters and organic linkers, FFLAME enables efficient reuse of chemical environments and significantly reduces the need for full-framework training data. We demonstrate that fragment-informed training improves model generalizability, particularly in data-scarce regimes, and accelerates convergence during fine-tuning. FFLAME achieves near-target accuracy on unseen MOFs with minimal additional training. These results establish a robust and data-efficient pathway toward general-purpose MLPs for the simulation of diverse framework materials.
"FFLAME: a fragment-to-framework learning approach for MOF potentials", Xiaoqi Zhang, Yutao Li, Xin Jin and Berend Smit, Digital Discovery, 2025, issue 12, pp. 3466–3477. DOI: 10.1039/D5DD00321K (published 2025-10-30).
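The fragment-to-framework idea can be caricatured as composing a framework-level estimate from reusable fragment contributions. Fragment names and energies below are invented; the actual FFLAME model trains machine-learned potentials on atomic environments rather than summing per-fragment scalars:

```python
def framework_estimate(fragments, fragment_energies):
    """Toy composition step: approximate a framework-level property as a
    sum of per-fragment contributions, reusing any fragment seen before.
    Raises KeyError for fragments that still need training data."""
    missing = [f for f in fragments if f not in fragment_energies]
    if missing:
        raise KeyError(f"fragments need training data: {missing}")
    return sum(fragment_energies[f] for f in fragments)
```

The payoff mirrors the paper's argument: once a metal cluster or linker is in the library, every framework containing it reuses that data, so only genuinely new fragments require additional reference calculations.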
Ricardo Montoya-Gonzalez, Rosa de Guadalupe González-Huerta, Martha Leticia Hernández-Pichardo and Subha R. Das
The integration of machine learning (ML) into materials science has the potential to accelerate materials discovery and property optimization. However, the reliability of ML models depends heavily on the consistency and reproducibility of experimental data. In this study, we present a methodology that combines automated, remotely programmed synthesis protocols with ML to enable data-driven materials discovery. Experiments were programmed and conducted remotely through robotic syntheses at cloud laboratories, using different liquid handlers and spectrometers across two independent facilities (Emerald Cloud Lab, Austin, TX and Carnegie Mellon University Automated Science Lab, Pittsburgh, PA). This multi-instrument approach ensured precise control over reaction parameters, eliminated both operator- and instrument-specific variability, and enabled generation of high-quality datasets for ML training. From only 40 training samples, our approach predicts whether specific synthesis parameters will lead to successful formation of copper nanoclusters (CuNCs), with interpretable models providing mechanistic insights through SHAP analysis. Our workflow demonstrates how remotely accessed cloud laboratory infrastructure coupled with ML can transform traditionally manual processes into autonomous, predictive systems. This multi-instrument validation demonstrates reproducibility critical for reliable ML-driven materials discovery and for advancing automated materials synthesis beyond single-laboratory demonstrations.
"Cross-laboratory validation of machine learning models for copper nanocluster synthesis using cloud-based automated platforms", Digital Discovery, 2025, issue 12, pp. 3683–3692. DOI: 10.1039/D5DD00335K (published 2025-10-30).
Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, e.g., through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: enzyme commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning via LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.
"Leveraging large language models for enzymatic reaction prediction and characterization", Lorenzo Di Fruscia and Jana M. Weber, Digital Discovery, 2025, issue 12, pp. 3588–3609. DOI: 10.1039/D5DD00187K (published 2025-10-30).
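Hierarchical EC evaluation, i.e. scoring how many leading levels of a predicted four-level EC number agree with the ground truth, can be sketched in a few lines (function names are illustrative, not the paper's):

```python
def ec_match_depth(pred: str, true: str) -> int:
    """Number of leading EC levels (out of 4) on which the prediction and
    ground truth agree, e.g. '1.1.1.1' vs '1.1.3.2' -> 2."""
    depth = 0
    for p, t in zip(pred.split("."), true.split(".")):
        if p != t:
            break
        depth += 1
    return depth

def hierarchical_accuracy(pairs, level: int) -> float:
    """Fraction of (pred, true) pairs correct down to the given EC level."""
    return sum(ec_match_depth(p, t) >= level for p, t in pairs) / len(pairs)
```

Reporting accuracy per level separates the easy coarse classes (level 1, the main enzyme class) from the hard fine-grained ones (level 4, the serial number), which is where the abstract notes LLMs still struggle.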
Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das and Arnaud Demortière
This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
"Unsupervised multi-clustering and decision-making strategies for 4D-STEM orientation mapping", Digital Discovery, 2025, issue 12, pp. 3610–3622. DOI: 10.1039/D5DD00071H (published 2025-10-30).
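The NMF core of such a pipeline is the classic Lee–Seung multiplicative update, which keeps both factors non-negative. A dependency-free sketch on a toy matrix follows; the paper applies this at scale to 4D-STEM diffraction data and selects k with IQA metrics rather than raw reconstruction error:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix V into W (m x k) @ H (k x n)
    using Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), with the freshly updated H
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

def recon_error(V, W, H):
    """Squared Frobenius reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
```

Sweeping k and watching the drop in reconstruction error (the "K-component loss" idea) gives a first estimate of the number of distinct orientations before IQA-based refinement.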
Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail
Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic-level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard-corrected density functional theory (DFT+U) in a numerical atom-centred orbital framework has been shown to address this challenge, but is susceptible to numerical instability when simulating common transition metal oxides (TMOs, e.g., TiO2) and rare-earth metal oxides (REOs, e.g., CeO2), necessitating the development of advanced DFT+U parameterisation strategies. In this work, the numerical instabilities of DFT+U are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO2 using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO2, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard U values and projectors is presented. The method's transferability is shown for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials like LiCo1−xMgxO2−x.
The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+U parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.
{"title":"Machine learning generalised DFT+U projectors in a numerical atom-centred orbital framework","authors":"Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail","doi":"10.1039/D5DD00292C","DOIUrl":"https://doi.org/10.1039/D5DD00292C","url":null,"abstract":"<p >Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard corrected density functional theory (DFT+<em>U</em>) in a numerical atom-centred orbital framework has been shown to address this challenge but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), <em>e.g.</em>, TiO<small><sub>2</sub></small> and rare-earth metal oxides (REOs), <em>e.g.</em>, CeO<small><sub>2</sub></small>, necessitating the development of advanced DFT+<em>U</em> parameterisation strategies. In this work, the numerical instabilities of DFT+<em>U</em> are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO<small><sub>2</sub></small> using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO<small><sub>2</sub></small>, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. 
Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard <em>U</em> values and projectors is presented. The method's transferability is demonstrated for 10 prototypical TMOs and REOs, with accuracy for unseen materials that extends to complex battery cathode materials such as LiCo<small><sub>1−<em>x</em></sub></small>Mg<small><sub><em>x</em></sub></small>O<small><sub>2−<em>x</em></sub></small>. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+<em>U</em> parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3701-3727"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00292c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
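The projector-optimisation idea in the abstract above — tune a Hubbard-projector parameter so that DFT+U orbital occupancies reproduce a hybrid-DFT reference — can be sketched in miniature. Everything below is a hypothetical stand-in, not the paper's code: `occupancy_model` replaces an actual DFT+U calculation, the target occupancy of 1.8 electrons and the radius window are invented, and a 1-D golden-section search substitutes for the multi-parameter Bayesian optimisation with SR-defined cost used in the work.

```python
# Toy sketch of the projector-fitting loop (all numbers illustrative).

def occupancy_model(radius):
    """Stand-in for a DFT+U run: maps a projector radius (bohr)
    to a predicted Ti 3d occupancy (electrons)."""
    return 1.0 + 0.8 / (1.0 + (radius - 2.1) ** 2)

def cost(radius, target=1.8):
    """Squared mismatch between predicted occupancy and a
    hybrid-DFT reference occupancy (the 'target')."""
    return (occupancy_model(radius) - target) ** 2

def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a unimodal 1-D function f on [lo, hi]."""
    phi = (5 ** 0.5 - 1) / 2  # inverse golden ratio ~0.618
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):       # minimum lies in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                 # minimum lies in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

best = golden_section(cost, 1.0, 4.0)  # converges near radius = 2.1
```

In the actual workflow each cost evaluation is an expensive electronic-structure calculation, which is why the paper uses Bayesian optimisation (a surrogate model plus acquisition function) rather than a direct line search like this one.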