How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.
{"title":"Every atom counts: predicting sites of reaction based on chemistry within two bonds†","authors":"Ching Ching Lam and Jonathan M. Goodman","doi":"10.1039/D4DD00092G","DOIUrl":"https://doi.org/10.1039/D4DD00092G","url":null,"abstract":"<p >How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1878-1888"},"PeriodicalIF":6.2,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00092g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon
A graphical abstract is available for this content
{"title":"Introduction to “Accelerate Conference 2022”","authors":"Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon","doi":"10.1039/D4DD90036G","DOIUrl":"https://doi.org/10.1039/D4DD90036G","url":null,"abstract":"<p >A graphical abstract is available for this content</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1659-1661"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90036g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An external chemical substance (which may be a medicinal drug or an environmental exposure chemical), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamic events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a compound bound to plasma proteins, affecting its distribution and duration of action. Compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. Hepatocyte intrinsic clearance, on the other hand, represents the liver's capacity to eliminate a compound through metabolism and is a critical determinant of its elimination half-life. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. The recent expansion of computational resources has enabled the development of various in silico predictive models as alternatives to animal experimentation. In this work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for predicting compounds' plasma protein fraction unbound values and hepatocyte intrinsic clearance. We built regression-based models with the human protein fraction unbound (fu) data set (n = 1812) and a classification-based model with the human hepatocyte intrinsic clearance (Clint) data set (n = 1241), both collected from the recently published ICE (Integrated Chemical Environment) database. We further analyzed the influence of plasma protein binding on hepatocyte intrinsic clearance by considering the compounds that have both target variable values. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, whereas for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We also predicted these pharmacokinetic parameters with the similarity-based read-across (RA) method. A Python-based tool for predicting the endpoints has been developed and is available at https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.
{"title":"Insights into pharmacokinetic properties for exposure chemicals: predictive modelling of human plasma fraction unbound (fu) and hepatocyte intrinsic clearance (Clint) data using machine learning†","authors":"Souvik Pore and Kunal Roy","doi":"10.1039/D4DD00082J","DOIUrl":"https://doi.org/10.1039/D4DD00082J","url":null,"abstract":"<p >An external chemical substance (which may be a medicinal drug or an exposome), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamics events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a chemical compound bound to plasma proteins, affecting the distribution and duration of action of the compound. The compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. On the other hand, hepatocyte intrinsic clearance represents the liver's capacity to eliminate a chemical compound through metabolism. It is a critical determinant of the elimination half-life of the chemical substance. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. Recently, the huge expansion of computational resources has led to the development of various <em>in silico</em> models to generate predictive models as an alternative to animal experimentation. In this research work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for the prediction of the compound's plasma protein fraction unbound values and hepatocyte intrinsic clearance. Here, we have developed regression-based models with the protein fraction unbound (<em>f</em><small><sub>u</sub></small>) human data set (<em>n</em> = 1812) and a classification-based model with the hepatocyte intrinsic clearance (Cl<small><sub>int</sub></small>) human data set (<em>n</em> = 1241) collected from the recently published ICE (Integrated Chemical Environment) database. We have further analyzed the influence of the plasma protein binding on the hepatocyte intrinsic clearance, by considering the compounds having both types of target variable values. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, but for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We have further made predictions of these important pharmacokinetic parameters through the similarity-based read-across (RA) method. 
A Python-based tool for predicting the endpoints has been developed and made available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1852-1877"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00082j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
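The following is a minimal, generic sketch of a descriptor-based QSAR workflow of the sort described above: an SVM regressor for a continuous endpoint (such as fu) and a random-forest classifier for a binarized endpoint (such as Clint), using RDKit descriptors and scikit-learn. The molecules, endpoint values, descriptors, and hyperparameters are all invented for illustration and are not taken from the paper's data sets.

# Illustrative QSAR-style sketch: SVM regression for a continuous endpoint (e.g. fu)
# and random-forest classification for a binary endpoint (e.g. high/low Clint).
# Molecules, endpoint values and hyperparameters are placeholders, not the paper's data.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC", "CC(C)Cc1ccc(C)cc1"]
X = np.array([descriptors(s) for s in smiles])
fu = np.array([0.95, 0.60, 0.80, 0.70, 0.10])   # made-up fraction-unbound values
clint_class = np.array([0, 0, 1, 1, 1])         # made-up high/low clearance labels

X_tr, X_te, y_tr, y_te = train_test_split(X, fu, test_size=0.4, random_state=1)
svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("fu predictions:", svr.predict(X_te))

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, clint_class)
print("Clint class probabilities:", rf.predict_proba(X)[:, 1])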
Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.
{"title":"Dismai-Bench: benchmarking and designing generative models using disordered materials and interfaces†","authors":"Adrian Xiao Bin Yong, Tianyu Su and Elif Ertekin","doi":"10.1039/D4DD00100A","DOIUrl":"https://doi.org/10.1039/D4DD00100A","url":null,"abstract":"<p >Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1889-1909"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00100a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo
Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and an iteratively growing data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) how strongly does this minimal subset depend on the sampling strategy managing the data pool? and (c) what is the cost associated with model calibration? Using case studies with different types of microstructure (composite vs. spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions with two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure, and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations can be effective with only a small number of property evaluations when combined with different active learning strategies. However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.
{"title":"Active learning for regression of structure–property mapping: the importance of sampling and representation†","authors":"Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo","doi":"10.1039/D4DD00073K","DOIUrl":"10.1039/D4DD00073K","url":null,"abstract":"<p >Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and iteratively increasing the data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) Is this minimal subset highly dependent on the sampling strategy managing the datapool? And (c) what is the cost associated with the model calibration? Using case studies with different types of microstructure (composite <em>vs.</em> spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions using two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations (graph-based descriptors and two-point correlation functions) can be effective with only a small quantity of property evaluations when combined with different active learning strategies. However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1997-2009"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00073k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao
The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in this automation, and improving their accuracy and predictive capability further reduces the cost of the optimization process. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a proxy model, has been devised. Adaptive strategies are implemented during model training, rather than in the acquisition function, to improve the model's predictive performance. The algorithm was used to optimize the continuous flow synthesis of pyridinylbenzamide, an important pharmaceutical intermediate, via the Buchwald–Hartwig reaction. It achieved a yield of 79.1% in under 30 rounds of iterative optimization, and a subsequent optimization with reduced prior data cut the number of experiments by 27.6%, significantly lowering experimental costs. The experimental results indicate that the reaction is kinetically controlled. This work offers ideas for optimizing similar reactions and new directions for automated optimization in continuous flow synthesis.
{"title":"Self-optimizing Bayesian for continuous flow synthesis process†","authors":"Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao","doi":"10.1039/D4DD00223G","DOIUrl":"10.1039/D4DD00223G","url":null,"abstract":"<p >The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in the automated optimization process. The increased accuracy and predictive capability of the algorithms will further mitigate the costs associated with optimization processes. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a proxy model, has been devised. Adaptive strategies are implemented during the model training process, rather than on the acquisition function, to elevate the modeling efficacy of the model. This algorithm facilitated optimizing the continuous flow synthesis process of pyridinylbenzamide, an important pharmaceutical intermediate, <em>via</em> the Buchwald–Hartwig reaction. Achieving a yield of 79.1% in under 30 rounds of iterative optimization, subsequent optimization with reduced prior data resulted in a successful 27.6% reduction in the number of experiments, significantly lowering experimental costs. Based on the experimental results, it can be concluded that the reaction is kinetically controlled. It provides ideas for optimizing similar reactions and new research ideas in continuous flow automated optimization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1958-1966"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00223g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan
Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design boost the fields of chemistry, drugs, and materials. Foremost in performing these tasks is how to describe and encode the molecular structure for the computer, i.e., to translate what the human eye sees into something machine-readable. In this work, a chemical structure information extraction method termed connectivity stepwise derivation (CSD) for generating the full step matrix (MSF) is described in detail. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MSF generation. To test the speed of MSF generation, over 54 000 molecules were collected, covering organic molecules, polymers, and MOF structures. The tests show that as the number of atoms in a molecule increases from 100 to 1000, the CSD method has a growing advantage over the classical Floyd–Warshall algorithm, with the speed-up rising from 28.34× to 289.95× in the Python environment and from 2.86× to 25.49× in the C++ environment. The proposed CSD method, that is, a systematic treatment of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials, and to facilitate the development of property modeling and molecular generation methods.
{"title":"Connectivity stepwise derivation (CSD) method: a generic chemical structure information extraction method for the full step matrix†","authors":"Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan","doi":"10.1039/D4DD00125G","DOIUrl":"10.1039/D4DD00125G","url":null,"abstract":"<p >Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design boost the fields of chemistry, drugs, and materials. Foremost in performing these advanced exploration tasks is how to describe/encode the molecular structure to the computer, <em>i.e.</em>, from what the human eye sees to what is machine-readable. In this effort, a chemical structure information extraction method termed connectivity step derivation (CSD) for generating the full step matrix (MS<small><sub>F</sub></small>) is exhaustively depicted. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MS<small><sub>F</sub></small> generation. For testing the run speed of the MS<small><sub>F</sub></small> generation, over 54 000 molecules have been collected covering organic molecules, polymers, and MOF structures. Test outcomes show that as the number of atoms in a molecule increases from 100 to 1000, the CSD method has an increasing advantage over the classical Floyd–Warshall algorithm, with the running speed rising from 28.34 to 289.95 times in the Python environment and from 2.86 to 25.49 times in the C++ environment. The proposed CSD method, that is, the elaboration of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials as well as facilitating the development of property modeling and molecular generation methods.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1842-1851"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00125g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli
Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful of these already exist within the world of computational materials discovery, but their dynamic capabilities are somewhat lacking. The PerQueue workflow manager addresses this need. By utilizing modular and dynamic building blocks to define a workflow explicitly before starting, PerQueue gives a better overview of the workflow while allowing full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery: high-throughput screening with density functional theory; active learning to train a machine-learning interatomic potential with molecular dynamics; reuse of this potential for kinetic Monte Carlo simulations of extended systems; and an active-learning-accelerated image segmentation procedure with a human in the loop.
{"title":"PerQueue: managing complex and dynamic workflows†","authors":"Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli","doi":"10.1039/D4DD00134F","DOIUrl":"10.1039/D4DD00134F","url":null,"abstract":"<p >Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful of these already exist within the world of computational materials discovery, but their dynamic capabilities are somewhat lacking. The PerQueue workflow manager is the answer to this need. By utilizing modular and dynamic building blocks to define a workflow explicitly before starting, PerQueue can give a better overview of the workflow while allowing full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery. These encapsulate high-throughput screening with Density Functional Theory, using active learning to train a Machine-Learning Interatomic Potential with Molecular Dynamics and reusing this potential for kinetic Monte Carlo simulations of extended systems. Lastly, it is used for an active-learning-accelerated image segmentation procedure with a human-in-the-loop.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1832-1841"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00134f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López
Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab, an automated electrochemical platform designed for molecular electrochemistry that uses open-source software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution-handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platform's capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms across 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. The high versatility, transferability, and ease of implementation of eLab promise the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.
{"title":"An automated electrochemistry platform for studying pH-dependent molecular electrocatalysis†","authors":"Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López","doi":"10.1039/D4DD00186A","DOIUrl":"10.1039/D4DD00186A","url":null,"abstract":"<p >Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab as an automated electrochemical platform designed for molecular electrochemistry that uses opensource software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platforms capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms with 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. The high versatility, transferability, and ease of implementation of eLab promises the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1812-1821"},"PeriodicalIF":6.2,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00186a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
{"title":"Extracting structured data from organic synthesis procedures using a fine-tuned large language model†","authors":"Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley","doi":"10.1039/D4DD00091A","DOIUrl":"10.1039/D4DD00091A","url":null,"abstract":"<p >The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), manual conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (<em>e.g.</em>, full compound, workups, or condition definitions) and 92.25% for individual data fields (<em>e.g.</em>, compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1822-1831"},"PeriodicalIF":6.2,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00091a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}