Youjia Li, Vishu Gupta, Muhammed Nur Talha Kilic, Kamal Choudhary, Daniel Wines, Wei-keng Liao, Alok Choudhary and Ankit Agrawal
Graph-centric learning has attracted significant interest in materials informatics. Accordingly, a family of graph-based machine learning models, primarily utilizing Graph Neural Networks (GNN), has been developed to provide accurate prediction of material properties. In recent years, Large Language Models (LLM) have revolutionized existing scientific workflows that process text representations, thanks to their exceptional ability to utilize extensive common knowledge for understanding semantics. With the help of automated text representation tools, fine-tuned LLMs have demonstrated competitive prediction accuracy as standalone predictors. In this paper, we propose to integrate the insights from GNNs and LLMs to enhance both prediction accuracy and model interpretability. Inspired by the feature-extraction-based transfer learning study for the GNN model, we introduce a novel framework that extracts and combines GNN and LLM embeddings to predict material properties. In this study, we employed ALIGNN as the GNN model and utilized BERT and MatBERT as the LLM model. We evaluated the proposed framework in cross-property scenarios using 7 properties. We find that the combined feature extraction approach using GNN and LLM outperforms the GNN-only approach in the majority of the cases with up to 25% improvement in accuracy. We conducted model explanation analysis through text erasure to interpret the model predictions by examining the contribution of different parts of the text representation.
Hybrid-LLM-GNN: integrating large language models and graph neural networks for enhanced materials property prediction. Digital Discovery, vol. 2, pp. 376-383, published 2024-12-18. DOI: 10.1039/D4DD00199K (https://doi.org/10.1039/D4DD00199K). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00199k?page=search
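The embedding-combination idea described in the Hybrid-LLM-GNN abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are made-up numbers rather than real ALIGNN or BERT/MatBERT outputs, the helper names are hypothetical, and a 1-nearest-neighbour lookup stands in for whatever downstream predictor the authors actually trained.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so neither embedding dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(gnn_vec, llm_vec):
    """Concatenate a (hypothetical) GNN embedding and LLM embedding into one feature vector."""
    return l2_normalize(gnn_vec) + l2_normalize(llm_vec)


def predict_1nn(train_features, train_targets, query):
    """Toy downstream predictor: return the target of the nearest training point."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(train_features)),
               key=lambda i: dist2(train_features[i], query))
    return train_targets[best]


# Made-up embeddings for three materials (not real model outputs).
gnn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
llm = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
targets = [1.2, 0.4, 2.0]  # made-up property values

features = [combine_embeddings(g, t) for g, t in zip(gnn, llm)]
query = combine_embeddings([0.9, 0.1], [0.5, 0.4, 0.1])
prediction = predict_1nn(features, targets, query)
print(len(features[0]), prediction)
```

Normalizing each embedding before concatenation is a design choice of this sketch, not something the abstract specifies; it keeps a high-dimensional or large-magnitude embedding from swamping the other in distance-based predictors.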
Qianxiang Ai, Fanwang Meng, Runzhong Wang, J. Cullen Klein, Alexander G. Godfrey and Connor W. Coley
Automated chemistry platforms hold the potential to enable large-scale organic synthesis campaigns, such as producing a library of compounds for biological evaluation. The efficiency of such platforms will depend on the schedule according to which the synthesis operations are executed. In this work, we study the scheduling problem for chemical library synthesis, where operations from interdependent synthetic routes are scheduled to minimize the makespan—the total duration of the synthesis campaign. We formalize this problem as a flexible job-shop scheduling problem with chemistry-relevant constraints in the form of a mixed integer linear program (MILP), which we then solve in order to design an optimized schedule. The scheduler's ability to produce valid, optimal schedules is demonstrated by 720 simulated scheduling instances for realistically accessible chemical libraries. Reductions in makespan up to 58%, with an average reduction of 20%, are observed compared to the baseline scheduling approach.
Schedule optimization for chemical library synthesis. Digital Discovery, vol. 2, pp. 486-499, published 2024-12-17. DOI: 10.1039/D4DD00327F. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11740188/pdf/
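To make the notion of makespan in the scheduling abstract above concrete, here is a small sketch of a greedy list scheduler for operations with precedence constraints and a choice of eligible machines. It is not the MILP formulation the authors solve: a simple greedy heuristic is swapped in, and the operation names, durations, and machine names are hypothetical.

```python
def greedy_schedule(operations, order):
    """Assign each operation to the eligible machine on which it can start earliest.

    operations: name -> (duration, list of predecessor names, list of eligible machines)
    order: operation names in an order that respects precedence (assumed valid).
    Returns (assignment, makespan), where assignment maps name -> (machine, start, finish).
    """
    machine_free = {}  # machine -> time at which it becomes idle
    finish = {}        # operation -> finish time
    assignment = {}
    for name in order:
        duration, preds, machines = operations[name]
        # An operation cannot start before all of its predecessors have finished.
        ready = max((finish[p] for p in preds), default=0)
        best = min(machines, key=lambda m: max(ready, machine_free.get(m, 0)))
        start = max(ready, machine_free.get(best, 0))
        finish[name] = start + duration
        machine_free[best] = finish[name]
        assignment[name] = (best, start, finish[name])
    return assignment, max(finish.values())


# Hypothetical two-branch route: two intermediates feed one coupling step.
operations = {
    "make_A": (2, [], ["reactor1", "reactor2"]),
    "make_B": (3, [], ["reactor1", "reactor2"]),
    "couple_AB": (4, ["make_A", "make_B"], ["reactor1"]),
    "purify": (1, ["couple_AB"], ["hplc"]),
}
order = ["make_A", "make_B", "couple_AB", "purify"]

assignment, makespan = greedy_schedule(operations, order)
serial_time = sum(duration for duration, _, _ in operations.values())
print(makespan, serial_time)
```

Running the two independent steps in parallel shortens the campaign relative to executing every operation back to back, which is the kind of gain an optimized schedule targets; an exact solver can additionally reorder operations, which this greedy pass does not attempt.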
Felix Katzenburg, Florian Boser, Felix R. Schäfer, Philipp M. Pflüger and Frank Glorius
The accelerated generation of reaction data through high-throughput experimentation and automation has the potential to boost organic synthesis. However, efforts to generate diverse reaction datasets or identify generally applicable reaction conditions are still hampered by limitations in reaction yield quantification. In this work, we present an automatable screening workflow that facilitates the analysis of reaction arrays with distinct products without relying on the isolation of product references for external calibrations. The workflow is enabled by a flexible liquid handler and parallel GC-MS and GC-Polyarc-FID analysis while we introduce pyGecko, an open-source Python library for processing GC raw data. pyGecko offers comprehensive analysis tools allowing for the determination of reaction outcomes of a 96-reaction array in under a minute. Our workflow's utility is showcased for the scope evaluation of a site-selective thiolation of halogenated heteroarenes and the comparison of four cross-coupling protocols for challenging C–N bond formations.
Calibration-free quantification and automated data analysis for high-throughput reaction screening. Digital Discovery, vol. 2, pp. 384-392, published 2024-12-17. DOI: 10.1039/D4DD00347K (https://doi.org/10.1039/D4DD00347K). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00347k?page=search
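The calibration-free idea in the abstract above can be sketched under one stated assumption: in a GC-Polyarc-FID setup, analytes are converted to methane before detection, so the detector response scales with the number of carbon atoms, and a product can be quantified against an internal standard without its own calibration curve. The functions below are hypothetical helpers written for illustration and are not the pyGecko API; the abstract does not state the exact formula the library uses.

```python
def moles_from_carbon_response(area_analyte, carbons_analyte,
                               area_standard, carbons_standard,
                               moles_standard):
    """Estimate analyte amount from peak areas, assuming a uniform per-carbon response."""
    per_carbon_analyte = area_analyte / carbons_analyte
    per_carbon_standard = area_standard / carbons_standard
    return moles_standard * per_carbon_analyte / per_carbon_standard


def percent_yield(moles_product, moles_limiting_reagent):
    """Yield relative to the limiting reagent, in percent."""
    return 100.0 * moles_product / moles_limiting_reagent


# Made-up numbers: a C12 product measured against a C10 internal standard.
product_mmol = moles_from_carbon_response(
    area_analyte=600.0, carbons_analyte=12,
    area_standard=500.0, carbons_standard=10,
    moles_standard=0.010,  # mmol of internal standard added
)
yield_pct = percent_yield(product_mmol, moles_limiting_reagent=0.020)
print(round(product_mmol, 4), round(yield_pct, 1))
```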
Deep learning has significantly accelerated drug discovery, with ‘chemical language’ processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry Systems [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many ‘bells and whistles’. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, three embedding strategies, across ten bioactivity datasets, for both classification and regression purposes. This ‘hitchhiker's guide’ not only underscores the importance of certain methodological decisions, but it also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
A hitchhiker's guide to deep chemical language processing for bioactivity prediction. Rıza Özçelik and Francesca Grisoni. Digital Discovery, vol. 2, pp. 316-325, published 2024-12-16. DOI: 10.1039/D4DD00311J. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11667676/pdf/
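As an illustration of how a molecular string becomes model input in chemical language processing, the sketch below tokenizes SMILES strings and one-hot encodes the tokens. The regular expression is a simplified assumption (bracket atoms, Cl, Br, two-digit ring closures, then single characters), and one-hot encoding is shown only as one simple option; the abstract does not list which three embedding strategies the authors compared.

```python
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the dataset to an integer index."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot(smiles, vocab):
    """Encode a SMILES string as a list of one-hot rows, one row per token."""
    size = len(vocab)
    return [[1 if vocab[tok] == j else 0 for j in range(size)]
            for tok in tokenize(smiles)]


dataset = ["CC(=O)Cl", "c1ccccc1[N+](=O)[O-]"]
vocab = build_vocab(dataset)
encoded = one_hot("CC(=O)Cl", vocab)
print(tokenize("CC(=O)Cl"))
```

Matching `Cl` and `Br` before the single-character fallback matters: a character-level split would read chlorine as a carbon followed by an unknown symbol.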
The field of photochemistry underpins broad scientific endeavors, encompasses diverse molecular substances, and incorporates descriptions of qualitative and quantitative properties, all of which together may be representative of many scientific disciplines. Yet finding absorption and fluorescence spectra along with companion values of the molar absorption coefficient (ε) and fluorescence quantum yield (Φf) for a given compound is an arduous task even with the most advanced search methods. To gauge whether chatbots could be used to reliably search the literature, the absorption and fluorescence spectra and quantitative parameters (ε and Φf) for 16 popular dyes and fluorophores were sought using ChatGPT 3.5, ChatGPT 4o, Microsoft Copilot, Google Gemini, Gemini Advanced, and Meta AI. In most cases, the values of ε and Φf returned by the chatbots agreed with known values from established resources, whereas the retrieval of spectra was only marginally successful. The chatbots were further challenged to find data for fictive compounds (e.g., rhodamine 7G). The results from each chatbot were categorized as follows: “fabricated” (provides numbers that do not exist in the context queried), “fooled” (mis-identifies the compound but does not return any data), “feigned” (acts as if the fictive compound is real but does not provide any data), or “faithful” (responds that the compound is not known or is not available). In summary, the present shortcomings should not cloud the view that chatbots – judiciously used – already provide a valuable resource for the challenging scientific task of finding granular data, and to a lesser degree, spectral traces for known compounds.
Acquisition of absorption and fluorescence spectral data using chatbots. Masahiko Taniguchi and Jonathan S. Lindsey. Digital Discovery, vol. 1, pp. 21-34, published 2024-12-16. DOI: 10.1039/D4DD00255E (https://doi.org/10.1039/D4DD00255E). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00255e?page=search
Xi Zhao, Shu-guang Cheng, Sen Yu, Jiming Zheng, Rui-Zhi Zhang and Meng Guo
High-entropy carbides (HECs) have garnered significant attention due to their unique mechanical properties. However, the design of novel HECs has been limited by extensive trial-and-error strategies, along with insufficient knowledge and computational capabilities. In this work, the intrinsic correlations between elements in the high-dimensional compositional space of HECs are investigated using high-throughput density functional theory calculations and two machine learning models, which enable us to predict the Young's modulus, hardness and wear resistance with only a chemical formula provided. Our models demonstrate a low root mean square error (11.5 GPa) and mean absolute error (9.0 GPa) in predicting the elastic modulus of HECs with arbitrary non-equimolar compositions. We further established a database of 566 370 HECs and identified 15 novel HECs with the best mechanical properties. Our models can rapidly explore the mechanical properties of HECs with descriptor–property correlation analysis, and hence provide an efficient method for accelerating the design of non-equimolar high-entropy materials with desired performance.
Predicting mechanical properties of non-equimolar high-entropy carbides using machine learning. Digital Discovery, vol. 1, pp. 264-274, published 2024-12-12. DOI: 10.1039/D4DD00243A (https://doi.org/10.1039/D4DD00243A). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00243a?page=search
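The two error metrics quoted in the high-entropy carbide abstract above, root mean square error and mean absolute error, can be computed as follows. The modulus values are made-up numbers in GPa used only to exercise the functions; they are not data from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: penalizes large deviations more strongly."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the deviations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up reference and predicted elastic moduli in GPa.
reference = [400.0, 450.0, 500.0]
predicted = [410.0, 440.0, 515.0]

print(round(rmse(reference, predicted), 2), round(mae(reference, predicted), 2))
```

RMSE is never smaller than MAE for the same residuals, which is consistent with the reported pair of 11.5 GPa and 9.0 GPa; a large gap between the two would point to a few poorly predicted compositions.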
Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. Performing thorough CV investigations across increasingly strict chemical/structural splitting criteria, local vs. global property prediction tasks, small vs. large datasets, and structure vs. compositional model architectures, some common threads are observed; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. For this we provide a general-purpose, featurization-agnostic toolkit, MatFold, to automate reproducible construction of these CV splits and encourage further community use in model benchmarking.
MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols. Matthew D. Witman and Peter Schindler. Digital Discovery, vol. 3, pp. 625-635, published 2024-12-09. DOI: 10.1039/D4DD00250D (https://doi.org/10.1039/D4DD00250D). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00250d?page=search
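One example of a chemically motivated split of the kind the MatFold abstract above describes is a leave-one-element-out split, in which every compound containing a given element is held out together. The sketch below is an illustration written from scratch and does not use or reproduce the MatFold API; the function names and the formula list are hypothetical.

```python
import re


def elements_in(formula):
    """Extract the set of element symbols from a simple chemical formula."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def leave_one_element_out(formulas):
    """Yield (held_out_element, train, test) with all compounds of one element in test."""
    all_elements = sorted(set().union(*(elements_in(f) for f in formulas)))
    for held_out in all_elements:
        test = [f for f in formulas if held_out in elements_in(f)]
        train = [f for f in formulas if held_out not in elements_in(f)]
        if train and test:  # skip degenerate splits
            yield held_out, train, test


formulas = ["NaCl", "KCl", "NaBr", "MgO", "CaO"]
splits = {element: (train, test)
          for element, train, test in leave_one_element_out(formulas)}
print(splits["Na"])
```

Compared with a random split, the model never sees the held-out element during training, so the test error reflects extrapolation to new chemistry rather than interpolation between near-duplicates.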
Daniel Willimetz, Andreas Erlebach, Christopher J. Heard and Lukáš Grajciar
Zeolites, such as MFI, are versatile microporous aluminosilicate materials that are widely used in catalysis and adsorption processes. The location and the character of the aluminium within the zeolite framework is one of the important determinants of performance in industrial applications, and is typically probed by 27Al NMR spectroscopy. However, interpretation of 27Al NMR spectra is challenging, as first-principles computational modelling struggles to achieve the timescales and model complexity needed to provide reliable assignments. In this study, we deploy advanced machine learning-based methods to help bridge the time and model complexity scale by first utilizing neural network interatomic potentials to achieve significant speed-up in structure sampling compared to traditional density functional theory (DFT) approaches, and second by training regression models to cost-effectively predict the 27Al chemical shifts. This allows us, for the H-MFI zeolite as a use case, to comprehensively explore the effect of various conditions relevant to catalysis, including water loading, temperature, and the aluminium concentration, on the 27Al chemical shifts. We demonstrate that both water content and temperature significantly affect the chemical shift and do so in a non-trivial way that is highly T-site dependent, highlighting a need for adoption of realistic, case-specific models. We also observe that our approach is able to achieve close to quantitative agreement with relevant experimental data for such a complex zeolite as MFI, allowing for the tentative assignment of the experimental NMR peaks to specific T-sites. These findings provide a testament to the capabilities of machine learning approaches in providing reliable predictions of important spectroscopic observables for complex industrially relevant materials under realistic conditions.
27Al NMR chemical shifts in zeolite MFI via machine learning acceleration of structure sampling and shift prediction. Digital Discovery, vol. 1, pp. 275-288, published 2024-12-09. DOI: 10.1039/D4DD00306C (https://doi.org/10.1039/D4DD00306C). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00306c?page=search
Daniel Schwalbe-Koda, Nitish Govindarajan and Joel B. Varley
Sampling high-coverage configurations and predicting adsorbate–adsorbate interactions on surfaces are highly relevant to understanding realistic interfaces in heterogeneous catalysis. However, the combinatorial explosion in the number of adsorbate configurations among diverse site environments presents a considerable challenge in accurately estimating these interactions. Here, we propose a strategy combining high-throughput simulation pipelines and a neural network-based model with the MACE architecture to increase sampling efficiency and speed. By training the models on unrelaxed structures and energies, which can be quickly obtained from single-point DFT calculations, we achieve excellent performance for both in-domain and out-of-domain predictions, including generalization to different facets, coverage regimes and low-energy configurations. From this systematic understanding of model robustness, we exhaustively sample the configuration phase space of catalytic systems without active learning. In particular, by predicting binding energies for over 14 million structures with the neural network model and the simulated annealing method, we predict coverage-dependent adsorption energies for CO adsorption on six Cu facets (111, 100, 211, 331, 410 and 711) and the co-adsorption of CO and CHOH on Rh(111). When validated by targeted post-sampling relaxations, our results for CO on Cu correctly reproduce experimental interaction energies reported in the literature, and provide atomistic insights into the site occupancy of steps and terraces for the six facets at all coverage regimes. Additionally, the arrangement of CO on the Rh(111) surface is demonstrated to substantially impact the activation barriers for the CHOH bond scission, illustrating the importance of comprehensive sampling for reaction kinetics. Our findings demonstrate that simplified data generation routines, combined with an evaluation of how well neural networks generalize, can be deployed at scale to understand lateral interactions on surfaces, paving the way towards realistic modeling of heterogeneous catalytic processes.
{"title":"Comprehensive sampling of coverage effects in catalysis by leveraging generalization in neural network models†","authors":"Daniel Schwalbe-Koda, Nitish Govindarajan and Joel B. Varley","doi":"10.1039/D4DD00328D","DOIUrl":"https://doi.org/10.1039/D4DD00328D","url":null,"abstract":"<p >Sampling high-coverage configurations and predicting adsorbate–adsorbate interactions on surfaces are highly relevant to understand realistic interfaces in heterogeneous catalysis. However, the combinatorial explosion in the number of adsorbate configurations among diverse site environments presents a considerable challenge in accurately estimating these interactions. Here, we propose a strategy combining high-throughput simulation pipelines and a neural network-based model with the MACE architecture to increase sampling efficiency and speed. By training the models on unrelaxed structures and energies, which can be quickly obtained from single-point DFT calculations, we achieve excellent performance for both in-domain and out-of-domain predictions, including generalization to different facets, coverage regimes and low-energy configurations. From this systematic understanding of model robustness, we exhaustively sample the configuration phase space of catalytic systems without active learning. In particular, by predicting binding energies for over 14 million structures within the neural network model and the simulated annealing method, we predict coverage-dependent adsorption energies for CO adsorption on six Cu facets (111, 100, 211, 331, 410 and 711) and the co-adsorption of CO and CHOH on Rh(111). When validated by targeted post-sampling relaxations, our results for CO on Cu correctly reproduce experimental interaction energies reported in the literature, and provide atomistic insights on the site occupancy of steps and terraces for the six facets at all coverage regimes. 
Additionally, the arrangement of CO on the Rh(111) surface is demonstrated to substantially impact the activation barriers for the CHOH bond scission, illustrating the importance of comprehensive sampling on reaction kinetics. Our findings demonstrate that simplified data generation routines and evaluating generalization of neural networks can be deployed at scale to understand lateral interactions on surfaces, paving the way towards realistic modeling of heterogeneous catalytic processes.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 234-251"},"PeriodicalIF":6.2,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00328d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Utkarsh Pratiush, Hiroshi Funakubo, Rama Vasudevan, Sergei V. Kalinin and Yongtao Liu
Microscopy plays a foundational role in materials science, biology, and nanotechnology, offering high-resolution imaging and detailed insights into properties at the nanoscale and atomic level. Microscopy automation via active machine learning approaches is a transformative advancement, offering increased efficiency, reproducibility, and the capability to perform complex experiments. Our previous work on autonomous experimentation with scanning probe microscopy (SPM) demonstrated an active learning framework using deep kernel learning (DKL) for structure–property relationship discovery. Here we extend this approach to a multi-stage decision process that incorporates prior knowledge and human interest into DKL-based workflows, and we operationalize these workflows in SPM. By integrating expected rewards from structure libraries or spectroscopic features, we enhance the exploration efficiency of autonomous microscopy, demonstrating more targeted exploration. These methods can be seamlessly applied to other microscopy and imaging techniques. Furthermore, the concept can be adapted for general Bayesian optimization in materials discovery across a broad range of autonomous experimental fields.
{"title":"Scientific exploration with expert knowledge (SEEK) in autonomous scanning probe microscopy with active learning†","authors":"Utkarsh Pratiush, Hiroshi Funakubo, Rama Vasudevan, Sergei V. Kalinin and Yongtao Liu","doi":"10.1039/D4DD00277F","DOIUrl":"https://doi.org/10.1039/D4DD00277F","url":null,"abstract":"<p >Microscopy plays a foundational role in materials science, biology, and nanotechnology, offering high-resolution imaging and detailed insights into properties at the nanoscale and atomic level. Microscopy automation <em>via</em> active machine learning approaches is a transformative advancement, offering increased efficiency, reproducibility, and the capability to perform complex experiments. Our previous work on autonomous experimentation with scanning probe microscopy (SPM) demonstrated an active learning framework using deep kernel learning (DKL) for structure–property relationship discovery. Here we extend this approach to a multi-stage decision process to incorporate prior knowledge and human interest into DKL-based workflows, we operationalize these workflows in SPM. By integrating expected rewards from structure libraries or spectroscopic features, we enhanced the exploration efficiency of autonomous microscopy, demonstrating more efficient and targeted exploration in autonomous microscopy. These methods can be seamlessly applied to other microscopy and imaging techniques. 
Furthermore, the concept can be adapted for general Bayesian optimization in material discovery across a broad range of autonomous experimental fields.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 252-263"},"PeriodicalIF":6.2,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}