Panagiotis Krokidas, Vassilis Gkatsis, John Theocharis and George Giannakopoulos
Machine learning (ML) has the potential to accelerate the discovery of high-performance materials by learning complex structure–property relationships and prioritizing candidates for costly experiments or simulations. However, ML efficiency is often offset by the need for large, high-quality training datasets, motivating strategies that intelligently select the most informative samples. Here, we formulate the search for top-performing functionalized nanoporous materials (metal–organic and covalent–organic frameworks) as a global optimization problem and apply Bayesian Optimization (BO) to identify regions of interest and rank candidates with minimal evaluations. We highlight the importance of a proper and efficient initialization scheme for the BO process, and we demonstrate how BO-acquired samples can also be used to train an XGBoost regression model that further enriches the mapping of the high-performing region of the design space. Across multiple literature-derived adsorption and diffusion datasets containing thousands of structures, our BO framework identifies 2× to 3× more materials within a top-100 or top-10 ranking list than random-sampling-based ML pipelines, and it achieves significantly higher ranking quality. Moreover, the surrogate enrichment strategy further boosts top-N recovery while maintaining high ranking fidelity. By shifting the evaluation focus from average predictive metrics (e.g., R², MSE) to task-specific criteria (e.g., recall@N and nDCG), our approach offers a practical, data-efficient, and computationally accessible route to guide experimental and computational campaigns toward the most promising materials.
{"title":"Navigating materials design spaces with efficient Bayesian optimization: a case study in functionalized nanoporous materials","authors":"Panagiotis Krokidas, Vassilis Gkatsis, John Theocharis and George Giannakopoulos","doi":"10.1039/D5DD00237K","DOIUrl":"https://doi.org/10.1039/D5DD00237K","url":null,"abstract":"<p >Machine learning (ML) has the potential to accelerate the discovery of high-performance materials by learning complex structure–property relationships and prioritizing candidates for costly experiments or simulations. However, ML efficiency is often offset by the need for large, high-quality training datasets, motivating strategies that intelligently select the most informative samples. Here, we formulate the search for top-performing functionalized nanoporous materials (metal–organic and covalent–organic frameworks) as a global optimization problem and apply Bayesian Optimization (BO) to identify regions of interest and rank candidates with minimal evaluations. We highlight the importance of a proper and efficient initialization scheme of the BO process, and we demonstrate how BO-acquired samples can also be used to train an XGBoost regression predictive model that can further enrich the efficient mapping of the region of high performing instances of the design space. Across multiple literature-derived adsorption and diffusion datasets containing thousands of structures, our BO framework identifies 2×- to 3×-more materials within a top-100 or top-10 ranking list, than random-sampling-based ML pipelines, and it achieves significantly higher ranking quality. Moreover, the surrogate enrichment strategy further boosts top-<em>N</em> recovery while maintaining high ranking fidelity. By shifting the evaluation focus from average predictive metrics (<em>e.g.</em>, <em>R</em><small><sup>2</sup></small>, MSE) to task-specific criteria (<em>e.g.</em>, recall@<em>N</em> and nDCG), our approach offers a practical, data-efficient, and computationally accessible route to guide experimental and computational campaigns toward the most promising materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3753-3763"},"PeriodicalIF":6.2,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00237k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker and Christoph Stampfer
The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms, however, often struggle to identify low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator that produces realistic microscopy images from unlabeled data. As a result, the model can quickly adapt to new materials with as few as 5 to 10 images. An uncertainty estimation model is then used to classify the predictions based on optical contrast. We evaluate our method on eight datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.
{"title":"MaskTerial: a foundation model for automated 2D material flake detection","authors":"Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker and Christoph Stampfer","doi":"10.1039/D5DD00156K","DOIUrl":"10.1039/D5DD00156K","url":null,"abstract":"<p >The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often exhibit challenges in identifying low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator that generates realistic microscopy images from unlabeled data. This results in a model that can quickly adapt to new materials with as little as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3744-3752"},"PeriodicalIF":6.2,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12598537/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145497629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alice Gauthier, Laure Vancauwenberghe, Jean-Charles Cousty, Cyril Matthey-Doret, Robin Franken, Sabine Maennel, Pascal Miéville and Oksana Riba Grognuz
The growing demand for reproducible, high-throughput chemical experimentation calls for scalable digital infrastructures that support automation, traceability, and AI-readiness. A dedicated research data infrastructure (RDI) developed within Swiss Cat+ is presented, integrating automated synthesis, multi-stage analytics, and semantic modeling. It captures each experimental step in a structured, machine-interpretable format, forming a scalable and interoperable data backbone. By systematically recording both successful and failed experiments, the RDI ensures data completeness, strengthens traceability, and enables the creation of bias-resilient datasets essential for robust AI model development. Built on Kubernetes and Argo Workflows and aligned with FAIR principles, the RDI transforms experimental metadata into validated Resource Description Framework (RDF) graphs using an ontology-driven semantic model. These graphs are accessible through a web interface and SPARQL endpoint, facilitating integration with downstream AI and analysis pipelines. Key features include a modular RDF converter and ‘Matryoshka files’, which encapsulate complete experiments with raw data and metadata in a portable, standardized ZIP format. This approach supports scalable querying and sets the stage for standardized data sharing and autonomous experimentation.
{"title":"A FAIR research data infrastructure for high-throughput digital chemistry","authors":"Alice Gauthier, Laure Vancauwenberghe, Jean-Charles Cousty, Cyril Matthey-Doret, Robin Franken, Sabine Maennel, Pascal Miéville and Oksana Riba Grognuz","doi":"10.1039/D5DD00297D","DOIUrl":"https://doi.org/10.1039/D5DD00297D","url":null,"abstract":"<p >The growing demand for reproducible, high-throughput chemical experimentation calls for scalable digital infrastructures that support automation, traceability, and AI-readiness. A dedicated research data infrastructure (RDI) developed within Swiss Cat+ is presented, integrating automated synthesis, multi-stage analytics, and semantic modeling. It captures each experimental step in a structured, machine-interpretable format, forming a scalable, and interoperable data backbone. By systematically recording both successful and failed experiments, the RDI ensures data completeness, strengthens traceability, and enables the creation of bias-resilient datasets essential for robust AI model development. Built on Kubernetes and Argo Workflows and aligned with FAIR principles, the RDI transforms experimental metadata into validated Resource Description Framework (RDF) graphs using an ontology-driven semantic model. These graphs are accessible through a web interface and SPARQL endpoint, facilitating integration with downstream AI and analysis pipelines. Key features include a modular RDF converter and ‘Matryoshka files’, which encapsulate complete experiments with raw data and metadata in a portable, standardized ZIP format. This approach supports scalable querying and sets the stage for standardized data sharing and autonomous experimentation.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3502-3514"},"PeriodicalIF":6.2,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00297d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correction for “Beyond training data: how elemental features enhance ML-based formation energy predictions” by Hamed Mahdavi et al., Digital Discovery, 2025, 4, 2972–2982, https://doi.org/10.1039/D5DD00182J.
{"title":"Correction: Beyond training data: how elemental features enhance ML-based formation energy predictions","authors":"Hamed Mahdavi, Vasant Honavar and Dane Morgan","doi":"10.1039/D5DD90047F","DOIUrl":"https://doi.org/10.1039/D5DD90047F","url":null,"abstract":"<p >Correction for “Beyond training data: how elemental features enhance ML-based formation energy predictions” by Hamed Mahdavi <em>et al.</em>, <em>Digital Discovery</em>, 2025, <strong>4</strong>, 2972–2982, https://doi.org/10.1039/D5DD00182J.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3828-3828"},"PeriodicalIF":6.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd90047f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metal–organic frameworks (MOFs) exhibit immense structural diversity and hold promise for applications ranging from gas storage and separation to energy storage and conversion. However, structural flexibility makes accurate and scalable property prediction difficult. While machine learning potentials (MLPs) offer a compelling balance between accuracy and efficiency, most existing models are system-specific and lack transferability across different MOFs. In this work, we introduce FFLAME – Fragment-to-Framework Learning Approach for MOF Potentials, a fragment-centric strategy for training transferable MLPs. By decomposing MOFs into their constituent metal clusters and organic linkers, FFLAME enables efficient reuse of chemical environments and significantly reduces the need for full-framework training data. We demonstrate that fragment-informed training improves model generalizability, particularly in data-scarce regimes, and accelerates convergence during fine-tuning. FFLAME achieves near-target accuracy on unseen MOFs with minimal additional training. These results establish a robust and data-efficient pathway toward general-purpose MLPs for the simulation of diverse framework materials.
{"title":"FFLAME: a fragment-to-framework learning approach for MOF potentials","authors":"Xiaoqi Zhang, Yutao Li, Xin Jin and Berend Smit","doi":"10.1039/D5DD00321K","DOIUrl":"10.1039/D5DD00321K","url":null,"abstract":"<p >Metal–organic frameworks (MOFs) exhibit immense structural diversity and hold promise for applications ranging from gas storage and separation to energy storage and conversion. However, structural flexibility makes accurate and scalable property prediction difficult. While machine learning potentials (MLPs) offer a compelling balance between accuracy and efficiency, most existing models are system-specific and lack transferability across different MOFs. In this work, we introduce FFLAME – Fragment-to-Framework Learning Approach for MOF Potentials, a fragment-centric strategy for training transferable MLPs. By decomposing MOFs into their constituent metal clusters and organic linkers, FFLAME enables efficient reuse of chemical environments and significantly reduces the need for full-framework training data. We demonstrate that fragment-informed training improves model generalizability, particularly in data-scarce regimes, and accelerates convergence during fine-tuning. FFLAME achieves near-target accuracy on unseen MOFs with minimal additional training. These results establish a robust and data-efficient pathway toward general-purpose MLPs for the simulation of diverse framework materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3466-3477"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12593188/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145483945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ricardo Montoya-Gonzalez, Rosa de Guadalupe González-Huerta, Martha Leticia Hernández-Pichardo and Subha R. Das
The integration of machine learning (ML) into materials science has the potential to accelerate material discovery and optimize properties. However, the reliability of ML models depends heavily on the consistency and reproducibility of experimental data. In this study, we present a methodology that combines automated, remotely programmed synthesis protocols with ML to enable data-driven materials discovery. Experiments were programmed and conducted remotely through robotic syntheses at cloud laboratories, using multiple liquid handlers and spectrometers across two independent facilities (Emerald Cloud Lab, Austin, TX and Carnegie Mellon University Automated Science Lab, Pittsburgh, PA). This multi-instrument approach ensured precise control over reaction parameters, eliminated both operator and instrument-specific variability, and enabled the generation of high-quality datasets for ML training. From only 40 training samples, our approach predicts whether specific synthesis parameters will lead to successful formation of copper nanoclusters (CuNCs), with interpretable models providing mechanistic insights through SHAP analysis. Our workflow demonstrates how remotely accessed cloud-laboratory infrastructure coupled with ML can transform traditionally manual processes into autonomous, predictive systems. This multi-instrument validation demonstrates the reproducibility critical for reliable ML-driven materials discovery and for advancing automated materials synthesis beyond single-laboratory demonstrations.
{"title":"Cross-laboratory validation of machine learning models for copper nanocluster synthesis using cloud-based automated platforms","authors":"Ricardo Montoya-Gonzalez, Rosa de Guadalupe González-Huerta, Martha Leticia Hernández-Pichardo and Subha R. Das","doi":"10.1039/D5DD00335K","DOIUrl":"https://doi.org/10.1039/D5DD00335K","url":null,"abstract":"<p >The integration of machine learning (ML) into materials science has the potential to accelerate material discovery and optimize properties. However, the reliability of ML models depends heavily on the consistency and reproducibility of experimental data. In this study, we present a methodology to combine automated, remotely-programmed synthesis protocols with ML to enable data-driven materials discovery. Experiments were programmed and conducted remotely through robotic syntheses at cloud laboratories, using multiple different liquid handlers and spectrometers across two independent facilities (Emerald Cloud Lab, Austin, TX and Carnegie Mellon University Automated Science Lab, Pittsburgh, PA). This multi-instrument approach ensured precise control over reaction parameters, eliminated both operator and instrument-specific variability, and enabled generation of high-quality datasets for ML training. From only 40 training samples, our approach predicts whether specific synthesis parameters will lead to successful formation of copper nanoclusters (CuNCs) with interpretable models providing mechanistic insights through SHAP analysis. Our workflow demonstrates how remotely accessed/cloud laboratory infrastructure coupled with ML can transform traditionally manual processes into autonomous, predictive systems. This multi-instrument validation demonstrates reproducibility critical for reliable ML-driven materials discovery and for advancing automated materials synthesis beyond single-laboratory demonstrations.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3683-3692"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00335k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, e.g., through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: enzyme commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning via LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.
{"title":"Leveraging large language models for enzymatic reaction prediction and characterization","authors":"Lorenzo Di Fruscia and Jana M. Weber","doi":"10.1039/D5DD00187K","DOIUrl":"https://doi.org/10.1039/D5DD00187K","url":null,"abstract":"<p >Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, <em>e.g.,</em> through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: enzyme commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning <em>via</em> LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3588-3609"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00187k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das and Arnaud Demortière
This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
{"title":"Unsupervised multi-clustering and decision-making strategies for 4D-STEM orientation mapping","authors":"Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das and Arnaud Demortière","doi":"10.1039/D5DD00071H","DOIUrl":"https://doi.org/10.1039/D5DD00071H","url":null,"abstract":"<p >This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (<em>k</em>) required for robust and interpretable orientation mapping. By leveraging the <em>K</em>-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3610-3622"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00071h?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail
Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic-level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard-corrected density functional theory (DFT+U) in a numerical atom-centred orbital framework has been shown to address this challenge but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), e.g., TiO2, and rare-earth metal oxides (REOs), e.g., CeO2, necessitating the development of advanced DFT+U parameterisation strategies. In this work, the numerical instabilities of DFT+U are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO2 using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO2, with accuracy comparable to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard U values and projectors is presented. The transferability of the method is shown for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials such as LiCo1−xMgxO2−x. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+U parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.
{"title":"Machine learning generalised DFT+U projectors in a numerical atom-centred orbital framework","authors":"Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail","doi":"10.1039/D5DD00292C","DOIUrl":"https://doi.org/10.1039/D5DD00292C","url":null,"abstract":"<p >Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard corrected density functional theory (DFT+<em>U</em>) in a numerical atom-centred orbital framework has been shown to address this challenge but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), <em>e.g.</em>, TiO<small><sub>2</sub></small> and rare-earth metal oxides (REOs), <em>e.g.</em>, CeO<small><sub>2</sub></small>, necessitating the development of advanced DFT+<em>U</em> parameterisation strategies. In this work, the numerical instabilities of DFT+<em>U</em> are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO<small><sub>2</sub></small> using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO<small><sub>2</sub></small>, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard <em>U</em> values and projectors is presented. The method transferability is shown for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials like LiCo<small><sub>1−<em>x</em></sub></small>Mg<small><sub><em>x</em></sub></small>O<small><sub>2−<em>x</em></sub></small>. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+<em>U</em> parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3701-3727"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00292c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Atomistic simulations driven by machine-learned interatomic potentials (MLIPs) are a cost-effective alternative to ab initio molecular dynamics (AIMD). Yet, their broad applicability in reaction modelling remains hindered, in part, by the need for large training datasets that adequately sample the relevant potential energy surface, including high-energy transition state (TS) regions. To optimise dataset generation and extend the use of MLIPs for reaction modelling, we present a data-efficient and fully automated workflow for MLIP training that requires only a small number (typically five to ten) of initial configurations and no prior knowledge of the TS. The approach combines automated active learning with well-tempered metadynamics to iteratively and selectively explore chemically relevant regions of configuration space. Using data-efficient architectures, such as the linear Atomic Cluster Expansion, we illustrate the performance of this strategy in various organic reactions where the environment is described at different levels, including the SN2 reaction between fluoride and chloromethane in implicit water, the methyl shift of 2,2-dimethylisoindene in the gas phase, and a glycosylation reaction in explicit dichloromethane solution, where competitive pathways exist. The proposed training strategy yields accurate and stable MLIPs for all three cases, highlighting its versatility for modelling reactive processes.
{"title":"Active learning meets metadynamics: automated workflow for reactive machine learning interatomic potentials","authors":"Valdas Vitartas, Hanwen Zhang, Veronika Juraskova, Tristan Johnston-Wood and Fernanda Duarte","doi":"10.1039/D5DD00261C","DOIUrl":"10.1039/D5DD00261C","url":null,"abstract":"<p >Atomistic simulations driven by machine-learned interatomic potentials (MLIPs) are a cost-effective alternative to <em>ab initio</em> molecular dynamics (AIMD). Yet, their broad applicability in reaction modelling remains hindered, in part, by the need for large training datasets that adequately sample the relevant potential energy surface, including high-energy transition state (TS) regions. To optimise dataset generation and extend the use of MLIPs for reaction modelling, we present a data-efficient and fully automated workflow for MLIP training that requires only a small number (typically five to ten) of initial configurations and no prior knowledge of the TS. The approach combines automated active learning with well-tempered metadynamics to iteratively and selectively explore chemically relevant regions of configuration space. Using data-efficient architectures, such as the linear Atomic Cluster Expansion, we illustrate the performance of this strategy in various organic reactions where the environment is described at different levels, including the S<small><sub>N</sub></small>2 reaction between fluoride and chloromethane in implicit water, the methyl shift of 2,2-dimethylisoindene in the gas phase, and a glycosylation reaction in explicit dichloromethane solution, where competitive pathways exist. The proposed training strategy yields accurate and stable MLIPs for all three cases, highlighting its versatility for modelling reactive processes.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 108-122"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12642453/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145607639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}