Digital Discovery: latest articles (page 10)

Unsupervised multi-clustering and decision-making strategies for 4D-STEM orientation mapping

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-30 DOI: 10.1039/D5DD00071H

Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das and Arnaud Demortière

This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
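The component-selection idea in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the K-component loss and IQA metrics are not defined in this listing, so the code below uses a plain Frobenius reconstruction loss and standard Lee-Seung multiplicative updates as stand-ins, and the matrix `X` is a hypothetical toy built from two non-negative patterns rather than 4D-STEM data.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_loss(X, W, H):
    """Squared Frobenius norm of X - W H (reconstruction loss)."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, k, iters=100, seed=0, eps=1e-9):
    """Factor non-negative X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Hypothetical toy data: six "pixels" mixing two non-negative patterns.
P = [[1.0, 0.0, 2.0, 0.0, 1.0], [0.0, 3.0, 0.0, 1.0, 0.0]]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [0.5, 0.5]]
X = matmul(A, P)

# Scan candidate k values and report the reconstruction loss for each.
for k in (1, 2, 3):
    W, H = nmf(X, k)
    print(k, frobenius_loss(X, W, H))
```

In a scan like this, one would look for the k beyond which the loss stops improving appreciably; the abstract indicates that the authors combine such a loss with image-quality metrics rather than relying on the loss alone.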

Volume 12, pages 3610-3622 (as listed in the record metadata). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00071h?page=search

Citations: 0

Machine learning generalised DFT+U projectors in a numerical atom-centred orbital framework

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-30 DOI: 10.1039/D5DD00292C

Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail

Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic-level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard-corrected density functional theory (DFT+U) in a numerical atom-centred orbital framework has been shown to address this challenge but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), e.g., TiO₂, and rare-earth metal oxides (REOs), e.g., CeO₂, necessitating the development of advanced DFT+U parameterisation strategies. In this work, the numerical instabilities of DFT+U are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO₂ using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO₂, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard U values and projectors is presented. The method's transferability is shown for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials like LiCo₁₋ₓMgₓO₂₋ₓ. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+U parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.

Volume 12, pages 3701-3727 (as listed in the record metadata). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00292c?page=search

Citations: 0

Active learning meets metadynamics: automated workflow for reactive machine learning interatomic potentials

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-30 DOI: 10.1039/D5DD00261C

Valdas Vitartas, Hanwen Zhang, Veronika Juraskova, Tristan Johnston-Wood and Fernanda Duarte

Atomistic simulations driven by machine-learned interatomic potentials (MLIPs) are a cost-effective alternative to ab initio molecular dynamics (AIMD). Yet, their broad applicability in reaction modelling remains hindered, in part, by the need for large training datasets that adequately sample the relevant potential energy surface, including high-energy transition state (TS) regions. To optimise dataset generation and extend the use of MLIPs for reaction modelling, we present a data-efficient and fully automated workflow for MLIP training that requires only a small number (typically five to ten) of initial configurations and no prior knowledge of the TS. The approach combines automated active learning with well-tempered metadynamics to iteratively and selectively explore chemically relevant regions of configuration space. Using data-efficient architectures, such as the linear Atomic Cluster Expansion, we illustrate the performance of this strategy in various organic reactions where the environment is described at different levels, including the S_N2 reaction between fluoride and chloromethane in implicit water, the methyl shift of 2,2-dimethylisoindene in the gas phase, and a glycosylation reaction in explicit dichloromethane solution, where competitive pathways exist. The proposed training strategy yields accurate and stable MLIPs for all three cases, highlighting its versatility for modelling reactive processes.
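For readers unfamiliar with well-tempered metadynamics, the following sketch shows its defining feature in one dimension: each new Gaussian hill is scaled by exp(-V_bias(s) / (k_B ΔT)) with ΔT = (γ - 1)T, so hills shrink where bias has already accumulated. The function names, the single collective variable and the parameter values are illustrative assumptions; the active-learning loop and MLIP training described in the abstract are not reproduced here.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)

def bias(s, hills, sigma):
    """Total bias at collective-variable value s from deposited Gaussian hills."""
    return sum(h * math.exp(-((s - c) ** 2) / (2.0 * sigma ** 2)) for c, h in hills)

def deposit(s, hills, h0, sigma, temperature, bias_factor):
    """Add one well-tempered hill at s and return its (rescaled) height."""
    delta_t = (bias_factor - 1.0) * temperature  # ΔT = (γ - 1) T
    height = h0 * math.exp(-bias(s, hills, sigma) / (KB * delta_t))
    hills.append((s, height))
    return height

# Illustrative run: repeatedly visit the same point and watch the hills shrink.
hills = []
heights = [deposit(0.0, hills, h0=1.0, sigma=0.1, temperature=300.0, bias_factor=10.0)
           for _ in range(5)]
print(heights)
```

Because the hill height decays as bias builds up, the bias converges smoothly rather than growing without bound, which is the property that makes this variant attractive for exploring reactive regions in a controlled way.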

Volume 1, pages 108-122 (as listed in the record metadata). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12642453/pdf/

Citations: 0

A straightforward gradient-based approach for designing superconductors with high critical temperature: exploiting domain knowledge via adaptive constraints

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-29 DOI: 10.1039/D5DD00250H

Akihiro Fujii, Anh Khoa Augustin Lu, Koji Shimizu and Satoshi Watanabe

Materials design aims to discover novel compounds with desired properties. However, prevailing strategies face critical trade-offs. Conventional element-substitution approaches readily and adaptively incorporate various domain knowledge but remain confined to a narrow search space. In contrast, deep generative models efficiently explore vast compositional landscapes, yet they struggle to flexibly integrate domain knowledge. To address these trade-offs, we propose a gradient-based material design framework that combines these strengths, offering both efficiency and adaptability. In our method, chemical compositions are optimised to achieve target properties by using property prediction models and their gradients. To seamlessly enforce diverse constraints, including those reflecting domain insights such as oxidation states, discretised compositional ratios, types of elements, and their abundance, we apply masks and employ a special loss function, namely the integer loss. Furthermore, we initialise the optimisation using promising candidates from existing datasets, effectively guiding the search away from unfavourable regions and thus helping to avoid poor solutions. Our approach demonstrates a more efficient exploration of superconductor candidates, uncovering candidate materials with higher critical temperature than conventional element-substitution and generative models. Importantly, it could propose new compositions beyond those found in existing databases, including new hydride superconductors absent from the training dataset but which share compositional similarities with materials found in the literature. This synergy of domain knowledge and machine-learning-based scalability provides a robust foundation for rapid, adaptive, and comprehensive materials design for superconductors and beyond.
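A toy version of masked, gradient-based composition optimisation might look like the sketch below. Everything in it is an assumption made for illustration: the "property model" is a fixed linear score rather than a trained predictor, the integer penalty (squared distance to the nearest integer) is one plausible reading of an "integer loss" and not the authors' definition, and `units`, `lam` and the three-element composition are hypothetical.

```python
import math

def masked_softmax(logits, mask):
    """Composition fractions; masked-out elements are fixed at exactly zero."""
    top = max(z for z, keep in zip(logits, mask) if keep)
    exps = [math.exp(z - top) if keep else 0.0 for z, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def integer_penalty(amounts):
    """Sum of squared distances to the nearest integer (zero for integers)."""
    return sum((a - round(a)) ** 2 for a in amounts)

def score(x, weights):
    """Stand-in property model: a fixed linear score of the composition."""
    return sum(w * xi for w, xi in zip(weights, x))

def optimise(logits, weights, mask, units=4, lam=0.0, lr=0.5, steps=100):
    """Gradient ascent on score(x) - lam * integer_penalty(units * x)."""
    for _ in range(steps):
        x = masked_softmax(logits, mask)
        # Gradient of the objective with respect to each fraction x_i.
        gx = [w - lam * 2 * units * (units * xi - round(units * xi))
              for w, xi in zip(weights, x)]
        avg = sum(g * xi for g, xi in zip(gx, x))
        # Chain rule through the softmax: dJ/dz_i = x_i * (gx_i - sum_j gx_j x_j).
        logits = [z + lr * xi * (g - avg) for z, xi, g in zip(logits, x, gx)]
    return masked_softmax(logits, mask)

weights = [1.0, 3.0, 2.0]     # hypothetical per-element contributions
mask = [True, False, True]    # the second element is disallowed
start = masked_softmax([0.0, 0.0, 0.0], mask)
best = optimise([0.0, 0.0, 0.0], weights, mask)
print(start, score(start, weights))
print(best, score(best, weights))
```

The mask guarantees that a disallowed element never enters the composition, since its fraction and therefore its gradient stay at zero; setting `lam` above zero additionally nudges `units * x` towards integer amounts.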

Volume 12, pages 3662-3673 (as listed in the record metadata). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00250h?page=search

Citations: 0

Machine learning anomaly detection of automated HPLC experiments in the cloud laboratory

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-29 DOI: 10.1039/D5DD00253B

Filipp Gusev, Benjamin C. Kline, Ryan Quinn, Anqin Xu, Ben Smith, Brian Frezza and Olexandr Isayev

Automation of experiments in cloud laboratories promises to revolutionize scientific research by enabling remote experimentation and improving reproducibility. However, maintaining quality control without constant human oversight remains a critical challenge. Here, we present a novel machine learning framework for automated anomaly detection in High-Performance Liquid Chromatography (HPLC) experiments conducted in a cloud lab. Our system specifically targets air bubble contamination—a common yet challenging issue that typically requires expert analytical chemists to detect and resolve. By leveraging active learning combined with human-in-the-loop annotation, we trained a binary classifier on approximately 25 000 HPLC traces. Prospective validation demonstrated robust performance, with an accuracy of 0.96 and an F1 score of 0.92, suitable for real-world applications. Beyond anomaly detection, we show that the system can serve as a sensitive indicator of instrument health, outperforming traditional periodic qualification tests in identifying systematic issues. The framework is protocol-agnostic, instrument-agnostic, and, in principle, vendor-neutral, making it adaptable to various laboratory settings. This work represents a significant step toward fully autonomous laboratories by enabling continuous quality control, reducing the expertise barrier for complex analytical techniques, and facilitating proactive maintenance of scientific instrumentation. The approach can be extended to detect other types of experimental anomalies, potentially transforming how quality control is implemented in self-driving laboratories (SDLs) across diverse scientific disciplines.
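The abstract reports both accuracy and F1, and the two can differ noticeably when anomalies are rare. The sketch below computes them from binary labels using the standard definitions; the label vectors are made-up toy data, not the study's validation set.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 anomalous traces out of 10, one missed and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy case the accuracy is higher than the F1 score because the many correctly classified normal traces count towards accuracy but not towards F1, which only looks at how well the anomalous class is handled.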

Volume 12, pages 3445-3454 (as listed in the record metadata). PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00253b?page=search

Citations: 0

Semantic repurposing model for traditional Chinese ancient formulas based on a knowledge graph

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-29 DOI: 10.1039/D5DD00344J

Xu Dong, Wenyan Zhao, Feifei Li, LiHong Hu, Hongzhi Li and Guangzhe Li

Drug repurposing can dramatically decrease cost and risk in drug discovery, and it can be very helpful for recommending candidate drugs. However, as traditional Chinese medicine (TCM) formulas are multi-component, the repurposing methods developed for western medicine are usually not applicable to TCM formulas. In this study, we propose a concept/strategy for multi-component formula/recipe discovery with network and semantics. With this concept, we establish a semantic formula-repurposing model for TCM based on a link-prediction algorithm and a knowledge graph (KG). The proposed model, which integrates semantic embedding with KG networks, facilitates the effective repurposing of traditional Chinese medicine formulas. First, we construct a KG that consists of more than 46 600 ancient formulas, including over 120 000 entities, 415 900 triples and 12 relations that are extracted from unstructured textual data by deep-learning techniques. Then, a link-prediction model is built on KG triples to learn semantic vectors for entities and edges. The formula-repurposing task is treated as computing the similarity of semantic vectors in the KG between entities and query formulas. In the current version of the proposed model, two ways of repurposing are tested: one is searching for a formula similar to the query formula, and the other is seeking a possible formula for rare, emerging diseases or epidemics. The former is based on the name of a formula; the latter is carried out through symptom entities. The experiments are exemplified with an existing formula, Fufang Danshen Tablets, and the symptoms of COVID-19. The results agree well with existing clinical practices. This suggests our model can be a comprehensive approach to constructing a knowledge graph of TCM formulas and a TCM formula-repurposing strategy, which is able to assist compound formula development and facilitate further research in multi-compound drug/prescription discovery.
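The similarity step described in this abstract can be pictured with a short sketch. It assumes cosine similarity over already-learned embedding vectors, which the listing does not confirm; the entity names and three-dimensional vectors are invented placeholders, and the link-prediction training itself is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(query, embeddings, top_k=3):
    """Rank all other entities by similarity to the query entity's vector."""
    q = embeddings[query]
    scored = [(name, cosine(q, vec)) for name, vec in embeddings.items() if name != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical embeddings standing in for vectors learned from KG triples.
toy_embeddings = {
    "formula_A": [1.0, 0.0, 0.2],
    "formula_B": [0.9, 0.1, 0.3],
    "formula_C": [0.0, 1.0, 0.0],
    "symptom_X": [0.1, 0.9, 0.1],
}

for name, sim in rank_similar("formula_A", toy_embeddings):
    print(name, round(sim, 3))
```

Querying by a formula name corresponds to the first repurposing mode in the abstract; querying from a symptom entity such as the placeholder `symptom_X` would correspond to the second.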

Volume 1, pages 317-331 (as listed in the record metadata). PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00344j?page=search

Citations: 0

Estimating Trotter approximation errors to optimize Hamiltonian partitioning for lower eigenvalue errors

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-28 DOI: 10.1039/D5DD00185D

Shashank G. Mehendale, Luis A. Martínez-Martínez, Prathami Divakar Kamath and Artur F. Izmaylov

Trotter approximation in conjunction with quantum phase estimation can be used to extract eigen-energies of a many-body Hamiltonian on a quantum computer. Several ways have been proposed to assess the quality of this approximation based on estimating the norm of the difference between the exact and approximate evolution operators. Here, we explore how different error estimators for various partitionings correlate with the true error in the ground-state energy due to the Trotter approximation. For a set of small molecules, we calculate the exact errors in ground-state electronic energies due to the second-order Trotter approximation. Comparison of these errors with previously used upper bounds shows correlations below 0.5 across various Hamiltonian partitionings. On the other hand, building the Trotter approximation error estimate from perturbation theory for the eigenvalues, up to second order in the time-step, provides estimates that correlate very well with the exact Trotter approximation errors. These findings highlight the non-faithful character of norm-based estimations for prediction of the best Hamiltonian partitionings and the need for perturbative estimates.
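The quadratic dependence of the eigenvalue error on the time-step can be checked on a deliberately tiny example. The sketch below uses a single-qubit Hamiltonian H = X + Z, partitioned into X and Z, which is an assumption chosen for transparency and has nothing to do with the molecular Hamiltonians studied in the paper. It reads the ground-state energy of the effective Hamiltonian off the second-order Trotter propagator and compares it with the exact value -√2.

```python
import math

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def matmul(A, B):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def exp_pauli(P, angle):
    """exp(-i * angle * P) for a Pauli matrix P, using P @ P = I."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * I2[i][j] - 1j * s * P[i][j] for j in range(2)] for i in range(2)]

def trotter2_ground_energy(t):
    """Ground-state energy of H_eff defined by U2(t) = exp(-i H_eff t)."""
    half = exp_pauli(X, t / 2)
    U = matmul(matmul(half, exp_pauli(Z, t)), half)  # second-order Trotter step
    # U is in SU(2) with eigenvalues exp(-i theta) and exp(+i theta); Tr U = 2 cos(theta).
    theta = math.acos((U[0][0] + U[1][1]).real / 2)
    return -theta / t

EXACT = -math.sqrt(2)  # exact ground-state energy of X + Z

def trotter_error(t):
    return trotter2_ground_energy(t) - EXACT

for t in (0.2, 0.1, 0.05):
    print(t, trotter_error(t))
```

Halving the time-step should reduce the error by roughly a factor of four, consistent with a leading error term proportional to the square of the time-step, which is the quantity a second-order perturbative estimate targets.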

Digital discovery, 12, 3540-3551. PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00185d?page=search

Citations: 0

MolEncoder: towards optimal masked language modeling for molecules

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-28 DOI: 10.1039/D5DD00369E

Fabian P. Krüger, Nicklas Österbacka, Mikhail Kabeshov, Ola Engkvist and Igor Tetko

Predicting molecular properties is a key challenge in drug discovery. Machine learning models, especially those based on transformer architectures, are increasingly used to make these predictions from chemical structures. Inspired by recent progress in natural language processing, many studies have adopted encoder-only transformer architectures similar to BERT (Bidirectional Encoder Representations from Transformers) for this task. These models are pretrained using masked language modeling, where parts of the input are hidden and the model learns to recover them, before being fine-tuned on downstream tasks. In this work, we systematically investigate whether core assumptions from natural language processing, which are commonly adopted in molecular BERT-based models, actually hold when applied to molecules represented using the Simplified Molecular Input Line Entry System (SMILES). Specifically, we examine how masking ratio, pretraining dataset size, and model size affect performance in molecular property prediction. We find that masking ratios higher than those commonly used significantly improve performance. In contrast, increasing model or pretraining dataset size quickly leads to diminishing returns, offering no consistent benefit while incurring significantly higher computational cost. Based on these insights, we develop MolEncoder, a BERT-based model that outperforms existing approaches on drug discovery tasks while being more computationally efficient. Our results highlight key differences between molecular pretraining and natural language processing, showing that they require different design choices. This enables more efficient model development and lowers barriers for researchers with limited computational resources. We release MolEncoder publicly to support future work and hope our findings help make molecular representation learning more accessible and cost-effective in drug discovery.
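
To make the masking-ratio idea concrete, here is a minimal sketch of masking a SMILES string for masked language modeling. It is not MolEncoder's tokenizer or training code: the regex tokenizer, the `mask_smiles` helper, the `[MASK]` token string and the example ratios are all assumptions chosen for illustration, and the usual BERT refinement of sometimes substituting random or unchanged tokens is left out.

```python
import random
import re

# Simplistic SMILES tokenizer (an assumption): bracket atoms, two-letter
# halogens, otherwise single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def mask_smiles(smiles, ratio, seed=0, mask_token="[MASK]"):
    """Hide a fraction `ratio` of tokens; return masked tokens and the targets.

    The targets map each masked position to the original token the model
    would be trained to recover.
    """
    tokens = TOKEN_RE.findall(smiles)
    if not tokens:
        return [], {}
    n_mask = min(len(tokens), max(1, round(ratio * len(tokens))))
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
for ratio in (0.15, 0.5):  # 0.15 is the conventional BERT ratio; 0.5 is arbitrary
    masked, labels = mask_smiles(aspirin, ratio)
    print(ratio, len(labels), "".join(masked))
```

Sweeping `ratio` in a loop like this is one simple way to set up the kind of masking-ratio comparison the abstract describes, with the model and training loop supplied separately.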

Digital discovery, 12, 3552-3566. PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00369e?page=search

Citations: 0

Predicting PROTAC-mediated ternary complexes with AlphaFold3 and Boltz-1

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-27 DOI: 10.1039/D5DD00300H

Nils Dunlop, Francisco Erazo, Farzaneh Jalalypour and Rocío Mercado

Accurate prediction of protein–ligand and protein–protein interactions is essential for computational drug discovery, yet remains a significant challenge, particularly for complexes involving large, flexible ligands. In this study, we assess the capabilities of AlphaFold 3 (AF3) and Boltz-1 for modeling ligand-mediated ternary complexes, focusing on proteolysis-targeting chimeras (PROTACs). PROTACs facilitate targeted protein degradation by recruiting an E3 ubiquitin ligase to a protein of interest, offering a promising therapeutic strategy for previously undruggable intracellular targets. However, their size, flexibility, and cooperative binding requirements pose significant challenges for computational modeling. To address this, we systematically evaluated AF3 and Boltz-1 on 62 PROTAC complexes from the Protein Data Bank. Both models achieve high structural accuracy when ligand input is integrated during inference, as measured by RMSD, pTM, and DockQ scores, even for post-2021 structures absent from the AF3 and Boltz-1 training data. AF3 demonstrates superior ligand positioning, producing 33 ternary complexes with RMSD < 1 Å and 46 with RMSD < 4 Å, compared to 25 and 40, respectively, for Boltz-1. We explore different input strategies by comparing molecular string representations and explicit ligand atom positions, finding that the latter yields more accurate ligand placement and predictions. By analyzing the relationships between ligand positioning, protein–ligand interactions, and structural accuracy metrics, we provide insights into key factors influencing the performance of AF3 and Boltz-1 in modeling PROTAC-mediated binary and ternary complexes. To ensure reproducibility, we publicly release our pipeline and results via a GitHub repository and website (https://protacfold.xyz), providing a framework for future PROTAC structure prediction studies.
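
The RMSD thresholds quoted above can be made concrete with a short sketch. It assumes the predicted and reference ligand atoms are already matched one-to-one and expressed in the same frame (for example after superposing the protein), and it is not the authors' pipeline; the helper names `rmsd` and `count_below` and the coordinates are hypothetical.

```python
import numpy as np

def rmsd(pred, ref):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    No alignment is performed here; atoms are assumed to be matched in order.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("coordinate arrays must have the same shape")
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def count_below(rmsds, thresholds=(1.0, 4.0)):
    """Count how many complexes fall strictly under each RMSD threshold (in Å)."""
    values = np.asarray(rmsds, dtype=float)
    return {thr: int(np.sum(values < thr)) for thr in thresholds}

# Made-up coordinates: the "prediction" is the reference shifted by 1 Å along x.
ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
pred = ref + np.array([1.0, 0.0, 0.0])
print(rmsd(pred, ref))
print(count_below([0.4, 2.5, 6.0]))
```

Tallying per-complex values with `count_below` gives the same kind of summary as the "RMSD < 1 Å" and "RMSD < 4 Å" counts reported in the abstract.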

Digital discovery, 12, 3782-3809. PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00300h?page=search

Citations: 0

An improved machine learning strategy using structural features to predict the glass transition temperature of oxide glasses

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-10-24 DOI: 10.1039/D5DD00326A

Satwinder Singh Danewalia and Kulvir Singh

We present a physics-informed machine learning approach to predict the glass transition temperature (T_g) of sodium borosilicate glasses. Four models (random forest, extreme gradient boosting, support vector machines, and K-nearest neighbors) were trained using both compositional and structural features derived from statistical mechanics. Incorporating these structural descriptors significantly improved model performance. This is evident from the reduction in mean absolute error (14.85 K → 13.76 K) and root mean square error (21.78 → 19.12) and the increase in R² (0.88 → 0.91) measured on the test dataset for the random forest model. Similar performance improvements were seen for the other models as well. Building on this, we propose a three-step predictive strategy that enhances generalization across compositions and accurately predicts the T_g of unseen compositions, achieving a mean absolute error of approximately 8 K and an R² value of around 0.98. Our method demonstrates improved accuracy when benchmarked against GlassNet, which represents the current state of the art in property prediction for glasses. These results highlight the importance of considering structural information in improving the prediction capabilities of machine learning models for composition-specific small datasets. This approach can assist in the rapid screening and design of glass materials, reducing the reliance on time-consuming experiments and guiding future research toward targeted property optimization.
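
The three scores quoted in this abstract (mean absolute error, root mean square error and R²) follow their standard definitions, which the sketch below spells out. The `regression_metrics` helper and the temperature values are made up for illustration and are not data from the paper.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and the coefficient of determination R²."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))                # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical measured and predicted glass transition temperatures in kelvin.
tg_measured = [500.0, 600.0, 700.0]
tg_predicted = [510.0, 590.0, 700.0]
print(regression_metrics(tg_measured, tg_predicted))
```

Because RMSE squares the residuals before averaging, it penalizes large misses more heavily than MAE, which is why the two numbers are usually reported together.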

Digital discovery, 12, 3764-3773. PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00326a?page=search

Citations: 0