Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail
Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic-level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard-corrected density functional theory (DFT+U) in a numerical atom-centred orbital framework has been shown to address this challenge, but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), e.g., TiO2, and rare-earth metal oxides (REOs), e.g., CeO2, necessitating the development of advanced DFT+U parameterisation strategies. In this work, the numerical instabilities of DFT+U are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO2 using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO2, with accuracy comparable to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard U values and projectors is presented. The method's transferability is demonstrated for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials such as LiCo1−xMgxO2−x. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+U parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.
Title: Machine learning generalised DFT+U projectors in a numerical atom-centred orbital framework. Digital Discovery, 12, 3701–3727. DOI: 10.1039/D5DD00292C. Published 30 October 2025.
Atomistic simulations driven by machine-learned interatomic potentials (MLIPs) are a cost-effective alternative to ab initio molecular dynamics (AIMD). Yet, their broad applicability in reaction modelling remains hindered, in part, by the need for large training datasets that adequately sample the relevant potential energy surface, including high-energy transition state (TS) regions. To optimise dataset generation and extend the use of MLIPs for reaction modelling, we present a data-efficient and fully automated workflow for MLIP training that requires only a small number (typically five to ten) of initial configurations and no prior knowledge of the TS. The approach combines automated active learning with well-tempered metadynamics to iteratively and selectively explore chemically relevant regions of configuration space. Using data-efficient architectures, such as the linear Atomic Cluster Expansion, we illustrate the performance of this strategy in various organic reactions where the environment is described at different levels, including the SN2 reaction between fluoride and chloromethane in implicit water, the methyl shift of 2,2-dimethylisoindene in the gas phase, and a glycosylation reaction in explicit dichloromethane solution, where competitive pathways exist. The proposed training strategy yields accurate and stable MLIPs for all three cases, highlighting its versatility for modelling reactive processes.
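The active-learning selection step described above can be sketched as a query-by-committee filter: configurations sampled by metadynamics are added to the training set only where an ensemble of potentials disagrees. This is a minimal illustration under stated assumptions, not the authors' implementation — linear ACE workflows often use extrapolation grades rather than committee variance, and the `threshold` value and all names here are hypothetical.

```python
import numpy as np

def select_by_committee(energies: np.ndarray, threshold: float) -> list:
    """Flag configurations on which a committee of potentials disagrees:
    `energies` has shape (n_models, n_configs); the per-configuration
    standard deviation across models serves as the uncertainty signal."""
    std = energies.std(axis=0)
    return [int(i) for i in np.where(std > threshold)[0]]

# Three committee members evaluating four metadynamics snapshots (toy numbers).
E = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.1, 3.5, 4.0],
              [1.0, 1.9, 2.5, 4.0]])
print(select_by_committee(E, threshold=0.2))  # [2]
```

Only the flagged configurations would then be labelled with ab initio calculations and added to the training set for the next iteration.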
Authors: Valdas Vitartas, Hanwen Zhang, Veronika Juraskova, Tristan Johnston-Wood and Fernanda Duarte.
Title: Active learning meets metadynamics: automated workflow for reactive machine learning interatomic potentials. Digital Discovery, 1, 108–122. DOI: 10.1039/D5DD00261C. Published 30 October 2025.
Akihiro Fujii, Anh Khoa Augustin Lu, Koji Shimizu and Satoshi Watanabe
Materials design aims to discover novel compounds with desired properties. However, prevailing strategies face critical trade-offs. Conventional element-substitution approaches readily and adaptively incorporate various domain knowledge but remain confined to a narrow search space. In contrast, deep generative models efficiently explore vast compositional landscapes, yet they struggle to flexibly integrate domain knowledge. To address these trade-offs, we propose a gradient-based material design framework that combines these strengths, offering both efficiency and adaptability. In our method, chemical compositions are optimised to achieve target properties by using property prediction models and their gradients. To seamlessly enforce diverse constraints, including those reflecting domain insights such as oxidation states, discretised compositional ratios, types of elements and their abundance, we apply masks and employ a special loss function, namely the integer loss. Furthermore, we initialise the optimisation using promising candidates from existing datasets, effectively guiding the search away from unfavourable regions and thus helping to avoid poor solutions. Our approach demonstrates a more efficient exploration of superconductor candidates, uncovering candidate materials with higher critical temperatures than conventional element-substitution and generative models. Importantly, it can propose new compositions beyond those found in existing databases, including new hydride superconductors absent from the training dataset that share compositional similarities with materials reported in the literature. This synergy of domain knowledge and machine-learning-based scalability provides a robust foundation for rapid, adaptive, and comprehensive materials design for superconductors and beyond.
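The "integer loss" named above can be illustrated with a minimal sketch: a penalty that vanishes exactly when every compositional ratio is a whole number, so gradient steps are pulled toward discrete stoichiometries. The quadratic form below is an assumption for illustration; the paper's exact definition may differ.

```python
import numpy as np

def integer_loss(x: np.ndarray) -> float:
    """Quadratic penalty on the distance of each compositional ratio from
    its nearest integer; zero exactly when the composition is integral,
    so optimisation is pulled toward whole-number stoichiometries."""
    return float(np.sum((x - np.round(x)) ** 2))

assert integer_loss(np.array([1.0, 2.0, 3.0])) == 0.0   # integral: no penalty
print(integer_loss(np.array([0.9, 2.1])))               # ~0.02
```

Away from half-integer points the penalty is smooth, with gradient 2(x − round(x)) per component, which is what makes it usable inside a gradient-based composition optimiser.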
Title: A straightforward gradient-based approach for designing superconductors with high critical temperature: exploiting domain knowledge via adaptive constraints. Digital Discovery, 12, 3662–3673. DOI: 10.1039/D5DD00250H. Published 29 October 2025.
Filipp Gusev, Benjamin C. Kline, Ryan Quinn, Anqin Xu, Ben Smith, Brian Frezza and Olexandr Isayev
Automation of experiments in cloud laboratories promises to revolutionize scientific research by enabling remote experimentation and improving reproducibility. However, maintaining quality control without constant human oversight remains a critical challenge. Here, we present a novel machine learning framework for automated anomaly detection in High-Performance Liquid Chromatography (HPLC) experiments conducted in a cloud lab. Our system specifically targets air bubble contamination—a common yet challenging issue that typically requires expert analytical chemists to detect and resolve. By leveraging active learning combined with human-in-the-loop annotation, we trained a binary classifier on approximately 25 000 HPLC traces. Prospective validation demonstrated robust performance, with an accuracy of 0.96 and an F1 score of 0.92, suitable for real-world applications. Beyond anomaly detection, we show that the system can serve as a sensitive indicator of instrument health, outperforming traditional periodic qualification tests in identifying systematic issues. The framework is protocol-agnostic, instrument-agnostic, and, in principle, vendor-neutral, making it adaptable to various laboratory settings. This work represents a significant step toward fully autonomous laboratories by enabling continuous quality control, reducing the expertise barrier for complex analytical techniques, and facilitating proactive maintenance of scientific instrumentation. The approach can be extended to detect other types of experimental anomalies, potentially transforming how quality control is implemented in self-driving laboratories (SDLs) across diverse scientific disciplines.
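For reference, the reported accuracy (0.96) and F1 score (0.92) combine precision and recall as follows for a binary anomaly classifier. This is just the standard metric arithmetic applied to toy labels, not the paper's data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and F1 for a binary classifier (1 = anomalous HPLC trace)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy labels: three true anomalies, one missed, one false alarm.
acc, f1 = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                         [1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
print(acc, f1)  # 0.8 and ~0.67
```

F1 is the more informative figure here because anomalous traces are rare, so a trivial "never anomalous" classifier could still score high accuracy.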
Title: Machine learning anomaly detection of automated HPLC experiments in the cloud laboratory. Digital Discovery, 12, 3445–3454. DOI: 10.1039/D5DD00253B. Published 29 October 2025.
Xu Dong, Wenyan Zhao, Feifei Li, LiHong Hu, Hongzhi Li and Guangzhe Li
Drug repurposing can dramatically decrease cost and risk in drug discovery and is very helpful for recommending candidate drugs. However, because traditional Chinese medicine (TCM) formulas are multi-component, repurposing methods developed for western medicine are usually not applicable to TCM formulas. In this study, we propose a concept and strategy for multi-component formula/recipe discovery based on networks and semantics. With this concept, we establish a semantic formula-repurposing model for TCM based on a link-prediction algorithm and a knowledge graph (KG). The proposed model, integrating semantic embedding with KG networks, facilitates the effective repurposing of TCM formulas. First, we construct a KG of more than 46 600 ancient formulas, comprising over 120 000 entities, 415 900 triples and 12 relations extracted from unstructured textual data by deep-learning techniques. Then, a link-prediction model is trained on the KG triples to produce entity and edge semantic vectors. The formula-repurposing task is framed as computing the similarity of semantic vectors in the KG between entities and query formulas. In the current version of the model, two modes of repurposing are tested: one searches for a formula similar to the query, and the other seeks a candidate formula for rare or emerging diseases or epidemics. The former is based on the name of a formula; the latter is carried out through symptom entities. The experiments are exemplified with an existing formula, Fufang Danshen Tablets, and the symptoms of COVID-19. The results agree well with existing clinical practice. This suggests our model offers a comprehensive approach to constructing a knowledge graph of TCM formulas and a TCM formula-repurposing strategy, able to assist compound formula development and facilitate further research in multi-compound drug/prescription discovery.
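The similarity search at the heart of this repurposing strategy can be sketched with cosine similarity over entity embeddings. The vectors and entity names below are hypothetical placeholders; in the paper they would come from the link-prediction model trained on the KG triples.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two semantic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical KG entity embeddings (in practice, learned by the
# link-prediction model over formula/herb/symptom triples).
embeddings = {
    "formula_A": np.array([0.9, 0.1, 0.2]),
    "formula_B": np.array([0.8, 0.2, 0.1]),
    "formula_C": np.array([0.1, 0.9, 0.7]),
}
query = embeddings["formula_A"]
ranked = sorted((k for k in embeddings if k != "formula_A"),
                key=lambda k: cosine_sim(query, embeddings[k]),
                reverse=True)
print(ranked)  # ['formula_B', 'formula_C']
```

The symptom-driven mode of the model would work the same way, except the query vector is built from symptom entities rather than from a formula name.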
Title: Semantic repurposing model for traditional Chinese ancient formulas based on a knowledge graph. Digital Discovery, 1, 317–331. DOI: 10.1039/D5DD00344J. Published 29 October 2025.
Shashank G. Mehendale, Luis A. Martínez-Martínez, Prathami Divakar Kamath and Artur F. Izmaylov
The Trotter approximation, in conjunction with quantum phase estimation, can be used to extract eigenenergies of a many-body Hamiltonian on a quantum computer. Several approaches have been proposed to assess the quality of this approximation based on estimating the norm of the difference between the exact and approximate evolution operators. Here, we explore how different error estimators for various partitionings correlate with the true error in the ground state energy due to the Trotter approximation. For a set of small molecules, we calculate the exact errors in ground-state electronic energies due to the second-order Trotter approximation. Comparison of these errors with previously used upper bounds shows correlations of less than 0.5 across various Hamiltonian partitionings. On the other hand, basing the Trotter error estimate on perturbation theory up to second order in the time step for the eigenvalues provides estimates with very good correlations with the exact Trotter approximation errors. These findings highlight the unfaithfulness of norm-based estimates for predicting the best Hamiltonian partitionings and the need for perturbative estimates.
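The exact ground-state energy error of the second-order (Strang) Trotter formula can be seen in a two-level toy model: the eigenphases of exp(−iAt/2) exp(−iBt) exp(−iAt/2) imply energies that deviate from those of H = A + B at O(t²). The Pauli partitioning below is an illustrative assumption, not one of the paper's molecular Hamiltonians.

```python
import numpy as np

# Toy two-term partitioning H = A + B with A = Pauli-Z, B = Pauli-X.
A = np.array([[1, 0], [0, -1]], dtype=complex)
B = np.array([[0, 1], [1, 0]], dtype=complex)
H = A + B

def pauli_exp(P, theta):
    """exp(-i*theta*P) for any matrix with P @ P = I (Euler identity)."""
    return np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * P

def trotter2_ground_energy(t):
    """Ground-state energy implied by one second-order Trotter step:
    take the eigenphases of the product formula and divide by -t."""
    U = pauli_exp(A, t / 2) @ pauli_exp(B, t) @ pauli_exp(A, t / 2)
    energies = -np.angle(np.linalg.eigvals(U)) / t
    return energies.min()

exact = np.linalg.eigvalsh(H).min()          # -sqrt(2)
err_2t = abs(trotter2_ground_energy(0.2) - exact)
err_t = abs(trotter2_ground_energy(0.1) - exact)
# Halving the time step shrinks the energy error roughly 4x: the O(t^2)
# signature of the second-order formula's eigenvalue error.
print(err_2t / err_t)
```

This "exact eigenvalue error" (product-formula eigenphase minus true eigenvalue) is the quantity the norm-based bounds and perturbative estimates are being tested against.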
Title: Estimating Trotter approximation errors to optimize Hamiltonian partitioning for lower eigenvalue errors. Digital Discovery, 12, 3540–3551. DOI: 10.1039/D5DD00185D. Published 28 October 2025.
Fabian P. Krüger, Nicklas Österbacka, Mikhail Kabeshov, Ola Engkvist and Igor Tetko
Predicting molecular properties is a key challenge in drug discovery. Machine learning models, especially those based on transformer architectures, are increasingly used to make these predictions from chemical structures. Inspired by recent progress in natural language processing, many studies have adopted encoder-only transformer architectures similar to BERT (Bidirectional Encoder Representations from Transformers) for this task. These models are pretrained using masked language modeling, where parts of the input are hidden and the model learns to recover them before fine-tuning on downstream tasks. In this work, we systematically investigate whether core assumptions from natural language processing, which are commonly adopted in molecular BERT-based models, actually hold when applied to molecules represented using the Simplified Molecular Input Line Entry System (SMILES). Specifically, we examine how masking ratio, pretraining dataset size, and model size affect performance in molecular property prediction. We find that higher masking ratios than commonly used significantly improve performance. In contrast, increasing model or pretraining dataset size quickly leads to diminishing returns, offering no consistent benefit while incurring significantly higher computational cost. Based on these insights, we develop MolEncoder, a BERT-based model that outperforms existing approaches on drug discovery tasks while being more computationally efficient. Our results highlight key differences between molecular pretraining and natural language processing, showing that they require different design choices. This enables more efficient model development and lowers barriers for researchers with limited computational resources. We release MolEncoder publicly to support future work and hope our findings help make molecular representation learning more accessible and cost-effective in drug discovery.
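The masking step of this pretraining objective — hiding a fraction of SMILES tokens for the model to recover — can be sketched with the masking ratio as an explicit parameter. The character-level tokenisation and `[MASK]` symbol are simplifying assumptions; real SMILES tokenisers treat multi-character atoms and ring-bond digits specially.

```python
import random

def mask_tokens(tokens, ratio, mask_token="[MASK]", seed=0):
    """Hide a `ratio` fraction of token positions, returning the corrupted
    sequence and the indices the model is trained to recover."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in idx else t for i, t in enumerate(tokens)]
    return masked, sorted(idx)

# Character-level tokenisation of a SMILES string (illustrative).
toks = list("CC(=O)Oc1ccccc1")
masked, idx = mask_tokens(toks, ratio=0.5)
print(sum(t == "[MASK]" for t in masked), "of", len(toks), "tokens hidden")
```

Sweeping `ratio` in such a setup is exactly the experiment the study performs: the finding is that ratios higher than the NLP-inherited default improve downstream property prediction.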
Title: MolEncoder: towards optimal masked language modeling for molecules. Digital Discovery, 12, 3552–3566. DOI: 10.1039/D5DD00369E. Published 28 October 2025.
Nils Dunlop, Francisco Erazo, Farzaneh Jalalypour and Rocío Mercado
Accurate prediction of protein–ligand and protein–protein interactions is essential for computational drug discovery, yet remains a significant challenge, particularly for complexes involving large, flexible ligands. In this study, we assess the capabilities of AlphaFold 3 (AF3) and Boltz-1 for modeling ligand-mediated ternary complexes, focusing on proteolysis-targeting chimeras (PROTACs). PROTACs facilitate targeted protein degradation by recruiting an E3 ubiquitin ligase to a protein of interest, offering a promising therapeutic strategy for previously undruggable intracellular targets. However, their size, flexibility, and cooperative binding requirements pose significant challenges for computational modeling. To address this, we systematically evaluated AF3 and Boltz-1 on 62 PROTAC complexes from the Protein Data Bank. Both models achieve high structural accuracy by integrating ligand input during inference, as measured by RMSD, pTM, and DockQ scores, even for post-2021 structures absent from AF3 and Boltz-1 training data. AF3 demonstrates superior ligand positioning, producing 33 ternary complexes with RMSD < 1 Å and 46 with RMSD < 4 Å, compared to Boltz-1's 25 and 40, respectively. We explore different input strategies by comparing molecular string representations and explicit ligand atom positions, finding that the latter yields more accurate ligand placement and predictions. By analyzing the relationships between ligand positioning, protein–ligand interactions, and structural accuracy metrics, we provide insights into key factors influencing AF3's and Boltz-1's performance in modeling PROTAC-mediated binary and ternary complexes. To ensure reproducibility, we publicly release our pipeline and results via a GitHub repository and website (https://protacfold.xyz), providing a framework for future PROTAC structure prediction studies.
{"title":"Predicting PROTAC-mediated ternary complexes with AlphaFold3 and Boltz-1","authors":"Nils Dunlop, Francisco Erazo, Farzaneh Jalalypour and Rocío Mercado","doi":"10.1039/D5DD00300H","DOIUrl":"https://doi.org/10.1039/D5DD00300H","url":null,"abstract":"<p >Accurate prediction of protein–ligand and protein–protein interactions is essential for computational drug discovery, yet remains a significant challenge, particularly for complexes involving large, flexible ligands. In this study, we assess the capabilities of AlphaFold 3 (AF3) and Boltz-1 for modeling ligand-mediated ternary complexes, focusing on proteolysis-targeting chimeras (PROTACs). PROTACs facilitate targeted protein degradation by recruiting an E3 ubiquitin ligase to a protein of interest, offering a promising therapeutic strategy for previously undruggable intracellular targets. However, their size, flexibility, and cooperative binding requirements pose significant challenges for computational modeling. To address this, we systematically evaluated AF3 and Boltz-1 on 62 PROTAC complexes from the Protein Data Bank. Both models achieve high structural accuracy by integrating ligand input during inference, as measured by RMSD, pTM, and DockQ scores, even for post-2021 structures absent from AF3 and Boltz-1 training data. AF3 demonstrates superior ligand positioning, producing 33 ternary complexes with RMSD < 1 Å and 46 with RMSD < 4 Å, compared to Boltz-1's 25 and 40, respectively. We explore different input strategies by comparing molecular string representations and explicit ligand atom positions, finding that the latter yields more accurate ligand placement and predictions. By analyzing the relationships between ligand positioning, protein–ligand interactions, and structural accuracy metrics, we provide insights into key factors influencing AF3's and Boltz-1's performance in modeling PROTAC-mediated binary and ternary complexes. To ensure reproducibility, we publicly release our pipeline and results <em>via</em> a GitHub repository and website (https://protacfold.xyz), providing a framework for future PROTAC structure prediction studies.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3782-3809"},"PeriodicalIF":6.2,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00300h?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
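The RMSD success thresholds quoted above (e.g. RMSD < 1 Å) can be made concrete with a minimal NumPy sketch. This assumes matched atom ordering and structures already aligned on a common receptor frame, glossing over the superposition step a real evaluation pipeline performs:

```python
import numpy as np

def ligand_rmsd(pred, ref):
    """Root-mean-square deviation between matched ligand atom
    coordinates (N x 3 arrays, in angstroms), assuming a fixed atom
    ordering and pre-aligned structures."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    assert pred.shape == ref.shape and pred.shape[1] == 3
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

# A uniform 0.5 A displacement of every atom gives an RMSD of 0.5 A,
# well inside the strictest (< 1 A) success threshold used in the study.
ref  = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
pred = ref + np.array([0.5, 0.0, 0.0])
print(ligand_rmsd(pred, ref))  # 0.5
```

pTM and DockQ involve model confidence heads and interface-contact bookkeeping, respectively, and are not reducible to a one-liner like this.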
We present a physics-informed machine learning approach to predict the glass transition temperature (Tg) of sodium borosilicate glasses. Four models—random forest, extreme gradient boosting, support vector machines, and K-nearest neighbors—were trained using both compositional and structural features derived from statistical mechanics. Incorporating these structural descriptors significantly improved model performance. This is evident from the reduction in mean absolute error (14.85 K → 13.76 K) and root mean square error (21.78 K → 19.12 K), and the increase in R2 (0.88 → 0.91), measured on the test dataset for the random forest model. A similar performance improvement was seen for the other models. Building on this, we propose a three-step predictive strategy that enhances generalization across compositions and accurately predicts the Tg of unseen compositions, achieving a mean absolute error of approximately 8 K and an R2 value of around 0.98. Our method demonstrates improved accuracy when benchmarked against GlassNet, which represents the current state-of-the-art in property prediction for glasses. These results highlight the importance of considering structural information in improving the prediction capabilities of machine learning models for composition-specific small datasets. This approach can assist in the rapid screening and design of glass materials, reducing the reliance on time-consuming experiments and guiding future research toward targeted property optimization.
{"title":"An improved machine learning strategy using structural features to predict the glass transition temperature of oxide glasses","authors":"Satwinder Singh Danewalia and Kulvir Singh","doi":"10.1039/D5DD00326A","DOIUrl":"https://doi.org/10.1039/D5DD00326A","url":null,"abstract":"<p >We present a physics-informed machine learning approach to predict the glass transition temperature (<em>T</em><small><sub><em>g</em></sub></small>) of sodium borosilicate glasses. Four models—random forest, extreme gradient boosting, support vector machines, and K-nearest neighbors—were trained using both compositional and structural features derived from statistical mechanics. Incorporating these structural descriptors significantly improved model performance. This is evident from the reduction in mean absolute error (14.85 K → 13.76 K) and root mean square error (21.78 K → 19.12 K), and the increase in <em>R</em><small><sup>2</sup></small> (0.88 → 0.91), measured on the test dataset for the random forest model. A similar performance improvement was seen for the other models. Building on this, we propose a three-step predictive strategy that enhances generalization across compositions and accurately predicts the <em>T</em><small><sub><em>g</em></sub></small> of unseen compositions, achieving a mean absolute error of approximately 8 K and an <em>R</em><small><sup>2</sup></small> value of around 0.98. Our method demonstrates improved accuracy when benchmarked against GlassNet, which represents the current state-of-the-art in property prediction for glasses. These results highlight the importance of considering structural information in improving the prediction capabilities of machine learning models for composition-specific small datasets. This approach can assist in the rapid screening and design of glass materials, reducing the reliance on time-consuming experiments and guiding future research toward targeted property optimization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3764-3773"},"PeriodicalIF":6.2,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00326a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
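As a rough sketch of the modeling setup described above — tree-based regression on compositional plus structural features — the following uses scikit-learn on synthetic stand-in data. The feature names and the structural proxy are illustrative assumptions, not the paper's actual descriptors or dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: compositional features (mol% Na2O, B2O3, SiO2)
# plus one toy "structural" descriptor standing in for a statistical-
# mechanics-derived quantity. None of this is the paper's real data.
n = 300
na2o = rng.uniform(5, 30, n)
b2o3 = rng.uniform(10, 40, n)
sio2 = 100 - na2o - b2o3
struct = np.clip(na2o / (b2o3 + 1e-9), 0, 1)   # illustrative proxy only
tg = 700 + 3.0 * sio2 - 2.0 * na2o + 150 * struct + rng.normal(0, 5, n)

X = np.column_stack([na2o, b2o3, sio2, struct])
X_tr, X_te, y_tr, y_te = train_test_split(X, tg, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"test MAE: {mae:.1f} K")
```

Comparing this model with and without the structural column is the kind of ablation that produces the MAE/RMSE/R2 deltas the abstract reports.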
Maximiliam Fleck, Marcelle B. M. Spera, Samir Darouich, Timo Klenk and Niels Hansen
Data-driven approaches used to predict thermophysical properties benefit from physical constraints because the extrapolation behavior can be improved and the amount of training data can be reduced. In the present work, the well-established entropy scaling approach is incorporated into a neural network architecture to predict the shear viscosity of a diverse set of pure fluids over a large temperature and pressure range. Instead of imposing a particular form of the reference entropy and reference shear viscosity, these properties are learned. The resulting architecture can be interpreted as two linked DeepONets with generalization capabilities.
{"title":"Generalized DeepONets for viscosity prediction using learned entropy scaling references","authors":"Maximiliam Fleck, Marcelle B. M. Spera, Samir Darouich, Timo Klenk and Niels Hansen","doi":"10.1039/D5DD00179J","DOIUrl":"https://doi.org/10.1039/D5DD00179J","url":null,"abstract":"<p >Data-driven approaches used to predict thermophysical properties benefit from physical constraints because the extrapolation behavior can be improved and the amount of training data can be reduced. In the present work, the well-established entropy scaling approach is incorporated into a neural network architecture to predict the shear viscosity of a diverse set of pure fluids over a large temperature and pressure range. Instead of imposing a particular form of the reference entropy and reference shear viscosity, these properties are learned. The resulting architecture can be interpreted as two linked DeepONets with generalization capabilities.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3578-3587"},"PeriodicalIF":6.2,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00179j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
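The entropy-scaling idea underlying this architecture can be sketched briefly: for many fluids, a suitably reduced shear viscosity collapses onto a single curve as a function of the reduced residual entropy. The paper learns the reference entropy and reference viscosity with neural networks; the toy below instead assumes a fixed linear (Rosenfeld-type) form and entirely synthetic data:

```python
import numpy as np

# Entropy scaling in its simplest Rosenfeld form: ln(eta*) = a + b * s*,
# where eta* is a reduced viscosity and s* a reduced residual entropy.
# The synthetic "data" below are generated from that form plus noise,
# purely to illustrate the one-dimensional collapse the method exploits.
rng = np.random.default_rng(1)
s_star = rng.uniform(0.5, 4.0, 200)                        # reduced residual entropy
ln_eta_star = -0.5 + 0.8 * s_star + rng.normal(0, 0.02, 200)

coeffs = np.polyfit(s_star, ln_eta_star, deg=1)            # recover a, b
eta_star_pred = np.exp(np.polyval(coeffs, 2.0))            # predict at s* = 2
print(f"slope ~ {coeffs[0]:.2f}, eta* at s*=2: {eta_star_pred:.2f}")
```

In the paper's generalized setting, the fixed reference quantities implicit in this collapse are themselves outputs of learned operators, which is what allows transfer across a diverse set of pure fluids.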