Ahmet Buğra Ortaakarsu, Michel Hosny, Mansour Sobeh and Mohamed A. O. Abdelfattah
Non-alcoholic fatty liver disease (NAFLD) is a prevalent metabolic disorder with limited therapeutic options. Thyroid receptor β (THR-β) agonists have shown promise for controlling NAFLD via improving hepatic lipid metabolism. This study utilized different in silico tools to screen 47 199 natural compounds from the ZINC15 database to identify potential THR-β agonists. Molecular docking, molecular dynamics simulations, and advanced analyses such as PCA, TICA-FES, and MSM revealed that 4-O-caffeoylquinic acid (compound 2) and dihydroxydehydrodiconiferyl alcohol (compound 18) are the most promising hits. Both demonstrated high binding affinity and stable agonist interactions with key THR-β residues such as Arg316 and Arg320, which stabilize the ligand binding pocket and support the agonist potential, comparable to the reference agonists resmetirom and {3,5-dichloro-4-[4-hydroxy-3-(propan-2-yl)phenoxy]phenyl}acetic acid. Long-term MD simulations confirmed their stability, and MM/GBSA calculations supported robust thermodynamic profiles. Moreover, the two hits displayed superior selectivity for THR-β over THR-α and favorable pharmacokinetic profiles with minimal toxicity alerts. These findings support compounds 2 and 18 as strong candidates for NAFLD therapy, warranting further experimental validation.
"Database mining of ZINC15 natural compounds reveals potential thyroid receptor β agonists for NAFLD management: an in silico study", Digital Discovery, 2025, issue 12, pp. 3635–3651. DOI: 10.1039/D5DD00146C (published 2025-11-03).
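The MSM step mentioned in the abstract above reduces, at its core, to counting transitions between discretized conformational states at a chosen lag time. A minimal pure-Python sketch (state labels and lag are illustrative, not taken from the paper):

```python
def transition_matrix(dtraj, n_states, lag=1):
    """Estimate a row-stochastic transition probability matrix from a
    discretized trajectory (list of integer state labels) at lag `lag`."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a][b] += 1
    T = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all-zero.
        T.append([c / total if total else 0.0 for c in row])
    return T
```

For example, `transition_matrix([0, 0, 1, 1, 0, 1, 1, 1, 0, 0], 2)` yields row-normalized probabilities such as `T[0] == [0.5, 0.5]`. Production MSM tools additionally symmetrize counts and analyze eigenvalues for implied timescales.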
Panagiotis Krokidas, Vassilis Gkatsis, John Theocharis and George Giannakopoulos
Machine learning (ML) has the potential to accelerate the discovery of high-performance materials by learning complex structure–property relationships and prioritizing candidates for costly experiments or simulations. However, ML efficiency is often offset by the need for large, high-quality training datasets, motivating strategies that intelligently select the most informative samples. Here, we formulate the search for top-performing functionalized nanoporous materials (metal–organic and covalent–organic frameworks) as a global optimization problem and apply Bayesian Optimization (BO) to identify regions of interest and rank candidates with minimal evaluations. We highlight the importance of a proper and efficient initialization scheme for the BO process, and we demonstrate how BO-acquired samples can also be used to train an XGBoost regression model that further enriches the efficient mapping of the high-performing region of the design space. Across multiple literature-derived adsorption and diffusion datasets containing thousands of structures, our BO framework identifies 2× to 3× more materials within a top-100 or top-10 ranking list than random-sampling-based ML pipelines, and it achieves significantly higher ranking quality. Moreover, the surrogate enrichment strategy further boosts top-N recovery while maintaining high ranking fidelity. By shifting the evaluation focus from average predictive metrics (e.g., R2, MSE) to task-specific criteria (e.g., recall@N and nDCG), our approach offers a practical, data-efficient, and computationally accessible route to guide experimental and computational campaigns toward the most promising materials.
"Navigating materials design spaces with efficient Bayesian optimization: a case study in functionalized nanoporous materials", Digital Discovery, 2025, issue 12, pp. 3753–3763. DOI: 10.1039/D5DD00237K (published 2025-11-03).
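The task-specific criteria the authors advocate, recall@N and nDCG, take only a few lines to compute; a self-contained sketch (identifiers are illustrative):

```python
import math

def recall_at_n(ranked_ids, true_top_ids, n):
    """Fraction of the true top set recovered within the first n ranked items."""
    return len(set(ranked_ids[:n]) & set(true_top_ids)) / len(true_top_ids)

def ndcg_at_n(ranked_relevance, n):
    """Normalized discounted cumulative gain over the first n positions;
    `ranked_relevance` holds graded relevance in predicted rank order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:n]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:n]))
    return dcg / idcg if idcg else 0.0
```

A perfect ranking gives nDCG of 1.0; recall@N directly measures how many of the true top-N materials the pipeline surfaces, which is what matters when only N candidates can be evaluated downstream.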
Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker and Christoph Stampfer
The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often struggle to identify low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator that produces realistic microscopy images from unlabeled data. This results in a model that can quickly adapt to new materials with as few as 5 to 10 images. An uncertainty estimation model then classifies the predictions based on optical contrast. We evaluate our method on eight datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.
"MaskTerial: a foundation model for automated 2D material flake detection", Digital Discovery, 2025, issue 12, pp. 3744–3752. DOI: 10.1039/D5DD00156K (published 2025-11-03).
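The optical-contrast classification step can be illustrated with a toy Weber-contrast calculation and nearest-reference assignment. The reference values below are invented for illustration; MaskTerial's actual classifier additionally models uncertainty:

```python
def optical_contrast(flake_rgb, substrate_rgb):
    """Per-channel Weber contrast of a flake region against bare substrate:
    (I_flake - I_substrate) / I_substrate."""
    return tuple((f - s) / s for f, s in zip(flake_rgb, substrate_rgb))

def classify_by_contrast(contrast, references):
    """Assign the flake to the reference class whose contrast vector is
    nearest in Euclidean distance; `references` maps class -> contrast tuple."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(references, key=lambda cls: dist(contrast, references[cls]))
```

For instance, a flake measured at RGB (110, 120, 130) on a (100, 100, 100) substrate has contrast (0.1, 0.2, 0.3) and would be assigned to whichever layer class sits closest in contrast space.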
Alice Gauthier, Laure Vancauwenberghe, Jean-Charles Cousty, Cyril Matthey-Doret, Robin Franken, Sabine Maennel, Pascal Miéville and Oksana Riba Grognuz
The growing demand for reproducible, high-throughput chemical experimentation calls for scalable digital infrastructures that support automation, traceability, and AI-readiness. A dedicated research data infrastructure (RDI) developed within Swiss Cat+ is presented, integrating automated synthesis, multi-stage analytics, and semantic modeling. It captures each experimental step in a structured, machine-interpretable format, forming a scalable and interoperable data backbone. By systematically recording both successful and failed experiments, the RDI ensures data completeness, strengthens traceability, and enables the creation of bias-resilient datasets essential for robust AI model development. Built on Kubernetes and Argo Workflows and aligned with FAIR principles, the RDI transforms experimental metadata into validated Resource Description Framework (RDF) graphs using an ontology-driven semantic model. These graphs are accessible through a web interface and SPARQL endpoint, facilitating integration with downstream AI and analysis pipelines. Key features include a modular RDF converter and ‘Matryoshka files’, which encapsulate complete experiments with raw data and metadata in a portable, standardized ZIP format. This approach supports scalable querying and sets the stage for standardized data sharing and autonomous experimentation.
"A FAIR research data infrastructure for high-throughput digital chemistry", Digital Discovery, 2025, issue 12, pp. 3502–3514. DOI: 10.1039/D5DD00297D (published 2025-11-03).
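The 'Matryoshka file' idea, raw data plus metadata in one portable ZIP, can be sketched with the standard library. The internal layout below ("metadata.json", "raw/") is an assumption for illustration, not Swiss Cat+'s actual schema:

```python
import io
import json
import zipfile

def pack_experiment(metadata: dict, raw_files: dict) -> bytes:
    """Bundle experiment metadata (as JSON) and raw data files into a
    single portable ZIP archive, returned as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
        for name, data in raw_files.items():
            zf.writestr(f"raw/{name}", data)  # data: bytes or str
    return buf.getvalue()

def unpack_metadata(blob: bytes) -> dict:
    """Read the metadata record back out of a packed experiment."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return json.loads(zf.read("metadata.json"))
```

Because failed runs are recorded the same way as successes, a status field in the metadata is enough to keep both in the dataset without special-casing.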
Correction for “Beyond training data: how elemental features enhance ML-based formation energy predictions” by Hamed Mahdavi et al., Digital Discovery, 2025, 4, 2972–2982, https://doi.org/10.1039/D5DD00182J.
"Correction: Beyond training data: how elemental features enhance ML-based formation energy predictions", Hamed Mahdavi, Vasant Honavar and Dane Morgan, Digital Discovery, 2025, issue 12, p. 3828. DOI: 10.1039/D5DD90047F (published 2025-10-31).
Metal–organic frameworks (MOFs) exhibit immense structural diversity and hold promise for applications ranging from gas storage and separation to energy storage and conversion. However, structural flexibility makes accurate and scalable property prediction difficult. While machine learning potentials (MLPs) offer a compelling balance between accuracy and efficiency, most existing models are system-specific and lack transferability across different MOFs. In this work, we introduce FFLAME – Fragment-to-Framework Learning Approach for MOF Potentials, a fragment-centric strategy for training transferable MLPs. By decomposing MOFs into their constituent metal clusters and organic linkers, FFLAME enables efficient reuse of chemical environments and significantly reduces the need for full-framework training data. We demonstrate that fragment-informed training improves model generalizability, particularly in data-scarce regimes, and accelerates convergence during fine-tuning. FFLAME achieves near-target accuracy on unseen MOFs with minimal additional training. These results establish a robust and data-efficient pathway toward general-purpose MLPs for the simulation of diverse framework materials.
"FFLAME: a fragment-to-framework learning approach for MOF potentials", Xiaoqi Zhang, Yutao Li, Xin Jin and Berend Smit, Digital Discovery, 2025, issue 12, pp. 3466–3477. DOI: 10.1039/D5DD00321K (published 2025-10-30).
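The fragment-to-framework idea can be caricatured as composing a framework-level estimate from reusable fragment contributions. Fragment names and energies below are invented; the actual FFLAME model trains machine-learned potentials on atomic environments rather than summing per-fragment scalars:

```python
def framework_estimate(fragments, fragment_energies):
    """Toy composition step: approximate a framework-level property as a
    sum of per-fragment contributions, reusing any fragment seen before.
    Raises KeyError for fragments that still need training data."""
    missing = [f for f in fragments if f not in fragment_energies]
    if missing:
        raise KeyError(f"fragments need training data: {missing}")
    return sum(fragment_energies[f] for f in fragments)
```

The payoff mirrors the paper's argument: once a metal cluster or linker is in the library, every framework containing it reuses that data, so only genuinely new fragments require additional reference calculations.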
Ricardo Montoya-Gonzalez, Rosa de Guadalupe González-Huerta, Martha Leticia Hernández-Pichardo and Subha R. Das
The integration of machine learning (ML) into materials science has the potential to accelerate materials discovery and property optimization. However, the reliability of ML models depends heavily on the consistency and reproducibility of experimental data. In this study, we present a methodology that combines automated, remotely programmed synthesis protocols with ML to enable data-driven materials discovery. Experiments were programmed and conducted remotely through robotic syntheses at cloud laboratories, using different liquid handlers and spectrometers across two independent facilities (Emerald Cloud Lab, Austin, TX and Carnegie Mellon University Automated Science Lab, Pittsburgh, PA). This multi-instrument approach ensured precise control over reaction parameters, eliminated both operator- and instrument-specific variability, and enabled generation of high-quality datasets for ML training. From only 40 training samples, our approach predicts whether specific synthesis parameters will lead to successful formation of copper nanoclusters (CuNCs), with interpretable models providing mechanistic insights through SHAP analysis. Our workflow demonstrates how remotely accessed cloud laboratory infrastructure coupled with ML can transform traditionally manual processes into autonomous, predictive systems. This multi-instrument validation demonstrates reproducibility critical for reliable ML-driven materials discovery and for advancing automated materials synthesis beyond single-laboratory demonstrations.
"Cross-laboratory validation of machine learning models for copper nanocluster synthesis using cloud-based automated platforms", Digital Discovery, 2025, issue 12, pp. 3683–3692. DOI: 10.1039/D5DD00335K (published 2025-10-30).
Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, e.g., through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: enzyme commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning via LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.
"Leveraging large language models for enzymatic reaction prediction and characterization", Lorenzo Di Fruscia and Jana M. Weber, Digital Discovery, 2025, issue 12, pp. 3588–3609. DOI: 10.1039/D5DD00187K (published 2025-10-30).
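Hierarchical EC evaluation, i.e. scoring how many leading levels of a predicted four-level EC number agree with the ground truth, can be sketched in a few lines (function names are illustrative, not the paper's):

```python
def ec_match_depth(pred: str, true: str) -> int:
    """Number of leading EC levels (out of 4) on which the prediction and
    ground truth agree, e.g. '1.1.1.1' vs '1.1.3.2' -> 2."""
    depth = 0
    for p, t in zip(pred.split("."), true.split(".")):
        if p != t:
            break
        depth += 1
    return depth

def hierarchical_accuracy(pairs, level: int) -> float:
    """Fraction of (pred, true) pairs correct down to the given EC level."""
    return sum(ec_match_depth(p, t) >= level for p, t in pairs) / len(pairs)
```

Reporting accuracy per level separates the easy coarse classes (level 1, the main enzyme class) from the hard fine-grained ones (level 4, the serial number), which is where the abstract notes LLMs still struggle.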
Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das and Arnaud Demortière
This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
"Unsupervised multi-clustering and decision-making strategies for 4D-STEM orientation mapping", Digital Discovery, 2025, issue 12, pp. 3610–3622. DOI: 10.1039/D5DD00071H (published 2025-10-30).
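The NMF core of such a pipeline is the classic Lee–Seung multiplicative update, which keeps both factors non-negative. A dependency-free sketch on a toy matrix follows; the paper applies this at scale to 4D-STEM diffraction data and selects k with IQA metrics rather than raw reconstruction error:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix V into W (m x k) @ H (k x n)
    using Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), with the freshly updated H
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

def recon_error(V, W, H):
    """Squared Frobenius reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
```

Sweeping k and watching the drop in reconstruction error (the "K-component loss" idea) gives a first estimate of the number of distinct orientations before IQA-based refinement.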
Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail
Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic-level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard-corrected density functional theory (DFT+U) in a numerical atom-centred orbital framework has been shown to address this challenge, but is susceptible to numerical instability when simulating common transition metal oxides (TMOs, e.g., TiO2) and rare-earth metal oxides (REOs, e.g., CeO2), necessitating the development of advanced DFT+U parameterisation strategies. In this work, the numerical instabilities of DFT+U are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO2 using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO2, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard U values and projectors is presented. The method's transferability is shown for 10 prototypical TMOs and REOs, with demonstrable accuracy for unseen materials that extends to complex battery cathode materials like LiCo1−xMgxO2−x.
The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+U parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.
{"title":"Machine learning generalised DFT+U projectors in a numerical atom-centred orbital framework","authors":"Amit Chaudhari, Kushagra Agrawal and Andrew J. Logsdail","doi":"10.1039/D5DD00292C","DOIUrl":"https://doi.org/10.1039/D5DD00292C","url":null,"abstract":"<p >Accurate electronic structure simulations of strongly correlated metal oxides are crucial for the atomic level understanding of heterogeneous catalysts, batteries and photovoltaics, but remain challenging to perform in a computationally tractable manner. Hubbard corrected density functional theory (DFT+<em>U</em>) in a numerical atom-centred orbital framework has been shown to address this challenge but is susceptible to numerical instability when simulating common transition metal oxides (TMOs), <em>e.g.</em>, TiO<small><sub>2</sub></small> and rare-earth metal oxides (REOs), <em>e.g.</em>, CeO<small><sub>2</sub></small>, necessitating the development of advanced DFT+<em>U</em> parameterisation strategies. In this work, the numerical instabilities of DFT+<em>U</em> are traced to the default atomic Hubbard projector, which we refine for Ti 3d orbitals in TiO<small><sub>2</sub></small> using Bayesian optimisation, with a cost function and constraints defined using symbolic regression (SR) and support vector machines, respectively. The optimised Ti 3d Hubbard projector enables the numerically stable simulation of electron polarons at intrinsic and extrinsic defects in both anatase and rutile TiO<small><sub>2</sub></small>, with comparable accuracy to hybrid-DFT at several orders of magnitude lower computational cost. We extend the method by defining a general first-principles approach for optimising Hubbard projectors, based on reproducing orbital occupancies calculated using hybrid-DFT. 
Using a hierarchical SR-defined cost function that depends on DFT-predicted orbital occupancies, basis set parameters and atomic material descriptors, a generalised workflow for the one-shot computation of Hubbard <em>U</em> values and projectors is presented. The method's transferability is demonstrated for 10 prototypical TMOs and REOs, with accuracy for unseen materials that extends to complex battery cathode materials such as LiCo<small><sub>1−<em>x</em></sub></small>Mg<small><sub><em>x</em></sub></small>O<small><sub>2−<em>x</em></sub></small>. The work highlights the integration of advanced machine learning algorithms to develop cost-effective and transferable workflows for DFT+<em>U</em> parameterisation, enabling more accurate and efficient simulations of strongly correlated metal oxides.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 3701-3727"},"PeriodicalIF":6.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00292c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
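The projector-optimisation idea in the abstract above — tune a Hubbard-projector parameter so that DFT+U orbital occupancies reproduce a hybrid-DFT reference — can be sketched in miniature. Everything below is a hypothetical stand-in, not the paper's code: `occupancy_model` replaces an actual DFT+U calculation, the target occupancy of 1.8 electrons and the radius window are invented, and a 1-D golden-section search substitutes for the multi-parameter Bayesian optimisation with SR-defined cost used in the work.

```python
# Toy sketch of the projector-fitting loop (all numbers illustrative).

def occupancy_model(radius):
    """Stand-in for a DFT+U run: maps a projector radius (bohr)
    to a predicted Ti 3d occupancy (electrons)."""
    return 1.0 + 0.8 / (1.0 + (radius - 2.1) ** 2)

def cost(radius, target=1.8):
    """Squared mismatch between predicted occupancy and a
    hybrid-DFT reference occupancy (the 'target')."""
    return (occupancy_model(radius) - target) ** 2

def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a unimodal 1-D function f on [lo, hi]."""
    phi = (5 ** 0.5 - 1) / 2  # inverse golden ratio ~0.618
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):       # minimum lies in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                 # minimum lies in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

best = golden_section(cost, 1.0, 4.0)  # converges near radius = 2.1
```

In the actual workflow each cost evaluation is an expensive electronic-structure calculation, which is why the paper uses Bayesian optimisation (a surrogate model plus acquisition function) rather than a direct line search like this one.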