Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-k accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-k accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.
{"title":"An exploration of dataset bias in single-step retrosynthesis prediction","authors":"Sara Tanovic, Ewa Wieczorek and Fernanda Duarte","doi":"10.1039/D5DD00358J","DOIUrl":"https://doi.org/10.1039/D5DD00358J","url":null,"abstract":"<p >Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-<em>k</em> accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-<em>k</em> accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 793-802"},"PeriodicalIF":6.2,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00358j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chloe Wilson, María Calvo, Stamatia Zavitsanou, James D. Somper, Ewa Wieczorek, Tom Watts, Jason Crain and Fernanda Duarte
The accurate prediction of reaction rates is an integral step in elucidating reaction mechanisms and designing synthetic pathways. Traditionally, kinetic parameters have been derived from activation energies obtained from quantum mechanical (QM) methods and, more recently, machine learning (ML) approaches. Among ML methods, Bidirectional Encoder Representations from Transformers (BERT), a type of transformer-based model, is the state-of-the-art method for both reaction classification and yield prediction. Despite its success, it has yet to be applied to kinetic prediction. In this work, we developed a BERT model to predict experimental log k values of bimolecular nucleophilic substitution (SN2) reactions and compared its performance to the top-performing Random Forest (RF) literature model in terms of accuracy, training time, and interpretability. Both BERT and RF models exhibit near-experimental accuracy (RMSE ≈ 1.1 log k) on similarity-split test data. Interpretation of the predictions from both models reveals that they successfully identify key reaction centres and reproduce known electronic and steric trends. This analysis also highlights the distinct limitations of each; RF outperformed BERT in identifying aromatic allylic effects, while BERT showed stronger extrapolation capabilities.
{"title":"Kinetic predictions for SN2 reactions using the BERT architecture: comparison and interpretation","authors":"Chloe Wilson, María Calvo, Stamatia Zavitsanou, James D. Somper, Ewa Wieczorek, Tom Watts, Jason Crain and Fernanda Duarte","doi":"10.1039/D5DD00192G","DOIUrl":"https://doi.org/10.1039/D5DD00192G","url":null,"abstract":"<p >The accurate prediction of reaction rates is an integral step in elucidating reaction mechanisms and designing synthetic pathways. Traditionally, kinetic parameters have been derived from activation energies obtained from quantum mechanical (QM) methods and, more recently, machine learning (ML) approaches. Among ML methods, Bidirectional Encoder Representations from Transformers (BERT), a type of transformer-based model, is the state-of-the-art method for both reaction classification and yield prediction. Despite its success, it has yet to be applied to kinetic prediction. In this work, we developed a BERT model to predict experimental log <em>k</em> values of bimolecular nucleophilic substitution (S<small><sub>N</sub></small>2) reactions and compared its performance to the top-performing Random Forest (RF) literature model in terms of accuracy, training time, and interpretability. Both BERT and RF models exhibit near-experimental accuracy (RMSE ≈ 1.1 log <em>k</em>) on similarity-split test data. Interpretation of the predictions from both models reveals that they successfully identify key reaction centres and reproduce known electronic and steric trends. This analysis also highlights the distinct limitations of each; RF outperformed BERT in identifying aromatic allylic effects, while BERT showed stronger extrapolation capabilities.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 743-753"},"PeriodicalIF":6.2,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00192g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Madeleine A. Gaidimas, Abhijoy Mandal, Pan Chen, Shi Xuan Leong, Gyu-Hee Kim, Akshay Talekar, Kent O. Kirlikovali, Kourosh Darvish, Omar K. Farha, Varinia Bernales and Alán Aspuru-Guzik
Advances in high-throughput instrumentation and laboratory automation are revolutionizing materials synthesis by enabling the rapid generation of large libraries of novel materials. However, efficient characterization of these synthetic libraries remains a significant bottleneck in the discovery of new materials. Traditional characterization methods are often limited to sequential analysis, making them time-intensive and cost-prohibitive when applied to large sample sets. In the same way that chemists interpret visual indicators to identify promising samples, computer vision (CV) is an efficient approach to accelerate materials characterization across varying scales when visual cues are present. CV is particularly useful in high-throughput synthesis and characterization workflows, as these techniques can be rapid, scalable, and cost-effective. Although there is a growing set of examples in the literature, we have found a lack of practical resources that help newcomers to the field get started. Here, we aim to fill that gap and present a structured tutorial for experimentalists to integrate computer vision into high-throughput materials research, providing a detailed roadmap from data collection to model validation. Specifically, we describe the hardware and software stack required for deploying CV in materials characterization, including image acquisition, annotation strategies, model training, and performance evaluation. As a case study, we demonstrate the implementation of a CV workflow within a high-throughput materials synthesis and characterization platform to investigate the crystallization of metal–organic frameworks (MOFs). By outlining key challenges and best practices, this tutorial aims to equip chemists and materials scientists with the necessary tools to harness CV for accelerating materials discovery.
{"title":"Computer vision for high-throughput materials synthesis: a tutorial for experimentalists","authors":"Madeleine A. Gaidimas, Abhijoy Mandal, Pan Chen, Shi Xuan Leong, Gyu-Hee Kim, Akshay Talekar, Kent O. Kirlikovali, Kourosh Darvish, Omar K. Farha, Varinia Bernales and Alán Aspuru-Guzik","doi":"10.1039/D5DD00384A","DOIUrl":"https://doi.org/10.1039/D5DD00384A","url":null,"abstract":"<p >Advances in high-throughput instrumentation and laboratory automation are revolutionizing materials synthesis by enabling the rapid generation of large libraries of novel materials. However, efficient characterization of these synthetic libraries remains a significant bottleneck in the discovery of new materials. Traditional characterization methods are often limited to sequential analysis, making them time-intensive and cost-prohibitive when applied to large sample sets. In the same way that chemists interpret visual indicators to identify promising samples, computer vision (CV) is an efficient approach to accelerate materials characterization across varying scales when visual cues are present. CV is particularly useful in high-throughput synthesis and characterization workflows, as these techniques can be rapid, scalable, and cost-effective. Although there is a set of growing examples in the literature, we have found a lack of resources where newcomers interested in the field could get a hold of a practical way to get started. Here, we aim to fill that identified gap and present a structured tutorial for experimentalists to integrate computer vision into high-throughput materials research, providing a detailed roadmap from data collection to model validation. Specifically, we describe the hardware and software stack required for deploying CV in materials characterization, including image acquisition, annotation strategies, model training, and performance evaluation. As a case study, we demonstrate the implementation of a CV workflow within a high-throughput materials synthesis and characterization platform to investigate the crystallization of metal–organic frameworks (MOFs). By outlining key challenges and best practices, this tutorial aims to equip chemists and materials scientists with the necessary tools to harness CV for accelerating materials discovery.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 510-522"},"PeriodicalIF":6.2,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00384a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Satya Pratik Srivastava, Rohan Gorantla, Sharath Krishna Chundru, Claire J. R. Winkelman, Antonia S. J. S. Mey and Rajeev Kumar Singh
Active learning (AL) prioritises which compounds to measure next for protein–ligand affinity when assay or simulation budgets are limited. We present an explainable AL framework built on Gaussian process regression and assess how molecular representations, covariance kernels, and acquisition policies affect enrichment across four drug-relevant targets. Using recall of the top active compound, we find that dataset identity, that is, a target's chemical landscape, sets the performance ceiling, while method choices modulate outcomes rather than overturn them. Fingerprints with simple Gaussian process kernels provide robust, low-variance enrichment, whereas learned embeddings with non-linear kernels can reach higher peaks but with greater variability. Uncertainty-guided acquisition consistently outperforms random selection, yet no single policy is universally optimal; the best choice follows structure–activity relationship (SAR) complexity. To enhance interpretability beyond black-box selection, we integrate SHapley Additive exPlanations (SHAP) to link high-impact fingerprint bits to chemically meaningful fragments across AL cycles, illustrating how the model's attention progressively concentrates on SAR-relevant motifs. We additionally provide an interactive active learning analysis platform featuring SHAP traces to support reproducibility and target-specific decision-making.
{"title":"Explainable active learning framework for ligand binding affinity prediction","authors":"Satya Pratik Srivastava, Rohan Gorantla, Sharath Krishna Chundru, Claire J. R. Winkelman, Antonia S. J. S. Mey and Rajeev Kumar Singh","doi":"10.1039/D5DD00436E","DOIUrl":"https://doi.org/10.1039/D5DD00436E","url":null,"abstract":"<p >Active learning (AL) prioritises which compounds to measure next for protein–ligand affinity when assay or simulation budgets are limited. We present an explainable AL framework built on Gaussian process regression and assess how molecular representations, covariance kernels, and acquisition policies affect enrichment across four drug-relevant targets. Using recall of the top active compound, we find that dataset identity which is a target's chemical landscape sets the performance ceiling and method choices modulate outcomes rather than overturn them. Fingerprints with simple Gaussian process kernels provide robust, low-variance enrichment, whereas learned embeddings with non-linear kernels can reach higher peaks but with greater variability. Uncertainty-guided acquisition consistently outperforms random selection, yet no single policy is universally optimal; the best choice follows structure–activity relationship (SAR) complexity. To enhance interpretability beyond black-box selection, we integrate SHapley Additive exPlanations (SHAP) to link high-impact fingerprint bits to chemically meaningful fragments across AL cycles, illustrating how the model's attention progressively concentrates on SAR-relevant motifs. We additionally provide an interactive active learning analysis platform featuring SHAP traces to support reproducibility and target-specific decision-making.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 769-779"},"PeriodicalIF":6.2,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00436e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar, Yayati Jadhav and Amir Barati Farimani
Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate–catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which must navigate a large search space without guaranteeing the identification of the global minimum energy. To address this, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb-Agent leverages its built-in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial configurations required while improving the energy prediction accuracy. In this study, we also evaluated the performance of different LLMs—GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-Chat—as the reasoning engine for Adsorb-Agent, with GPT-4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb-Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb-Agent's potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.
{"title":"Adsorb-Agent: autonomous identification of stable adsorption configurations via a large language model agent","authors":"Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar, Yayati Jadhav and Amir Barati Farimani","doi":"10.1039/D5DD00298B","DOIUrl":"https://doi.org/10.1039/D5DD00298B","url":null,"abstract":"<p >Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate–catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which must navigate a large search space without guaranteeing the identification of the global minimum energy. To address this, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb-Agent leverages its built-in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial configurations required while improving the energy prediction accuracy. In this study, we also evaluated the performance of different LLMs—GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-Chat—as the reasoning engine for Adsorb-Agent, with GPT-4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb-Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb-Agent's potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 617-629"},"PeriodicalIF":6.2,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00298b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis H. M. Torres, Sofia M. da Silva, Joel P. Arrais, Catarina Pimentel and Bernardete Ribeiro
Correction for ‘Advancing mutagenicity predictions in drug discovery with an explainable few-shot deep learning framework’ by Luis H. M. Torres et al., Digital Discovery, 2025, 4, 3515–3532, https://doi.org/10.1039/D5DD00276A.
{"title":"Correction: Advancing mutagenicity predictions in drug discovery with an explainable few-shot deep learning framework","authors":"Luis H. M. Torres, Sofia M. da Silva, Joel P. Arrais, Catarina Pimentel and Bernardete Ribeiro","doi":"10.1039/D5DD90058A","DOIUrl":"https://doi.org/10.1039/D5DD90058A","url":null,"abstract":"<p >Correction for ‘Advancing mutagenicity predictions in drug discovery with an explainable few-shot deep learning framework’ by Luis H. M. Torres <em>et al.</em>, <em>Digital Discovery</em>, 2025, <strong>4</strong>, 3515–3532, https://doi.org/10.1039/D5DD00276A.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 463-463"},"PeriodicalIF":6.2,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd90058a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Babak Mahjour, Felix Katzenburg, Emil Lammi and Tim Cernak
In this report, the pharmaceuticals listed in DrugBank were structurally mapped to a commercial catalog of chemical feedstocks through reaction-agnostic, one-step retrosynthetic decomposition. Enumerative combinatorics was utilized to retrosynthesize target molecules into commercially available building blocks, wherein only the bond formed and the minimal substructure template of each building block class are considered. In contrast to the status quo in automated retrosynthesis, our algorithm may suggest reactions that do not yet exist but, if they did, could enable the synthesis of drugs in just one reaction step from commercial feedstocks. Cross-referencing synthons to commercial datasets can thus reveal valuable reaction classes for development in addition to streamlining drug production. Decomposed synthons were linked to target molecules by transformations that form one bond after the elimination of each synthon's respective reactive functional handle, as indicated by their building block class. Specific reactivities were analyzed after post hoc refinement and clustering of commercial synthons. Maps between boronates, bromides, iodides, amines, acids, chlorides, alcohols, and various C–H motifs to form alkyl–alkyl, alkyl–aryl, and aryl–aryl carbon–carbon, carbon–nitrogen, and carbon–oxygen bonds are reported herein, with specific examples for each provided.
{"title":"One step retrosynthesis of drugs from commercially available chemical building blocks and conceivable coupling reactions","authors":"Babak Mahjour, Felix Katzenburg, Emil Lammi and Tim Cernak","doi":"10.1039/D5DD00310E","DOIUrl":"https://doi.org/10.1039/D5DD00310E","url":null,"abstract":"<p >In this report, the pharmaceuticals listed in DrugBank were structurally mapped to a commercial catalog of chemical feedstocks through reaction agnostic one step retrosynthetic decomposition. Enumerative combinatorics was utilized to retrosynthesize target molecules into commercially available building blocks, wherein only the bond formed and the minimal substructure template of each building block class are considered. In contrast to the status quo in automated retrosynthesis, our algorithm may suggest reactions that do not yet exist but, if they did, could enable the synthesis of drugs in just one reaction step from commercial feedstocks. Cross-referencing synthons to commercial datasets can thus reveal valuable reaction classes for development in addition to streamlining drug production. Decomposed synthons were linked to target molecules by transformations that form one bond after the elimination of each synthon's respective reactive functional handle, as indicated by their building block class. Specific reactivities were analyzed after <em>post hoc</em> refinement and clustering of commercial synthons. Maps between boronates, bromides, iodides, amines, acids, chlorides, alcohols, and various C–H motifs to form alkyl–alkyl, alkyl–aryl, and aryl–aryl carbon–carbon, carbon–nitrogen, and carbon–oxygen bonds are reported herein, with specific examples for each provided.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 153-160"},"PeriodicalIF":6.2,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00310e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kento Murakami, Yudai Yamaguchi, Yo Kato, Kazuki Ishikawa, Naoto Tanibata, Hayami Takeda, Masanobu Nakayama and Masayuki Karasuyama
Lithium-ion-conductive oxide materials have attracted considerable attention as solid electrolytes for all-solid-state batteries. In particular, LiZr2(PO4)3-related compounds are promising for high-energy-density devices using metallic lithium anodes, but further enhancement of their ionic conductivity is required. In general, Li-ion conductivity is influenced by mechanisms operating on two distinct length scales. At the atomic scale, point defects and the associated migration barriers within the crystal lattice are critical, whereas at the micrometre scale, porosity and grain-boundary characteristics that develop during sintering become the dominant factors. These coupled effects make systematic optimization of conductivity difficult. In particular, microstructural analysis has often relied on researchers' intuitive interpretation of scanning electron microscopy (SEM) images. Here, we apply a convolutional neural network (CNN), a deep-learning approach that has seen rapid advances in image analysis, to SEM images of LiZr2(PO4)3-based electrolytes. By combining image-derived features with conventional vector descriptors (composition, sintering parameters, etc.), our regression model achieved an R2 of 0.871. Furthermore, visual-interpretability analysis of the trained CNN revealed that grain-boundary regions were highlighted as low-conductivity areas. These findings demonstrate that deep-learning-based SEM analysis enables automated, quantitative evaluation of ionic conductivity and offers a powerful tool for accelerating the development of solid electrolyte materials.
{"title":"Deep learning based SEM image analysis for predicting ionic conductivity in LiZr2(PO4)3-based solid electrolytes","authors":"Kento Murakami, Yudai Yamaguchi, Yo Kato, Kazuki Ishikawa, Naoto Tanibata, Hayami Takeda, Masanobu Nakayama and Masayuki Karasuyama","doi":"10.1039/D5DD00232J","DOIUrl":"https://doi.org/10.1039/D5DD00232J","url":null,"abstract":"<p >Lithium-ion-conductive oxide materials have attracted considerable attention as solid electrolytes for all-solid-state batteries. In particular, LiZr<small><sub>2</sub></small>(PO<small><sub>4</sub></small>)<small><sub>3</sub></small>-related compounds are promising for high-energy-density devices using metallic lithium anodes, but further enhancement of their ionic conductivity is requested. In general, Li-ion conductivity is influenced by mechanisms operating on two distinct length scales. At the atomic scale, point defects and the associated migration barriers within the crystal lattice are critical, whereas at the micrometre scale, porosity and grain-boundary characteristics that develop during sintering become the dominant factors. These coupled effects make systematic optimization of conductivity difficult. In paticular, microstructural analysis has often relied on researchers' intuitive interpretation of scanning electron microscopy (SEM) images. Here, we apply a convolutional neural network (CNN), a deep-learning approach that has seen rapid advances in image analysis, to SEM images of LiZr<small><sub>2</sub></small>(PO<small><sub>4</sub></small>)<small><sub>3</sub></small>-based electrolytes. By combining image-derived features with conventional vector descriptors (composition, sintering parameters, <em>etc.</em>), our regression model achieved an <em>R</em><small><sup>2</sup></small> of 0.871. Furthermore, visual-interpretability analysis of the trained CNN revealed that grain-boundary regions were highlighted as low-conductivity areas. These findings demonstrate that deep-learning-based SEM analysis enables automated, quantitative evaluation of ionic conductivity and offers a powerful tool for accelerating the development of solid electrolyte materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 453-462"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00232j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven G. Arturo, Clyde Fare, Kaoru Aou, Dan Dermody, Will Edsall, Jillian Emerson, Kathryn Grzesiak, Arjita Kulshreshtha, Paul Mwasame, Edward O. Pyzer-Knapp and Jed Pitera
Phase diagrams of complex fluids are essential tools for understanding solubility and miscibility. Using a new objective function coupled with a constrained Bayesian optimization algorithm, we demonstrate the efficient location of phase boundaries in a representative two-phase ternary system modeled using polymer self-consistent field theory, regularly requiring 50% fewer observations than an exhaustive search. Our approach is general, gradient-free, and can be applied to either simulation or experimental campaigns.
{"title":"Efficient simulation of complex fluid phase diagrams with Bayesian optimization","authors":"Steven G. Arturo, Clyde Fare, Kaoru Aou, Dan Dermody, Will Edsall, Jillian Emerson, Kathryn Grzesiak, Arjita Kulshreshtha, Paul Mwasame, Edward O. Pyzer-Knapp and Jed Pitera","doi":"10.1039/D5DD00150A","DOIUrl":"https://doi.org/10.1039/D5DD00150A","url":null,"abstract":"<p >Phase diagrams of complex fluids are essential tools for understanding solubility and miscibility. Using a new objective function coupled with a constrained Bayesian optimization algorithm, we demonstrate the efficient location of phase boundaries in a sample two-phase ternary modeled using polymer self-consistent field theory, regularly seeing 50% fewer observations than an exhaustive search. Our approach is general, gradient-free, and can be applied to either simulation or experimental campaigns.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 88-92"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00150a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yangxin Fan, Yinghui Wu, Roger H. French, Danny Perez, Michael G. Taylor and Ping Yang
Solubility quantifies the concentration of a molecule that can dissolve in a given solvent. Accurate prediction of solubility is essential for optimizing drug efficacy, improving chemical and separation processes, and managing waste, among many other industrial and research applications. Predicting solubility from first principles remains a complex and computationally intensive physicochemical challenge. Recent successes of graph neural networks for molecular learning tasks inspire us to develop HASolGNN, a hierarchical attention graph neural network for solubility prediction. (1) HASolGNN adopts a three-level hierarchical attention framework to leverage atom-bond, molecular, and interaction-graph level features. This allows a more comprehensive modeling of both intra-molecular and inter-molecular interactions for solute–solvent dissolution as a complex system. (2) To mitigate the impact of small amounts of annotated data, we also investigate the role of Large Language Models (LLMs), and introduce HASolGNN-LLMs, an LLM-enhanced predictive framework that leverages LLMs to infer annotated features and embeddings to improve representation learning. Our experiments verified that (1) HASolGNN outperforms the state-of-the-art methods in solubility prediction; and (2) HASolGNN-LLMs effectively exploits LLMs to enhance sparsely annotated data and further improves overall accuracy.
{"title":"Hierarchical attention graph learning with LLM enhancement for molecular solubility prediction","authors":"Yangxin Fan, Yinghui Wu, Roger H. French, Danny Perez, Michael G. Taylor and Ping Yang","doi":"10.1039/D5DD00407A","DOIUrl":"https://doi.org/10.1039/D5DD00407A","url":null,"abstract":"<p >Solubility quantifies the concentration of a molecule that can dissolve in a given solvent. Accurate prediction of solubility is essential for optimizing drug efficacy, improving chemical and separation processes, and waste management, among many other industrial and research applications. Predicting solubility from first principles remains a complex and computationally intensive physicochemical challenge. Recent successes of graph neural networks for molecular learning tasks inspire us to develop HASolGNN, a hierarchical–attention graph neural network for solubility prediction. (1) HASolGNN adopts a three-level hierarchical attention framework to leverage atom-bond, molecular, and interaction-graph level features. This allows a more comprehensive modeling of both intra-molecular and inter-molecular interactions for solute–solvent dissolution as a complex system. (2) To mitigate the impact of small amounts of annotated data, we also investigate the role of Large Language Models (LLMs), and introduce HASolGNN-LLMs, an LLM-enhanced predictive framework that leverages LLMs to infer annotated features and embeddings to improve representation learning. Our experiments verified that (1) HASolGNN outperforms the state-of-the-art methods in solubility prediction; and (2) HASolGNN-LLMs effectively exploits LLMs to enhance sparsely annotated data and further improves overall accuracy.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 603-616"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00407a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}