Wentao Li, Yijun Li, Qi Lei, Zemeng Wang and Xiaonan Wang
Designing high-performance polymers remains a critical challenge due to the vast design space. While machine learning and generative models have advanced polymer informatics, most approaches lack directional optimization capabilities and fail to close the loop between design and physical validation. Here we introduce PolyRL, a closed-loop reinforcement learning (RL) framework for the inverse design of gas separation polymers. By integrating reward model training, generative model pre-training, RL fine-tuning, and theoretical validation, PolyRL achieves multi-objective optimization under data-scarce conditions. We demonstrate that PolyRL is capable of efficiently generating polymer candidates with enhanced gas separation performance, as substantiated by detailed molecular simulation analyses. Additionally, we establish a standardized benchmark for RL-based polymer generation, providing a foundation for future research. This work showcases the power of reinforcement learning in polymer design and advances AI-driven materials discovery toward closed-loop, goal-directed paradigms.
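The abstract describes reward-guided RL fine-tuning of a polymer generator. As a purely illustrative sketch — not the authors' implementation; the token vocabulary, the stand-in "reward model", and the REINFORCE update with a mean baseline are all hypothetical — a minimal reward-guided sequence optimizer could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 4, 6                    # toy token vocabulary and sequence length
TARGET = np.array([1, 3, 0, 2, 1, 3])   # hypothetical high-reward token pattern

theta = np.zeros((LENGTH, VOCAB))       # per-position logits standing in for a generator

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sample(theta):
    probs = policy(theta)
    return np.array([rng.choice(VOCAB, p=p) for p in probs]), probs

def reward(seq):
    # stand-in "reward model": fraction of positions matching the target pattern
    return float((seq == TARGET).mean())

def reinforce_step(theta, lr=0.5, batch=32):
    samples = [sample(theta) for _ in range(batch)]
    rewards = [reward(s) for s, _ in samples]
    baseline = np.mean(rewards)           # variance-reduction baseline
    grads = np.zeros_like(theta)
    for (seq, probs), r in zip(samples, rewards):
        g = -probs
        g[np.arange(LENGTH), seq] += 1.0  # gradient of log-prob for a categorical policy
        grads += (r - baseline) * g
    return theta + lr * grads / batch

for _ in range(200):
    theta = reinforce_step(theta)
```

In the actual framework the policy would be a pre-trained generative model and the reward a trained property predictor; this sketch only shows the shape of the update that steers generation toward high-reward sequences.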
"PolyRL: reinforcement learning-guided polymer generation for multi-objective polymer discovery", Digital Discovery, 2026, 1, 266–276. DOI: 10.1039/D5DD00272A. Published 2025-11-25. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00272a
Khayrul Islam, Ryan F. Forelli, Jianzhong Han, Deven Bhadane, Jian Huang, Joshua C. Agar, Nhan Tran, Seda Ogrenci and Yaling Liu
Precise cell classification is essential in biomedical diagnostics and therapeutic monitoring, particularly for identifying diverse cell types involved in various diseases. Traditional cell classification methods, such as flow cytometry, depend on molecular labeling, which is often costly, time-intensive, and can alter cell integrity. Real-time microfluidic sorters also impose a sub-ms decision window that existing machine-learning pipelines cannot meet. To overcome these limitations, we present a label-free machine learning framework for cell classification, designed for real-time sorting applications using bright-field microscopy images. This approach leverages a teacher–student model architecture enhanced by knowledge distillation, achieving high efficiency and scalability across different cell types. Demonstrated through a use case of classifying lymphocyte subsets, our framework accurately classifies T4, T8, and B cell types with a dataset of 80 000 pre-processed images, released publicly as the LymphoMNIST package for reproducible benchmarking. Our teacher model attained 98% accuracy in differentiating T4 cells from B cells and 93% accuracy in zero-shot classification between T8 and B cells. Remarkably, our student model operates with only 5682 parameters (∼0.02% of the teacher, a 5000-fold reduction), enabling field-programmable gate array (FPGA) deployment. Implemented directly on the frame-grabber FPGA as the first demonstration of in situ deep learning in this setting, the student model achieves an ultra-low inference latency of just 14.5 µs and a complete cell detection-to-sorting trigger time of 24.7 µs, delivering 12× and 40× improvements over the previous state of the art in inference and total latency, respectively, while preserving accuracy comparable to the teacher model. This framework establishes the first sub-25 µs ML benchmark for label-free cytometry and provides an open, cost-effective blueprint for upgrading existing imaging sorters.
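The teacher–student compression described above is conventionally trained with a combined soft-target/hard-label distillation loss. A minimal NumPy sketch — the temperature, weighting, and shapes are illustrative assumptions, not the paper's values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target distillation plus hard-label cross-entropy (weights illustrative)."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    # cross-entropy against the teacher's softened distribution (KL up to a constant);
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    log_p_hard = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_hard[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```

The softened teacher distribution is what lets a 5682-parameter student inherit decision boundaries from a much larger model; the FPGA then only has to execute the tiny student at inference time.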
"Real-time cell sorting with scalable in situ FPGA-accelerated deep learning", Digital Discovery, 2026, 1, 254–265. DOI: 10.1039/D5DD00345H. Published 2025-11-24. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00345h
Trevor Hastings, James Paramore, Brady Butler and Raymundo Arróyave
Bayesian optimization (BO) has emerged as an effective strategy to accelerate the discovery of new materials by efficiently exploring complex and high-dimensional design spaces. However, the success of BO methods greatly depends on how well the optimization campaign is initialized—the selection of initial data points from which the optimization starts. In this study, we focus on improving these initial datasets by incorporating materials science expertise into the selection process. We identify common challenges and sources of uncertainty when choosing these starting points and propose practical guidelines for using expert-defined criteria to create more informative initial datasets. By evaluating these methods through simulations and real-world alloy design problems, we demonstrate that using domain-informed criteria leads to initial datasets that are more diverse and representative. This enhanced starting point significantly improves the efficiency and effectiveness of subsequent optimization efforts. We also introduce clear metrics for assessing the quality and diversity of initial datasets, providing a straightforward way to compare different initialization strategies. Our approach offers a robust and widely applicable framework to enhance Bayesian optimization across various materials discovery scenarios.
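One concrete way to quantify how "diverse" a candidate initial dataset is — a hedged illustration only; the paper defines its own metrics — is a maximin criterion over pairwise distances, with a greedy selector that optimizes it:

```python
import numpy as np

def diversity_score(X):
    """Minimum pairwise Euclidean distance of a candidate initial dataset.
    Higher is better: points are spread out rather than clustered (maximin)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return d[np.triu_indices(len(X), k=1)].min()

def greedy_maximin(candidates, k, seed=0):
    """Greedily pick k points from a candidate pool to maximize the maximin score."""
    rng = np.random.default_rng(seed)
    chosen = [rng.integers(len(candidates))]
    for _ in range(k - 1):
        d = np.linalg.norm(candidates[:, None] - candidates[chosen][None], axis=-1)
        chosen.append(int(d.min(axis=1).argmax()))  # farthest from the chosen set
    return candidates[chosen]
```

Domain-informed criteria would act here by restricting or re-weighting the candidate pool before selection, so the initial design is both spread out and physically meaningful.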
"Leveraging domain knowledge for optimal initialization in Bayesian materials optimization", Digital Discovery, 2026, 1, 277–289. DOI: 10.1039/D5DD00361J. Published 2025-11-20. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00361j
Naruki Yoshikawa, Kevin Angers, Kourosh Darvish, Sargol Okhovatian, Dawn Bannerman, Ilya Yakavets, Milica Radisic and Alán Aspuru-Guzik
Precise liquid handling is an essential operation for self-driving laboratories. In 2023, we introduced the digital pipette, a low-cost, 3D-printed device that enables accurate liquid transfer by robotic arms. However, the initial version lacked mechanisms to prevent cross-contamination when handling multiple liquids. In this commit paper, we present the digital pipette v2, an updated design that mitigates contamination risk by allowing robotic arms to exchange pipette tips. The new hardware achieves liquid handling accuracy within the permissible error range defined by ISO 8655-2, supporting a broader range of experiments involving multiple liquids.
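ISO 8655-2 sets maximum permissible systematic and random errors per nominal volume. A hedged sketch of the usual gravimetric evaluation — the limit values themselves must be looked up in the standard for the volume in question, so they are passed in as parameters here rather than hard-coded:

```python
import statistics

def gravimetric_check(measured_ul, nominal_ul, max_sys_pct, max_rand_pct):
    """Evaluate a pipette calibration run in the style of an ISO 8655 gravimetric test.
    max_sys_pct / max_rand_pct are the permissible systematic and random errors
    (in percent) for this nominal volume, taken from the standard."""
    mean_v = statistics.mean(measured_ul)
    systematic_pct = 100 * (mean_v - nominal_ul) / nominal_ul   # bias of the mean
    random_pct = 100 * statistics.stdev(measured_ul) / nominal_ul  # CV of repeats
    ok = abs(systematic_pct) <= max_sys_pct and random_pct <= max_rand_pct
    return systematic_pct, random_pct, ok
```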
"Commit: Digital pipette: open hardware for liquid transfer in self-driving laboratories", Digital Discovery, 2026, 1, 93–97. DOI: 10.1039/D5DD00336A. Published 2025-11-19. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00336a
Jason L. Wu, David M. Friday, Changhyun Hwang, Seungjoo Yi, Tiara C. Torres-Flores, Martin D. Burke, Ying Diao, Charles M. Schroeder and Nicholas E. Jackson
Machine learning (ML) is increasingly central to chemical discovery, yet most efforts remain confined to distributed and isolated research groups, limiting external validation and community engagement. Here, we introduce a generalizable mode of scientific outreach that couples a published study to a community-engaged test set, enabling post-publication evaluation by the broader ML community. This approach is demonstrated using a prior study on AI-guided discovery of photostable light-harvesting small molecules. After publishing an experimental dataset and in-house ML models, we leveraged automated block chemistry to synthesize nine additional light-harvesting molecules to serve as a blinded community test set. We then hosted an open Kaggle competition where we challenged the world community to outperform our best in-house predictive photostability model. In only one month, this competition received >700 submissions, including several innovative strategies that improved upon our previously published results. Given the success of this competition, we propose community-engaged test sets as a blueprint for post-publication benchmarking that democratizes access to high-quality experimental data, encourages innovative scientific engagement, and strengthens cross-disciplinary collaboration in the chemical sciences.
"Democratizing machine learning in chemistry with community-engaged test sets", Digital Discovery, 2026, 1, 304–309. DOI: 10.1039/D5DD00424A. Published 2025-11-19. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00424a
Bing Ma, Na Qin, Qianqian Yan, Wei Zhou, Sheng Zhang, Xiao Wang, Lipiao Bao and Xing Lu
Porous framework materials—including metal–organic frameworks (MOFs) and covalent organic frameworks (COFs)—have attracted widespread attention due to their high surface areas, tunable pore structures, and diverse functionalities, enabling promising applications in gas separation, catalysis, and energy storage. However, the vast chemical configuration space and the complexity of multi-parameter synthesis conditions pose significant challenges to the rational design and controlled synthesis of materials with targeted properties. In recent years, artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), in combination with multiscale molecular simulation methods such as density functional theory (DFT), grand canonical Monte Carlo (GCMC), and molecular dynamics (MD), has emerged as a powerful tool for accelerating the screening and optimization of framework materials. This review systematically summarizes AI-assisted strategies for framework material design, focusing on data-driven prediction of synthetic routes, optimization of reaction conditions, and inverse design targeting specific functionalities. We evaluate key AI models, including interpretable tree-based algorithms and neural networks capable of modeling complex structure–property relationships, and highlight their integration with atomistic simulations to enhance predictive accuracy. Furthermore, the synergy between AI and automated experimental platforms is advancing the development of high-throughput experimentation and self-optimizing workflows, often referred to as self-driving laboratories. Several case studies illustrate the effectiveness of AI methods in identifying high-performance framework materials and achieving morphology control, particularly when leveraging the integration of experimental and simulation data. 
The review also discusses key challenges in AI-assisted materials design, including inconsistent data quality, limited model interpretability, and the gap between prediction and practical synthesis. Looking ahead, the continued expansion of materials databases, advances in AI algorithms, and deeper integration of domain knowledge are expected to play an increasingly vital role in framework material development, driving a paradigm shift in materials research from empirical trial-and-error to more efficient, predictive, and intelligent design.
"Advancing metal organic framework and covalent organic framework design via the digital-intelligent paradigm", Digital Discovery, 2026, 2, 523–547. DOI: 10.1039/D5DD00401B. Published 2025-11-18. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00401b
Ilgar Baghishov, Jan Janssen, Graeme Henkelman and Danny Perez
Machine-learned interatomic potentials (MLIPs) are revolutionizing computational materials science and chemistry by offering an efficient alternative to ab initio molecular dynamics (MD) simulations. However, fitting high-quality MLIPs remains a challenging, time-consuming, and computationally intensive task in which numerous trade-offs must be weighed: how many and what kind of atomic configurations should be included in the training set? Which level of ab initio convergence should be used to generate the training set? Which loss function should be used for fitting the MLIP? Which machine learning architecture should be used to train the MLIP? The answers to these questions significantly impact both the computational cost of MLIP training and the accuracy and computational cost of subsequent MLIP MD simulations. In this study, we use a configurationally diverse beryllium dataset and a quadratic spectral neighbor analysis potential. We demonstrate that jointly optimizing energy-versus-force weights, training-set selection strategies, convergence settings of the ab initio reference simulations, and model complexity can significantly reduce the overall computational cost of training and evaluating MLIPs. This opens the door to computationally efficient generation of high-quality MLIPs for a range of applications that demand different accuracy versus training and evaluation cost trade-offs.
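The energy-versus-force weighting that such studies tune is typically a weighted sum of squared errors over reference data. An illustrative sketch — the weights, normalization, and array shapes are assumptions for clarity, not the paper's settings:

```python
import numpy as np

def mlip_loss(E_pred, E_ref, F_pred, F_ref, w_energy=1.0, w_force=0.1):
    """Weighted per-atom energy + per-component force MSE.
    F_* have shape (n_structures, n_atoms, 3); energies are per-structure totals."""
    n_atoms = F_ref.shape[1]
    e_term = np.mean(((E_pred - E_ref) / n_atoms) ** 2)  # per-atom energy error
    f_term = np.mean((F_pred - F_ref) ** 2)              # per-component force error
    return w_energy * e_term + w_force * f_term
```

Shifting weight between the two terms trades energy accuracy (important for thermodynamics) against force accuracy (important for dynamics), which is one axis of the cost/accuracy optimization the abstract describes.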
"Application-specific machine-learned interatomic potentials: exploring the trade-off between DFT convergence, MLIP expressivity, and computational cost", Digital Discovery, 2026, 1, 332–347. DOI: 10.1039/D5DD00294J. Published 2025-11-18. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00294j
We present RetroSynFormer, a novel approach to multi-step retrosynthesis planning. Here, we express the task of iteratively breaking down a compound into building blocks as a sequence-modeling problem and train a model based on the Decision Transformer. The synthesis routes are generated by iteratively predicting chemical reactions from a set of predefined rules that encode known transformations, and routes are scored during construction using a novel reward function. RetroSynFormer was trained on routes extracted from the PaRoutes dataset of patented experimental routes. On targets from the PaRoutes test set, the RetroSynFormer could find routes to commercial starting materials for 92% of the targets, and we show that the produced routes on average are close to the reference patented route and of good quality. Furthermore, we explore alternative model implementations and discuss the robustness of the model with respect to beam width, reward function, and template space size. We also compare RetroSynFormer to AiZynthFinder, a conventional retrosynthesis algorithm, and find that our novel model is competitive and complementary to the established methodology, thus forming a valuable addition to the field of computer-aided synthesis planning.
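A route-scoring function of the general kind the abstract mentions could be sketched as follows. The tree encoding, stock check, and depth penalty are hypothetical illustrations, not RetroSynFormer's actual reward:

```python
def route_reward(route, stock, depth_penalty=0.05):
    """Toy route score: fraction of leaf precursors that are purchasable,
    minus a small penalty per level of route depth (weights are illustrative)."""
    leaves, max_depth = [], 0

    def walk(node, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        children = node.get("children", [])
        if not children:                    # leaf = required starting material
            leaves.append(node["smiles"])
        for child in children:
            walk(child, depth + 1)

    walk(route, 0)
    solved = sum(s in stock for s in leaves) / len(leaves)
    return solved - depth_penalty * max_depth
```

A reward shaped like this rewards routes that terminate in commercial building blocks while discouraging needlessly deep trees, which is the behavior the 92% solve rate reflects.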
Emma Granqvist, Rocío Mercado and Samuel Genheden, "Retrosynformer: planning multi-step chemical synthesis routes via a decision transformer", Digital Discovery, 2026, 1, 348–362. DOI: 10.1039/D5DD00153F. Published 2025-11-18. Open-access PDF: https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00153f
Nakul Rampal, Dongrong Joe Fu, Chengbin Zhao, Hanan S. Murayshid, Albatool A. Abaalkhail, Nahla E. Alhazmi, Majed O. Alawad, Christian Borgs, Jennifer T. Chayes and Omar M. Yaghi
We report an automated evaluation agent that can reliably assign classification labels to different Q&A pairs of both single-hop and multi-hop types, as well as to synthesis conditions datasets. Our agent is built around a suite of large language models (LLMs) and is designed to eliminate human involvement in the evaluation process. Even though we believe that this approach has broad applicability, for concreteness, we apply it here to reticular chemistry. Through extensive testing of various approaches such as DSPy and finetuning, among others, we found that the performance of a given LLM on these Q&A and synthesis conditions classification tasks is determined primarily by the architecture of the agent, where how the different inputs are parsed and processed and how the LLMs are called make a significant difference. We also found that the quality of the prompt provided remains paramount, irrespective of the sophistication of the underlying model. Even models considered state-of-the-art, such as GPT-o1, exhibit poor performance when the prompt lacks sufficient detail and structure. To overcome these challenges, we performed systematic prompt optimization, iteratively refining the prompt to significantly improve classification accuracy and achieve human-level evaluation benchmarks. We show that while LLMs have made remarkable progress, they still fall short of human reasoning without substantial prompt engineering. The agent presented here provides a robust and reproducible tool for evaluating Q&A pairs and synthesis conditions in a scalable manner and can serve as a foundation for future developments in automated evaluation of LLM inputs and outputs and more generally to create foundation models in chemistry.
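Claiming "human-level evaluation benchmarks" implies measuring agreement between agent-assigned and human-assigned labels. Chance-corrected agreement such as Cohen's kappa is a standard choice for this; the helper below is illustrative, and the paper's exact metrics may differ:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences (e.g. agent vs human).
    1.0 = perfect agreement, 0.0 = agreement expected by chance alone."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected) if p_expected < 1 else 1.0
```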
{"title":"An automated evaluation agent for Q&A pairs and reticular synthesis conditions","authors":"Nakul Rampal, Dongrong Joe Fu, Chengbin Zhao, Hanan S. Murayshid, Albatool A. Abaalkhail, Nahla E. Alhazmi, Majed O. Alawad, Christian Borgs, Jennifer T. Chayes and Omar M. Yaghi","doi":"10.1039/D5DD00413F","DOIUrl":"https://doi.org/10.1039/D5DD00413F","url":null,"abstract":"<p >We report an automated evaluation agent that can reliably assign classification labels to different Q&A pairs of both single-hop and multi-hop types, as well as to synthesis conditions datasets. Our agent is built around a suite of large language models (LLMs) and is designed to eliminate human involvement in the evaluation process. Even though we believe that this approach has broad applicability, for concreteness, we apply it here to reticular chemistry. Through extensive testing of various approaches such as DSPy and finetuning, among others, we found that the performance of a given LLM on these Q&A and synthesis conditions classification tasks is determined primarily by the architecture of the agent, where how the different inputs are parsed and processed and how the LLMs are called make a significant difference. We also found that the quality of the prompt provided remains paramount, irrespective of the sophistication of the underlying model. Even models considered state-of-the-art, such as GPT-o1, exhibit poor performance when the prompt lacks sufficient detail and structure. To overcome these challenges, we performed systematic prompt optimization, iteratively refining the prompt to significantly improve classification accuracy and achieve human-level evaluation benchmarks. We show that while LLMs have made remarkable progress, they still fall short of human reasoning without substantial prompt engineering. 
The agent presented here provides a robust and reproducible tool for evaluating Q&A pairs and synthesis conditions in a scalable manner and can serve as a foundation for future developments in automated evaluation of LLM inputs and outputs and more generally to create foundation models in chemistry.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 231-240"},"PeriodicalIF":6.2,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00413f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
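The systematic prompt optimization described in the abstract above can be pictured as a select-the-best loop over candidate prompts scored against a labeled benchmark. A minimal sketch follows; the `classify` callable, the prompt variants, and both function names are illustrative stand-ins, not the paper's actual agent or API:

```python
def benchmark_accuracy(classify, prompt, labeled_pairs):
    """Fraction of benchmark items whose predicted label matches the gold label.

    classify(prompt, item) stands in for a call to an LLM-backed classifier.
    """
    hits = sum(1 for item, gold in labeled_pairs if classify(prompt, item) == gold)
    return hits / len(labeled_pairs)

def select_best_prompt(classify, prompt_variants, labeled_pairs):
    """Greedy prompt refinement: keep the variant with the highest accuracy."""
    return max(prompt_variants,
               key=lambda p: benchmark_accuracy(classify, p, labeled_pairs))
```

In practice each refinement round would generate new variants of the current best prompt (e.g. adding detail and structure, which the abstract identifies as decisive) and re-score them, iterating until accuracy reaches the human-level benchmark.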
This research investigates predicting the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital (HOMO–LUMO; short HL) gap of natural compounds, a crucial property for understanding molecular electronic behavior relevant to cheminformatics and materials science. To address the high computational cost of traditional methods, this study develops a high-throughput, machine learning (ML)-based approach. Using 407 000 molecules from the COCONUT database, RDKit was employed to calculate and select molecular descriptors. The computational workflow, managed by Toil and CWL on a high-performance computing (HPC) Slurm cluster, utilized Geometry – Frequency – Noncovalent – eXtended Tight Binding (GFN2-xTB) for electronic structure calculations with Boltzmann weighting across multiple conformational states. Three ensemble methods, namely Gradient Boosting Regression (GBR), eXtreme Gradient Boosting Regression (XGBR) and Random Forest Regression (RFR), as well as a Multi-layer Perceptron Regressor (MLPR), were compared based on their ability to accurately predict HL-gaps in this chemical space. Key findings reveal molecular polarizability, particularly the SMR_VSA descriptors, as crucial for HL-gap determination in all models. Aromatic rings and functional groups, such as ketones, also significantly influence the HL-gap prediction. While the MLPR model demonstrated good overall predictive performance, accuracy varied across molecular subsets. Challenges were observed in predicting HL-gaps for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting limitations in handling specific structural motifs. These limitations point towards promising perspectives for further model improvements.
{"title":"High throughput tight binding calculation of electronic HOMO–LUMO gaps and its prediction for natural compounds","authors":"Sascha Thinius","doi":"10.1039/D5DD00186B","DOIUrl":"https://doi.org/10.1039/D5DD00186B","url":null,"abstract":"<p >This research investigates predicting the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital (HOMO–LUMO; short HL) gap of natural compounds, a crucial property for understanding molecular electronic behavior relevant to cheminformatics and materials science. To address the high computational cost of traditional methods, this study develops a high-throughput, machine learning (ML)-based approach. Using 407 000 molecules from the COCONUT database, RDKit was employed to calculate and select molecular descriptors. The computational workflow, managed by Toil and CWL on a high-performance computing (HPC) Slurm cluster, utilized Geometry – Frequency – Noncovalent – eXtended Tight Binding (GFN2-xTB) for electronic structure calculations with Boltzmann weighting across multiple conformational states. Three ensemble methods, namely Gradient Boosting Regression (GBR), eXtreme Gradient Boosting Regression (XGBR), Random Forrest Regression (RFR) and a Multi-layer Perceptron Regressor (MLPR) were compared based on their ability to accurately predict HL-gaps in this chemical space. Key findings reveal molecular polarizability, particularly SMR_VSA descriptors, as crucial for HL-gap determination in all models. Aromatic rings and functional groups, such as ketones, also significantly influence the HL-gap prediction. While the MLPR model demonstrated good overall predictive performance, accuracy varied across molecular subsets. Challenges were observed in predicting HL-gaps for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. 
This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting limitations in handling specific structural motifs. These limitations point towards promising perspectives for further model improvements.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 1","pages":" 203-213"},"PeriodicalIF":6.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00186b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
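The Boltzmann weighting across conformational states mentioned in the abstract above amounts to averaging a per-conformer property with weights proportional to exp(-ΔE/kT). A minimal sketch follows; the function name and the choice of Hartree as the energy unit are illustrative assumptions, not taken from the paper:

```python
import math

K_B = 3.166811563e-6  # Boltzmann constant in Hartree per Kelvin

def boltzmann_weighted(energies, values, temperature=298.15):
    """Boltzmann-average a per-conformer property (e.g. the HL-gap).

    energies: conformer energies in Hartree (absolute or relative);
    values:   the property value computed for each conformer.
    """
    e_min = min(energies)  # shift energies so the largest exponent is 0
    beta = 1.0 / (K_B * temperature)
    weights = [math.exp(-(e - e_min) * beta) for e in energies]
    z = sum(weights)  # partition function over the conformer ensemble
    return sum(w * v for w, v in zip(weights, values)) / z
```

Degenerate conformers contribute equally (the result reduces to a simple mean), while at 298.15 K a conformer lying even a few kcal/mol above the minimum contributes almost nothing, which is why low-energy conformers dominate the averaged HL-gap.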