Ilgar Baghishov, Jan Janssen, Graeme Henkelman and Danny Perez
Machine-learned interatomic potentials (MLIPs) are revolutionizing computational materials science and chemistry by offering an efficient alternative to ab initio molecular dynamics (MD) simulations. However, fitting high-quality MLIPs remains a challenging, time-consuming, and computationally intensive task in which numerous trade-offs must be considered. How many and what kinds of atomic configurations should be included in the training set? Which level of ab initio convergence should be used to generate the training set? Which loss function should be used for fitting the MLIP? Which machine learning architecture should be used to train the MLIP? The answers to these questions significantly impact both the computational cost of MLIP training and the accuracy and computational cost of subsequent MLIP MD simulations. In this study, we use a configurationally diverse beryllium dataset and a quadratic spectral neighbor analysis potential. We demonstrate that joint optimization of energy versus force weights, training set selection strategies, convergence settings of the ab initio reference simulations, and model complexity can lead to a significant reduction in the overall computational cost associated with training and evaluating MLIPs. This opens the door to computationally efficient generation of high-quality MLIPs for a range of applications that demand different trade-offs between accuracy and training and evaluation cost.
{"title":"Application-specific machine-learned interatomic potentials: exploring the trade-off between DFT convergence, MLIP expressivity, and computational cost","authors":"Ilgar Baghishov, Jan Janssen, Graeme Henkelman and Danny Perez","doi":"10.1039/D5DD00294J","DOIUrl":"https://doi.org/10.1039/D5DD00294J","journal":{"name":"Digital discovery","volume":"1","pages":"332-347"},"PeriodicalIF":6.2,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00294j?page=search"}
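The energy-versus-force weighting mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' fitting code: the function name `weighted_fit_loss`, the use of mean squared errors, and the default weights are all assumptions chosen for illustration. It only shows how two scalar weights shift the balance between energy and force errors in a single objective.

```python
from typing import Sequence


def weighted_fit_loss(
    e_pred: Sequence[float],
    e_ref: Sequence[float],
    f_pred: Sequence[float],
    f_ref: Sequence[float],
    w_energy: float = 1.0,
    w_force: float = 1.0,
) -> float:
    """Weighted sum of mean squared energy and force errors.

    e_pred / e_ref: predicted and reference energies, one per configuration.
    f_pred / f_ref: predicted and reference force components, flattened.
    w_energy / w_force: hypothetical weights (assumed names, not from the paper).
    """
    if len(e_pred) != len(e_ref) or len(f_pred) != len(f_ref):
        raise ValueError("prediction and reference lengths must match")
    if not e_ref or not f_ref:
        raise ValueError("at least one energy and one force value are required")

    # Mean squared error over energies and over force components.
    mse_energy = sum((p - r) ** 2 for p, r in zip(e_pred, e_ref)) / len(e_ref)
    mse_force = sum((p - r) ** 2 for p, r in zip(f_pred, f_ref)) / len(f_ref)

    # Raising w_force relative to w_energy makes the fit prioritize forces.
    return w_energy * mse_energy + w_force * mse_force
```

Scanning the ratio `w_energy / w_force` and refitting at each value is one simple way to explore the trade-off the abstract describes; the study's actual loss and optimizer may differ.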
We present RetroSynFormer, a novel approach to multi-step retrosynthesis planning. Here, we express the task of iteratively breaking down a compound into building blocks as a sequence-modeling problem and train a model based on the Decision Transformer. The synthesis routes are generated by iteratively predicting chemical reactions from a set of predefined rules that encode known transformations, and routes are scored during construction using a novel reward function. RetroSynFormer was trained on routes extracted from the PaRoutes dataset of patented experimental routes. On the PaRoutes test set, RetroSynFormer found routes to commercial starting materials for 92% of the targets, and we show that the produced routes are, on average, close to the reference patented routes and of good quality. Furthermore, we explore alternative model implementations and discuss the robustness of the model with respect to beam width, reward function, and template space size. We also compare RetroSynFormer to AiZynthFinder, a conventional retrosynthesis algorithm, and find that our model is competitive with and complementary to the established methodology, thus forming a valuable addition to the field of computer-aided synthesis planning.
{"title":"Retrosynformer: planning multi-step chemical synthesis routes via a decision transformer","authors":"Emma Granqvist, Rocío Mercado and Samuel Genheden","doi":"10.1039/D5DD00153F","DOIUrl":"https://doi.org/10.1039/D5DD00153F","journal":{"name":"Digital discovery","volume":"1","pages":"348-362"},"PeriodicalIF":6.2,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00153f?page=search"}
Nakul Rampal, Dongrong Joe Fu, Chengbin Zhao, Hanan S. Murayshid, Albatool A. Abaalkhail, Nahla E. Alhazmi, Majed O. Alawad, Christian Borgs, Jennifer T. Chayes and Omar M. Yaghi
We report an automated evaluation agent that can reliably assign classification labels to Q&A pairs of both single-hop and multi-hop types, as well as to synthesis conditions datasets. Our agent is built around a suite of large language models (LLMs) and is designed to eliminate human involvement in the evaluation process. Although we believe that this approach has broad applicability, for concreteness we apply it here to reticular chemistry. Through extensive testing of various approaches, including DSPy and fine-tuning, we found that the performance of a given LLM on these Q&A and synthesis conditions classification tasks is determined primarily by the architecture of the agent: how the different inputs are parsed and processed and how the LLMs are called make a significant difference. We also found that the quality of the prompt remains paramount, irrespective of the sophistication of the underlying model. Even models considered state-of-the-art, such as GPT-o1, perform poorly when the prompt lacks sufficient detail and structure. To overcome these challenges, we performed systematic prompt optimization, iteratively refining the prompt to significantly improve classification accuracy and reach human-level evaluation benchmarks. We show that while LLMs have made remarkable progress, they still fall short of human reasoning without substantial prompt engineering. The agent presented here provides a robust and reproducible tool for evaluating Q&A pairs and synthesis conditions in a scalable manner. It can serve as a foundation for future developments in the automated evaluation of LLM inputs and outputs and, more generally, for creating foundation models in chemistry.
{"title":"An automated evaluation agent for Q&A pairs and reticular synthesis conditions","authors":"Nakul Rampal, Dongrong Joe Fu, Chengbin Zhao, Hanan S. Murayshid, Albatool A. Abaalkhail, Nahla E. Alhazmi, Majed O. Alawad, Christian Borgs, Jennifer T. Chayes and Omar M. Yaghi","doi":"10.1039/D5DD00413F","DOIUrl":"https://doi.org/10.1039/D5DD00413F","journal":{"name":"Digital discovery","volume":"1","pages":"231-240"},"PeriodicalIF":6.2,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00413f?page=search"}
This research investigates the prediction of the gap between the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital (HOMO–LUMO, abbreviated HL) of natural compounds, a crucial property for understanding molecular electronic behavior relevant to cheminformatics and materials science. To address the high computational cost of traditional methods, this study develops a high-throughput, machine learning (ML)-based approach. Using 407 000 molecules from the COCONUT database, RDKit was employed to calculate and select molecular descriptors. The computational workflow, managed by Toil and CWL on a high-performance computing (HPC) Slurm cluster, utilized Geometry – Frequency – Noncovalent – eXtended Tight Binding (GFN2-xTB) for electronic structure calculations with Boltzmann weighting across multiple conformational states. Three ensemble methods, namely Gradient Boosting Regression (GBR), eXtreme Gradient Boosting Regression (XGBR), and Random Forest Regression (RFR), and a Multi-layer Perceptron Regressor (MLPR) were compared based on their ability to accurately predict HL-gaps in this chemical space. Key findings reveal molecular polarizability, particularly SMR_VSA descriptors, as crucial for HL-gap determination in all models. Aromatic rings and functional groups, such as ketones, also significantly influence the HL-gap prediction. While the MLPR model demonstrated good overall predictive performance, accuracy varied across molecular subsets. Challenges were observed in predicting HL-gaps for molecules containing aliphatic carboxylic acids, alcohols, and amines in molecular systems with complex electronic structure. This work emphasizes the importance of polarizability and structural features in HL-gap predictive modeling, showcasing the potential of machine learning while also highlighting limitations in handling specific structural motifs. These limitations point towards promising directions for further model improvements.
{"title":"High throughput tight binding calculation of electronic HOMO–LUMO gaps and its prediction for natural compounds","authors":"Sascha Thinius","doi":"10.1039/D5DD00186B","DOIUrl":"https://doi.org/10.1039/D5DD00186B","journal":{"name":"Digital discovery","volume":"1","pages":"203-213"},"PeriodicalIF":6.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00186b?page=search"}
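The Boltzmann weighting across conformational states mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the study's workflow: the function names, the choice of electronvolts as the energy unit, and the default temperature of 298.15 K are all hypothetical, and the abstract does not say which quantity is averaged or at what temperature.

```python
import math
from typing import List, Sequence

# Boltzmann constant in eV/K (CODATA value).
K_B_EV_PER_K = 8.617333262e-5


def boltzmann_weights(energies_ev: Sequence[float], temperature_k: float = 298.15) -> List[float]:
    """Normalized Boltzmann weights for conformer energies given in eV."""
    if not energies_ev:
        raise ValueError("at least one conformer energy is required")
    beta = 1.0 / (K_B_EV_PER_K * temperature_k)
    # Shift by the minimum energy so the exponentials stay numerically stable.
    e_min = min(energies_ev)
    factors = [math.exp(-beta * (e - e_min)) for e in energies_ev]
    partition = sum(factors)
    return [f / partition for f in factors]


def boltzmann_average(
    values: Sequence[float],
    energies_ev: Sequence[float],
    temperature_k: float = 298.15,
) -> float:
    """Boltzmann-weighted average of a per-conformer property (e.g. an HL gap)."""
    if len(values) != len(energies_ev):
        raise ValueError("values and energies must have the same length")
    weights = boltzmann_weights(energies_ev, temperature_k)
    return sum(w * v for w, v in zip(weights, values))
```

Low-energy conformers dominate the average, while conformers far above the minimum contribute almost nothing, which is why only a handful of conformers per molecule usually matter in practice.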
Paolo Vincenzo Freiesleben de Blasio, Rune Kruger, Nis Fisker-Bødker, Jin Hyun Chang and Christodoulos Chatzichristodoulou
Fast and accurate synthesis and testing of electrocatalysts is essential to accelerate the development of next-generation catalysts for sustainable energy technologies. In this paper, we introduce CatBot, a fully automated platform for reliable synthesis and testing of electrocatalysts, capable of operating at temperatures of up to 100 °C and under conditions ranging from highly acidic to highly alkaline. The platform leverages roll-to-roll transfer, integrating customizable stages for substrate cleaning, catalyst loading, and electrochemical testing, with a custom-made liquid distribution system enabling multi-element electrocatalyst synthesis via electrodeposition. CatBot enables fabrication and testing of up to 100 electrocatalysts per day, significantly accelerating catalyst discovery and optimization. We demonstrate the platform's reproducibility through synthesis and testing of various catalytic coatings for the hydrogen evolution reaction (HER) in alkaline conditions, achieving overpotential uncertainties in the range of 4–13 mV at −100 mA cm⁻². Additionally, we benchmark the platform by comparing anodic and cathodic redox peaks for nickel in alkaline solutions, confirming consistency with previous studies. Thus, CatBot constitutes an automated, fast, reproducible, accurate, and scalable synthesis and testing system for the accelerated development of next-generation electrocatalysts.
{"title":"CatBot – a high-throughput catalyst synthesis and testing system with roll to roll transfer","authors":"Paolo Vincenzo Freiesleben de Blasio, Rune Kruger, Nis Fisker-Bødker, Jin Hyun Chang and Christodoulos Chatzichristodoulou","doi":"10.1039/D5DD00403A","DOIUrl":"https://doi.org/10.1039/D5DD00403A","journal":{"name":"Digital discovery","volume":"12","pages":"3810-3817"},"PeriodicalIF":6.2,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00403a?page=search"}
Jianming Mao, Yongkang Xi, Armin Shayesteh Zadeh, Allen P. Liu and Andrew L. Ferguson
Synthetic cells are prevalent models for understanding and recapitulating complicated functions of natural cells such as DNA replication and protein expression. Lipid-based vesicles are widely employed but are limited by their fragility under mechanical forces or osmotic pressure. Elastin-like polypeptides (ELPs) composed of repetitive (VPGXG) sequences present alternative building blocks with which to construct the delimiting membrane of synthetic cells possessing high structural stability and tolerance of harsh environmental stress. In this work, we present a high-throughput virtual screening pipeline combining coarse-grained simulations, alchemical free energy calculations, Gaussian process regression, and Bayesian optimization to traverse a library of amphiphilic diblock ELPs for mutant sequences predicted to form thermodynamically stable bilayer vesicles. From our screening campaign, we have identified a range of novel ELP candidates with enhanced predicted stability. Analysis of our screening data exposes new rational design principles that suggest incorporating particular guest residues in hydrophilic blocks – including histidine, tyrosine, and threonine – and in hydrophobic blocks – including alanine, phenylalanine, cysteine, and isoleucine – to enhance the thermodynamic stability of ELP bilayer vesicles. The computational pipeline greatly accelerates the discovery of ELP building blocks for synthetic cells, exposes new design principles for these molecules, and furnishes a transferable framework for designing peptides with desirable structural or functional properties.
{"title":"Computational design of polypeptide-based compartments for synthetic cells","authors":"Jianming Mao, Yongkang Xi, Armin Shayesteh Zadeh, Allen P. Liu and Andrew L. Ferguson","doi":"10.1039/D5DD00291E","DOIUrl":"https://doi.org/10.1039/D5DD00291E","journal":{"name":"Digital discovery","volume":"1","pages":"214-230"},"PeriodicalIF":6.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00291e?page=search"}
Pau Rocabert-Oriols, Camilla Lo Conte, Núria López and Javier Heras-Domingo
Identifying molecular structures from vibrational spectra is central to chemical analysis but remains challenging due to spectral ambiguity and the limitations of single-modality methods. While deep learning has advanced various spectroscopic characterization techniques, leveraging the complementary nature of infrared (IR) and Raman spectroscopies remains largely underexplored. We introduce VibraCLIP, a contrastive learning framework that embeds molecular graphs, IR and Raman spectra into a shared latent space. A lightweight fine-tuning protocol ensures generalization from theoretical to experimental datasets. VibraCLIP enables accurate, scalable, and data-efficient molecular identification, linking vibrational spectroscopy with structural interpretation. This tri-modal design captures rich structure–spectra relationships, achieving Top-1 retrieval accuracy of 81.7% and reaching 98.9% Top-25 accuracy with molecular mass integration. By integrating complementary vibrational spectroscopic signals with molecular representations, VibraCLIP provides a practical framework for automated spectral analysis, with potential applications in fields such as synthesis monitoring, drug development, and astrochemical detection.
{"title":"Multi-modal contrastive learning for chemical structure elucidation with VibraCLIP","authors":"Pau Rocabert-Oriols, Camilla Lo Conte, Núria López and Javier Heras-Domingo","doi":"10.1039/D5DD00269A","DOIUrl":"https://doi.org/10.1039/D5DD00269A","journal":{"name":"Digital discovery","volume":"12","pages":"3818-3827"},"PeriodicalIF":6.2,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00269a?page=search"}
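The contrastive embedding and Top-k retrieval described in the VibraCLIP abstract above can be illustrated with a generic CLIP-style sketch. This is not VibraCLIP's implementation: the function names, the temperature of 0.07, and the symmetric cross-entropy form are assumptions borrowed from standard contrastive learning practice, and the sketch handles only one pair of modalities (for example, graph embeddings against spectrum embeddings). How the three modalities are combined in the actual model is not stated in the abstract.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def similarity_matrix(a: Sequence[Vector], b: Sequence[Vector], temperature: float = 0.07) -> List[List[float]]:
    """Temperature-scaled cosine similarities between every row of a and of b."""
    a_n = [_normalize(v) for v in a]
    b_n = [_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b_n] for u in a_n]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # log-sum-exp shift for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(a: Sequence[Vector], b: Sequence[Vector], temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; a[i] and b[i] are assumed to be a matched pair."""
    if len(a) != len(b) or not a:
        raise ValueError("both modalities need the same non-zero batch size")
    logits = similarity_matrix(a, b, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


def top_k_accuracy(queries: Sequence[Vector], candidates: Sequence[Vector], k: int) -> float:
    """Fraction of queries whose matched candidate (same index) ranks in the top k."""
    logits = similarity_matrix(queries, candidates, temperature=1.0)
    hits = 0
    for i, row in enumerate(logits):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(logits)
```

Minimizing the loss pulls matched pairs together and pushes mismatched pairs apart in the shared space, and `top_k_accuracy` is the kind of retrieval metric that Top-1 and Top-25 figures refer to.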
Fullerenes, carbon-based nanomaterials with sp²-hybridized carbon atoms arranged in polyhedral cages, exhibit diverse isomeric structures with promising applications in optoelectronics, solar cells, and medicine. However, the vast number of possible fullerene isomers complicates efficient property prediction. In this study, we introduce FullereneNet, a graph neural network-based model that predicts fundamental properties of fullerenes using topological features derived solely from unoptimized structures, eliminating the need for computationally expensive quantum chemistry optimizations. The model leverages topological representations based on the chemical environments of pentagonal and hexagonal rings, enabling efficient capture of local structural details. We show that this approach yields superior performance in predicting the C–C binding energy for a wide range of fullerene sizes, achieving mean absolute errors of 3 meV per atom for C₆₀, 4 meV per atom for C₇₀, and 6 meV per atom for C₇₂–C₁₀₀, surpassing the values of the state-of-the-art machine learning interatomic potential GAP-20. Additionally, the FullereneNet model accurately predicts 11 other properties, including the HOMO–LUMO gap and solvation free energy, demonstrating robustness and transferability across fullerene types. This work provides a computationally efficient framework for high-throughput screening of fullerene candidates and establishes a foundation for future data-driven studies in fullerene chemistry.
{"title":"Extrapolating beyond C60: advancing prediction of fullerene isomers with FullereneNet","authors":"Bin Liu, Jirui Jin and Mingjie Liu","doi":"10.1039/D5DD00241A","DOIUrl":"https://doi.org/10.1039/D5DD00241A","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"1","pages":"123-133"},"PeriodicalIF":6.2,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00241a?page=search","platform":"Semanticscholar","paperid":"146006991","PMCID":"OA"}
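The pentagonal and hexagonal rings that the FullereneNet abstract uses as topological building blocks obey a simple counting rule. For a classical fullerene Cₙ (every atom bonded to three neighbours, faces limited to pentagons and hexagons), Euler's polyhedron formula fixes the number of pentagons at 12 and the number of hexagons at n/2 − 10. The sketch below only illustrates that rule; the function name `fullerene_ring_counts` is hypothetical and is not taken from the FullereneNet code.

```python
def fullerene_ring_counts(n_atoms: int) -> tuple[int, int]:
    """Return (pentagons, hexagons) for a classical fullerene C_n.

    Derivation: V = n, E = 3n/2 (three bonds per atom), and Euler's formula
    V - E + F = 2 gives F = n/2 + 2 faces. Counting edges per face,
    5p + 6h = 2E = 3n, which together with p + h = F yields p = 12 and
    h = n/2 - 10.
    """
    # Classical fullerenes exist for n = 20 and every even n >= 24.
    if n_atoms < 20 or n_atoms % 2 != 0 or n_atoms == 22:
        raise ValueError(f"no classical fullerene with {n_atoms} atoms")
    pentagons = 12
    hexagons = n_atoms // 2 - 10
    return pentagons, hexagons


if __name__ == "__main__":
    for n in (60, 70, 100):
        p, h = fullerene_ring_counts(n)
        print(f"C{n}: {p} pentagons, {h} hexagons")
```

The ring counts are identical for all isomers of a given size, so any isomer-specific descriptor has to encode how the rings are arranged relative to one another, which is consistent with the abstract's emphasis on the chemical environments of the rings.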
Shogo Tadokoro, Ryosuke Kamimura, Fumitaka Ishiwari and Akinori Saeki
Improving the performance of organic photovoltaics (OPVs) depends on the development of new p-type polymers and n-type non-fullerene acceptor (NFA) molecules. However, conventional experimental and theoretical methods are inefficient for exploring the vast chemical space. In this report, we use machine learning (ML) to explore simple-structured p-type polymers. Structural simplicity is associated with a small number of synthesis steps, which is relevant for low-cost, large-scale production. By considering the structural simplicity (estimated crudely from the molecular weight of the repeating unit) of 200 thousand virtually generated polymers, together with their synthetic accessibility, we focus on copolymers composed of benzoxadiazole as an acceptor and thiophene (or phenylene) as a donor. Although the structures of these copolymers resemble that of the high-performance, simple-structured polymer PTQ10, their structural symmetries (regioregularity) are modified for synthetic reasons. Through the characterization of the synthesized polymers, their OPV devices blended with the Y6 NFA, and the resultant synthetic complexity scores, we show that our polymer with a minor manual modification of the donor and alkyl chain exhibits a power conversion efficiency of 5.56%, which closely aligns with that predicted by ML and provides a basis for the further development of novel polymers with low synthesis and search costs.
{"title":"Design of simple-structured conjugated polymers for organic solar cells by machine learning-assisted structural modification and experimental validation","authors":"Shogo Tadokoro, Ryosuke Kamimura, Fumitaka Ishiwari and Akinori Saeki","doi":"10.1039/D5DD00418G","DOIUrl":"https://doi.org/10.1039/D5DD00418G","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"12","pages":"3774-3781"},"PeriodicalIF":6.2,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00418g?page=search","platform":"Semanticscholar","paperid":"145659238","PMCID":"OA"}
Yushuo Niu, Tianyu Li, Yuanyuan Zhu and Qian Yang
Transforming in situ transmission electron microscopy (TEM) imaging into a tool for spatially resolved operando characterization of solid-state reactions requires automated, high-precision semantic segmentation of dynamically evolving features. However, traditional deep learning methods for semantic segmentation are often limited by the scarcity of labeled data, visually ambiguous features of interest, and small objects. To tackle these challenges, we introduce MultiTaskDeltaNet (MTDN), a deep learning architecture that recasts the segmentation task as a change detection problem. By implementing a Siamese network with a U-Net backbone and using paired images to capture feature changes, MTDN leverages minimal data to produce high-quality segmentations. Furthermore, MTDN utilizes a multi-task learning strategy to exploit correlations between physical features of interest. In an evaluation using data from in situ environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN demonstrated a significant advantage over conventional segmentation models, particularly in accurately delineating fine structural features. Notably, MTDN achieved a 10.22% performance improvement over conventional segmentation models in predicting small and visually ambiguous physical features. This work bridges key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings.
{"title":"MultiTaskDeltaNet: change detection-based image segmentation for operando ETEM with application to carbon gasification kinetics","authors":"Yushuo Niu, Tianyu Li, Yuanyuan Zhu and Qian Yang","doi":"10.1039/D5DD00333D","DOIUrl":"https://doi.org/10.1039/D5DD00333D","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"1","pages":"290-303"},"PeriodicalIF":6.2,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00333d?page=search","platform":"Semanticscholar","paperid":"146007003","PMCID":"OA"}