Uriel Garcilazo-Cruz, Joseph O. Okeme and Rodrigo A. Vargas-Hernández
The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where on-site data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems such as webcams and microscopes to enable on-site image annotation. LivePyxel is designed to be easy to use, with a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics-editing software. Of particular interest are the availability of Bézier splines and binary masks, and the software's capacity to work with non-destructive layers that enable high-performance editing. LivePyxel also offers wide compatibility across video devices and is optimized for object-detection operations through OpenCV combined with NumPy's high-performance matrix and linear-algebra routines. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at https://github.com/UGarCil/LivePyxel.
{"title":"LivePyxel: accelerating image annotations with a Python-integrated webcam live streaming","authors":"Uriel Garcilazo-Cruz, Joseph O. Okeme and Rodrigo A. Vargas-Hernández","doi":"10.1039/D5DD00421G","DOIUrl":"https://doi.org/10.1039/D5DD00421G","url":null,"abstract":"<p >The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where on-site data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePixel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable on-site image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software's capacity to work with non-destructive layers that enable high-performance editing. LivePyxel also integrates a wide compatibility across video devices, and it's optimized for object detection operations <em>via</em> the use of OpenCV in combination with high-performance libraries designed to handle matrix and linear algebra operations <em>via</em> Numpy effectively. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. 
LivePyxel is freely available at https://github.com/UGarCil/LivePyxel.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 835-843"},"PeriodicalIF":6.2,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00421g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
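As an illustration of the binary-mask, non-destructive-layer idea the abstract highlights, here is a minimal NumPy sketch. The class and function names (`MaskLayer`, `paint_rect`, `composite`) are hypothetical, not LivePyxel's actual API: each annotation lives in its own boolean mask, and compositing never touches the captured frame.

```python
import numpy as np

class MaskLayer:
    """One non-destructive annotation layer: a boolean mask plus a class label."""
    def __init__(self, shape, label):
        self.mask = np.zeros(shape, dtype=bool)
        self.label = label

    def paint_rect(self, r0, r1, c0, c1):
        # Mark a rectangular region as annotated; the source frame is untouched
        self.mask[r0:r1, c0:c1] = True

def composite(frame, layers, colors):
    # Blend every layer over a *copy* of the frame, leaving the original pristine
    out = frame.copy()
    for layer in layers:
        out[layer.mask] = colors[layer.label]
    return out

frame = np.zeros((120, 160, 3), dtype=np.uint8)   # stand-in for a webcam frame
cell = MaskLayer(frame.shape[:2], "cell")
cell.paint_rect(10, 40, 20, 60)
annotated = composite(frame, [cell], {"cell": (255, 0, 0)})
```

Because each layer is stored separately, deleting or recoloring an annotation is a matter of dropping or re-compositing a mask, never of editing pixels in place.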
Despite the rapidly growing use of robots in industry, robotic automation of tasks in scientific laboratories is less prolific, owing to the lack of generalized methodologies and the high cost of hardware. This paper focuses on automating characterization tasks, with the aim of reducing cost while maintaining generality, and proposes a software architecture for building robotic systems in scientific laboratory environments. A dual-layer (Socket.IO and ROS) action-server design is the basic building block; it facilitates a web-based front end for user-friendly operation and the use of ROS Behavior Trees for convenient task planning and execution. A robotic platform for automating mineral and material sample characterization is built on this architecture, with an open-source, low-cost three-axis computer numerical control gantry system serving as the main robot. A handheld laser-induced breakdown spectroscopy (LIBS) analyzer is integrated with a 3D-printed adapter, enabling (1) automated 2D chemical mapping and (2) autonomous sample measurement (with the support of an RGB-depth camera). We demonstrate the utility of automated chemical mapping by scanning the surface of a spodumene-bearing pegmatite core sample with a 1071-point dense hyperspectral map acquired at a rate of 1520 bits per second. Furthermore, we showcase the autonomy of the platform in perception, dynamic decision-making, and execution through a case study of LIBS measurement of multiple mineral samples. The platform enables controlled and autonomous chemical quantification in the laboratory that complements field-based measurements acquired with the same handheld device, linking resource exploration and processing steps in the supply chain for lithium-based battery materials.
{"title":"Autonomous elemental characterization enabled by a low cost robotic platform built upon a generalized software architecture","authors":"Xuan Cao, Yuxin Wu and Michael L. Whittaker","doi":"10.1039/D5DD00263J","DOIUrl":"https://doi.org/10.1039/D5DD00263J","url":null,"abstract":"<p >Despite the rapidly growing applications of robots in industry, the use of robots to automate tasks in scientific laboratories is less prolific due to the lack of generalized methodologies and the high cost of hardware. This paper focuses on the automation of characterization tasks necessary for reducing cost while maintaining generalization and proposes a software architecture for building robotic systems in scientific laboratory environments. A dual-layer (Socket.IO and ROS) action server design is the basic building block, which facilitates the implementation of a web-based front end for user-friendly operation and the use of ROS Behavior Trees for convenient task planning and execution. A robotic platform for automating mineral and material sample characterization is built upon the architecture, with an open-source, low-cost three-axis computer numerical control gantry system serving as the main robot. A handheld laser induced breakdown spectroscopy (LIBS) analyzer is integrated with a 3D printed adapter, enabling (1) automated 2D chemical mapping and (2) autonomous sample measurement (with the support of an RGB-Depth camera). We demonstrate the utility of automated chemical mapping by scanning the surface of a spodumene-bearing pegmatite core sample with a 1071-point dense hyperspectral map acquired at a rate of 1520 bits per second. Furthermore, we showcase the autonomy of the platform in terms of perception, dynamic decision-making, and execution, through a case study of LIBS measurement of multiple mineral samples. 
The platform enables controlled and autonomous chemical quantification in the laboratory that complements field-based measurements acquired with the same handheld device, linking resource exploration and processing steps in the supply chain for lithium-based battery materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 891-900"},"PeriodicalIF":6.2,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00263j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
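The behavior-tree pattern mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' Socket.IO/ROS implementation, and every name (`Action`, `Sequence`, the task names) is invented: a Sequence node ticks its children in order and succeeds only if all of them do, which is how a plan like "locate, move, measure" is typically chained.

```python
from typing import Callable, List

class Action:
    """Leaf node: wraps a callable that returns True on success."""
    def __init__(self, name: str, fn: Callable[[], bool]):
        self.name, self.fn = name, fn
    def tick(self) -> bool:
        return self.fn()

class Sequence:
    """Composite node: succeeds only if every child succeeds, in order."""
    def __init__(self, children: List):
        self.children = children
    def tick(self) -> bool:
        return all(child.tick() for child in self.children)

log = []
plan = Sequence([
    Action("locate_sample", lambda: log.append("locate") or True),
    Action("move_gantry",   lambda: log.append("move") or True),
    Action("fire_libs",     lambda: log.append("libs") or True),
])
ok = plan.tick()
```

Because `all()` short-circuits, a failed child (say, a perception step that cannot find the sample) halts the plan at that point, which is the property that makes behavior trees convenient for dynamic decision-making.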
Junru Ren, Abhijoy Mandal, Rama El-khawaldeh, Shi Xuan Leong, Jason Hein, Alán Aspuru-Guzik, Lazaros Nalpantidis and Kourosh Darvish
Real-time monitoring of laboratory experiments is essential for automating complex workflows and enhancing experimental efficiency. Accurate detection and classification of chemicals in varying forms and states support a range of techniques, including liquid–liquid extraction, distillation, and crystallization. However, detecting chemical forms is challenging: some classes appear visually similar, and the classification of forms is often context-dependent. In this study, we adapt the YOLO model into a multi-modal architecture that integrates scene images and task context for object detection. With the help of large language models (LLMs), the method reasons about the experimental process and uses the reasoning result as context guidance for the detection model. Experimental results show that introducing context during training and inference improves the performance of the proposed model, YOLO-text, across all classes, and that the model makes accurate predictions on visually similar regions. Compared to the baseline, our model improves overall mAP by 4.8% without context and by 7% with context. The proposed framework can classify and localize substances with and without contextual suggestions, thereby enhancing the adaptability and flexibility of the detection process.
{"title":"Context-aware computer vision for chemical reaction state detection","authors":"Junru Ren, Abhijoy Mandal, Rama El-khawaldeh, Shi Xuan Leong, Jason Hein, Alán Aspuru-Guzik, Lazaros Nalpantidis and Kourosh Darvish","doi":"10.1039/D5DD00346F","DOIUrl":"https://doi.org/10.1039/D5DD00346F","url":null,"abstract":"<p >Real-time monitoring of laboratory experiments is essential for automating complex workflows and enhancing experimental efficiency. Accurate detection and classification of chemicals in varying forms and states support a range of techniques, including liquid–liquid extraction, distillation, and crystallization. However, challenges exist in the detection of chemical forms: some classes appear visually similar, and the classification of the forms is often context-dependent. In this study, we adapt the YOLO model into a multi-modal architecture that integrates scene images and task context for object detection. With the help of Large Language Models (LLM), the developed method facilitates reasoning about the experimental process and uses the reasoning result as the context guidance for the detection model. Experimental results show that by introducing context during training and inference, the performance of the proposed model, YOLO-text, has improved among all classes, and the model is able to make accurate predictions on visually similar areas. Compared to the baseline, our model increases 4.8% overall mAP without context given and 7% with context. 
The proposed framework can classify and localize substances with and without contextual suggestions, thereby enhancing the adaptability and flexibility of the detection process.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 630-642"},"PeriodicalIF":6.2,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00346f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
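One simple way to picture context guidance is as a prior that re-weights per-class detection scores. The sketch below is a hedged stand-in for the paper's text-conditioned architecture (the real model fuses features inside the network); the class names, logits, and prior values are invented. It shows how context can flip a decision between two visually similar classes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_scores(logits, context_prior):
    # Multiply visual class probabilities by a task-context prior, renormalize
    probs = softmax(logits) * context_prior
    return probs / probs.sum()

classes = ["clear liquid", "cloudy liquid", "solid"]
logits = np.array([2.0, 1.9, 0.1])                 # visually ambiguous detection
crystallization_prior = np.array([0.2, 0.3, 0.5])  # context: solids expected soon

p = contextual_scores(logits, crystallization_prior)
```

Without the prior, "clear liquid" narrowly wins on visual evidence alone; after weighting by the (hypothetical) crystallization context, the ambiguous detection tips to "cloudy liquid".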
Danny Reidenbach, Filipp Nikitin, Olexandr Isayev and Saee Gopal Paliwal
De novo 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality geometries, even when they are able to generate valid molecular graphs. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon, a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained with a joint continuous and discrete denoising co-design objective. We assess Megalodon's performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model's capability to generate realistic molecular structures, with a particular focus on geometric precision. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, we demonstrate that scaling Megalodon produces up to 49× more valid molecules at large sizes and 2–10× lower energy compared to the prior best generative models. The code and the model are available at https://github.com/NVIDIA-Digital-Bio/megalodon.
{"title":"Applications of modular co-design for de novo 3D molecule generation","authors":"Danny Reidenbach, Filipp Nikitin, Olexandr Isayev and Saee Gopal Paliwal","doi":"10.1039/D5DD00380F","DOIUrl":"https://doi.org/10.1039/D5DD00380F","url":null,"abstract":"<p > <em>De novo</em> 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality geometries, even if they able to generate valid molecular graphs. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon – a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained using a joint continuous and discrete denoising co-design objective. We assess Megalodon's performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model's capability to generate realistic molecular structures, particularly focusing on geometry precision. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, we demonstrate that scaling Megalodon produces up to 49× more valid molecules at large sizes and 2–10× lower energy compared to the prior best generative models. 
The code and the model are available at https://github.com/NVIDIA-Digital-Bio/megalodon.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 754-768"},"PeriodicalIF":6.2,"publicationDate":"2026-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00380f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
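A joint continuous and discrete denoising objective can be illustrated as a coordinate regression term plus an atom-type classification term. The NumPy sketch below is an assumption-laden simplification, not Megalodon's actual loss: it pairs a mean-squared error on 3D coordinates with a cross-entropy on atom types, weighted by a hypothetical coefficient `w`.

```python
import numpy as np

def co_design_loss(pred_coords, true_coords, pred_type_logits, true_types, w=1.0):
    # Continuous branch: mean squared error on 3D coordinates
    mse = np.mean((pred_coords - true_coords) ** 2)
    # Discrete branch: cross-entropy on atom types via a stable log-softmax
    z = pred_type_logits - pred_type_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(true_types)), true_types])
    return mse + w * ce

rng = np.random.default_rng(0)
coords_true = rng.normal(size=(5, 3))              # a 5-atom toy molecule
types = np.array([0, 1, 2, 3, 0])                  # atom-type class indices
loss = co_design_loss(coords_true + 0.1, coords_true,
                      rng.normal(size=(5, 4)), types)
```

The point of co-designing the two branches is that they share one network: a prediction that places atoms well but mislabels them (or vice versa) is still penalized, so geometry and graph validity are learned together.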
Xuefeng Bai, Zhiling Zheng, Xin Zhang, Hao-Tian Wang, Rui Yang and Jian-Rong Li
Large Language Models (LLMs) have the potential to transform chemical research. Nevertheless, their general-purpose design constrains scientific understanding and reasoning within specialized fields like chemistry. In this study, we introduce MOFReasoner, a domain model designed to enhance scientific reasoning, using Metal–Organic Framework (MOF) adsorption as a case study. By employing knowledge distillation from teacher models and Chain-of-Thought (CoT) reasoning extracted from a corpus of over 8242 research articles and 500 reviews, we developed a domain-specific chemical reasoning dataset. The LLMs were then fine-tuned on domain-specific chemical reasoning, general chemistry, and general reasoning datasets. The model's performance was evaluated across four tasks: experimental studies, chemical mechanisms, application scenarios, and industrialization challenges. MOFReasoner outperformed existing general-purpose models such as GPT-4.5 and DeepSeek-R1. Furthermore, the model achieves prediction accuracy comparable to density functional theory (DFT), enabling material recommendations. This work underscores the potential of integrating domain-specific knowledge, CoT reasoning, and knowledge distillation to create LLMs that support scientific inquiry and decision-making within the discipline of chemistry.
{"title":"MOFReasoner: think like a scientist—a reasoning large language model via knowledge distillation","authors":"Xuefeng Bai, Zhiling Zheng, Xin Zhang, Hao-Tian Wang, Rui Yang and Jian-Rong Li","doi":"10.1039/D5DD00429B","DOIUrl":"https://doi.org/10.1039/D5DD00429B","url":null,"abstract":"<p >Large Language Models (LLMs) have the potential to transform chemical research. Nevertheless, their general-purpose design constrains scientific understanding and reasoning within specialized fields like chemistry. In this study, we introduce MOFReasoner, a domain model designed to enhance scientific reasoning, using Metal–Organic Framework (MOF) adsorption as a case study. By employing knowledge distillation from teacher models and Chain-of-Thought (CoT) reasoning extracted from a corpus of over 8242 research articles and 500 reviews, we developed a domain-specific chemical reasoning dataset. Using domain-specific chemical reasoning datasets, general chemistry datasets, and general reasoning datasets, the LLMs were fine-tuned. The model's performance was evaluated across four tasks: experimental studies, chemical mechanisms, application scenarios, and industrialization challenges. MOFReasoner outperformed existing general-purpose models, such as GPT-4.5 and DeepSeek-R1. Furthermore, the model achieves prediction accuracy comparable to DFT, enabling material recommendations. 
This work underscores the potential of integrating domain-specific knowledge, CoT reasoning, and knowledge distillation in creating LLMs that support scientific inquiry and decision-making within the discipline of chemistry.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 869-877"},"PeriodicalIF":6.2,"publicationDate":"2026-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00429b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
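Knowledge distillation from teacher models is conventionally framed as minimizing a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that standard objective for a single example; the temperature and logit values are illustrative, not taken from MOFReasoner, which distills reasoning traces rather than raw logits.

```python
import numpy as np

def softened(logits, T):
    # Temperature-softened probability distribution over classes/tokens
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student): the student is pushed toward the teacher's
    # full distribution, not just its top-1 answer
    p = softened(teacher_logits, T)
    q = softened(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

matched  = distill_kl([3.0, 1.0, 0.2], [3.0, 1.0, 0.2])  # student mimics teacher
mismatch = distill_kl([3.0, 1.0, 0.2], [0.0, 3.0, 1.0])  # student disagrees
```

Raising the temperature exposes the teacher's "dark knowledge" (the relative weight it puts on wrong answers), which is the information a hard label would discard.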
Tarik Ćerimagić, Sergey Sosnin and Gerhard F. Ecker
Solute carrier (SLC) transporters constitute the largest family of membrane transport proteins in humans. They facilitate the movement of ions, neurotransmitters, nutrients, and drugs. Given their critical role in regulating cellular physiology, they are important therapeutic targets for neurological and psychological disorders, metabolic diseases, and cancer. Inhibition of SLC transporters can modulate substrate gradients, restrict the cellular uptake of nutrients and drugs, and thereby facilitate specific pharmacological effects. Despite their pharmaceutical relevance, many SLC transporters remain understudied. A complete bioactivity matrix of associated compounds would expand the knowledge base of SLC ligands, enlarge the information pool that guides downstream processes, and promote informed decision-making in the discovery of new drug candidates for SLC transporters. To address the sparsity of available compound bioactivity (inhibition) data for SLC transporters, we employed a multi-task learning (MTL) approach with a data-imputation objective. By leveraging relationships between related tasks, deep learning has previously shown promise in imputing compound bioactivities across multiple assays. We developed a multi-task deep neural network (MT-DNN) to predict and impute missing pChEMBL (−log(IC50)) values across the SLC transporter superfamily. With a data matrix density of 2.53% and an R² of 0.74, our model demonstrated robust predictive performance. Specifically, we predicted missing values for 9122 unique compounds across 54 SLC targets spanning various folds and subfamilies, generating 480 133 predictions from 12 455 known interactions. The advantages of the multi-task approach were evident in the ability of certain targets to leverage the shared representation and achieve higher predictive accuracy than their single-task learning (STL) counterparts. Despite the limitations imposed by low data density, activity cliffs, and inter-protein heterogeneity, the MT-DNN shows promise as a tool to address data sparsity within the SLC superfamily.
{"title":"A multi-task learning approach for prediction of missing bioactivity values of compounds for the SLC transporter superfamily","authors":"Tarik Ćerimagić, Sergey Sosnin and Gerhard F. Ecker","doi":"10.1039/D5DD00536A","DOIUrl":"https://doi.org/10.1039/D5DD00536A","url":null,"abstract":"<p >Solute carrier (SLC) transporters constitute the largest family of membrane transport proteins in humans. They facilitate the movement of ions, neurotransmitters, nutrients, and drugs. Given their critical role in regulating cellular physiology, they are important therapeutic targets for neurological and psychological disorders, metabolic diseases, and cancer. Inhibition of SLC transporters can modulate substrate gradients, restrict the cellular uptake of nutrients and drugs, and thereby facilitate specific pharmacological effects. Despite their pharmaceutical relevance, many SLC transporters remain understudied. Having a complete bioactivity matrix of associated compounds can expand the knowledge base of SLC ligands, enlarge the information pool to guide downstream processes, and promote informed decision-making steps in the discovery of new drug candidates for SLC transporters. To address the data sparsity of available compound-bioactivity values causing inhibitory responses for SLC transporters, we employed a multi-task learning (MTL) approach with a data imputation objective. By leveraging relationships between related tasks, deep learning has previously shown promise in imputing compound bioactivities across multiple assays. We developed a multi-task deep neural network (MT-DNN) to predict and impute missing pChEMBL (−log(IC50)) values across the SLC transporter superfamily. With a data matrix density of 2.53% and an <em>R</em><small><sup>2</sup></small> of 0.74, our model demonstrated robust predictive performance. 
Specifically, we predicted missing values for 9122 unique compounds across 54 SLC targets spanning various folds and subfamilies, generating 480 133 predictions from 12 455 known interactions. The advantages of the multi-task learning approach were indicated in the ability of certain targets to leverage the shared representation of knowledge and acquire increased predictive accuracy over single-task learning (STL) counterparts. Despite the limitations set by low data density, activity cliffs, and inter-protein heterogeneity, the MT-DNN showed promising potential as a tool to address data sparsity within the SLC superfamily.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 878-890"},"PeriodicalIF":6.2,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00536a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
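The core idea of imputing a sparse compound-by-target bioactivity matrix can be illustrated with a low-rank factorization fit only on observed cells. This NumPy sketch is a toy stand-in for the MT-DNN: the matrix dimensions, rank, and hyperparameters are made up, and the shared factors `U` and `V` play the role of the network's shared representation across tasks.

```python
import numpy as np

def impute(M, observed, rank=2, steps=3000, lr=0.05, seed=0):
    # Fit U @ V.T to the observed cells only, by gradient descent;
    # the product then supplies values for the missing cells
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.normal(size=(n, rank))
    V = 0.1 * rng.normal(size=(m, rank))
    for _ in range(steps):
        R = (U @ V.T - M) * observed        # residual on observed entries only
        U, V = U - lr * R @ V, V - lr * R.T @ U
    return U @ V.T                          # dense: missing cells now filled

rng = np.random.default_rng(1)
true = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))  # rank-2 "bioactivities"
mask = rng.random((8, 6)) < 0.6                           # ~60% of cells observed
pred = impute(true * mask, mask)
```

As in multi-task learning proper, each column (target) borrows statistical strength from the others through the shared factors, which is what makes prediction possible at all when most cells are empty.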
Weiling Wang, Isabel Cooley, Morgan R. Alexander, Ricky D. Wildman, Anna K. Croft and Blair F. Johnston
Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms (eXtreme Gradient Boosting, Random Forest, and Support Vector Machine) were benchmarked under both 10-fold and leave-one-solute-out cross-validation. A comparison of four major descriptor sets (MOE, Mordred, RDKit descriptors, and Morgan fingerprints) offers the first descriptor-level assessment of how COSMO-RS-calculated solubility augmentation interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
{"title":"A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents","authors":"Weiling Wang, Isabel Cooley, Morgan R. Alexander, Ricky D. Wildman, Anna K. Croft and Blair F. Johnston","doi":"10.1039/D5DD00456J","DOIUrl":"https://doi.org/10.1039/D5DD00456J","url":null,"abstract":"<p >Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into Machine Learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several novel ways. The direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. The comparison between four major descriptor sets, including MOE, Mordred, RDKit descriptors, and Morgan Fingerprints, offering the first descriptor-level assessment of how COSMO-RS calculated solubility augmentation interacts with diverse chemical feature space. The statistical Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. 
This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 716-733"},"PeriodicalIF":6.2,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00456j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
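Y-scrambling can be demonstrated in a few lines: fit a model on the true descriptor–label pairing, fit it again after permuting the labels, and compare R². The sketch below uses synthetic data and ordinary least squares as a stand-in for the benchmarked models (XGBoost, Random Forest, SVM); a genuine structure–property relationship survives, while the scrambled fit collapses toward chance.

```python
import numpy as np

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_r2(X, y):
    # Ordinary least squares with intercept; returns training R^2
    A = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return r2(y, A @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                            # descriptor matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.3]) + 0.1 * rng.normal(size=200)

real = fit_r2(X, y)                        # true descriptor-label pairing
scrambled = fit_r2(X, rng.permutation(y))  # Y-scrambled: pairing destroyed
```

If the scrambled R² were comparable to the real one, the apparent performance would be an artefact of dimensionality (enough descriptors can fit anything), which is exactly what the validation rules out.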
Jing Xiao, Jinfeng Chen, Ye Ding, You Xu and Jing Huang
Molecular dynamics (MD) simulation is a powerful tool for investigating complex systems in physical, materials, and biological sciences. However, computational speed remains a critical bottleneck that limits its broader application. To address this challenge, we developed a dedicated hardware module based on modern field-programmable gate arrays (FPGAs) that accelerates all components of MD simulations. Our design employs pipelining strategies to optimize task execution within a fully parallel architecture, significantly enhancing performance. The latest generation of high-bandwidth memory (HBM2) is integrated and optimized to improve computational throughput. At the hardware level, we implemented an optimized register-transfer level (RTL) circuit design for a single node to maximize the efficiency of register read and write operations. Software co-design with SIMD frameworks ensures seamless integration of force calculations and system propagation. We validated the implementation across systems ranging from argon gas to solvated proteins, demonstrating stable MD trajectories and close agreement with reference energy values. This work presents a novel FPGA-based MD simulation architecture and provides a foundation for further improvements in hardware-accelerated molecular simulations.
{"title":"Molecular dynamics simulations accelerated on FPGA with high-bandwidth memory","authors":"Jing Xiao, Jinfeng Chen, Ye Ding, You Xu and Jing Huang","doi":"10.1039/D5DD00391A","DOIUrl":"https://doi.org/10.1039/D5DD00391A","url":null,"abstract":"<p >Molecular dynamics (MD) simulation is a powerful tool for investigating complex systems in physical, materials, and biological sciences. However, computational speed remains a critical bottleneck that limits its broader application. To address this challenge, we developed a dedicated hardware module based on modern field-programmable gate arrays (FPGAs) that accelerates all components of MD simulations. Our design employs pipelining strategies to optimize task execution within a fully parallel architecture, significantly enhancing performance. The latest generation of high-bandwidth memory (HBM2) is integrated and optimized to improve computational throughput. At the hardware level, we implemented an optimized register-transfer level (RTL) circuit design for a single node to maximize the efficiency of register read and write operations. Software co-design with SIMD frameworks ensures seamless integration of force calculations and system propagation. We validated the implementation across systems ranging from argon gas to solvated proteins, demonstrating stable MD trajectories and close agreement with reference energy values. 
This work presents a novel FPGA-based MD simulation architecture and provides a foundation for further improvements in hardware-accelerated molecular simulations.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 844-861"},"PeriodicalIF":6.2,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00391a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
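The two stages the abstract says are pipelined on the FPGA, force calculation and system propagation, can be sketched in plain NumPy. This is a minimal software illustration of those steps (a two-particle Lennard-Jones system integrated with velocity Verlet), not the authors' FPGA design; all parameters and reduced units are illustrative assumptions.

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces: the force-evaluation stage.

    O(N^2) double loop, written for clarity rather than speed; the
    paper's hardware pipeline accelerates exactly this kind of kernel.
    """
    n = len(pos)
    f = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = r @ r
            inv6 = (sigma**2 / d2) ** 3
            # Force on particle i from the 4*eps*(r^-12 - r^-6) potential
            fij = 24 * eps * (2 * inv6**2 - inv6) / d2 * r
            f[i] += fij
            f[j] -= fij
    return f

def velocity_verlet(pos, vel, dt, steps, mass=1.0):
    """System-propagation stage: standard velocity-Verlet integration."""
    f = lj_forces(pos)
    for _ in range(steps):
        vel += 0.5 * dt * f / mass
        pos += dt * vel
        f = lj_forces(pos)
        vel += 0.5 * dt * f / mass
    return pos, vel

# Two argon-like particles placed at the LJ minimum separation
# (r = 2^(1/6) * sigma), where the net force vanishes.
pos = np.array([[0.0, 0.0, 0.0], [2 ** (1 / 6), 0.0, 0.0]])
vel = np.zeros_like(pos)
pos, vel = velocity_verlet(pos, vel, dt=1e-3, steps=100)
```

Started at the potential minimum with zero velocity, the pair remains stationary, a quick sanity check that the force and propagation stages are consistent.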
Sensing devices are fabricated from stimuli-responsive materials. In general, responsivity is tuned by designing molecules and materials on the basis of expert experience. If predictors of responsivity can be constructed, the number of experiments, and the associated time, cost, and effort, can be reduced. However, such dynamic properties of functional polymer materials are difficult to predict because of small datasets and complex structure–function relationships; how to prepare a dataset and train models on small data remains a significant challenge. The present work presents the construction and application of a prediction model for controlling the thermoresponsive color-changing properties of layered polydiacetylenes (PDAs), whose responsivity is tuned by the intercalated guest molecules. The training dataset was prepared from a series of photographs recording the color at each temperature. The prediction model of the thermoresponsivity, namely the color-changing temperature, was constructed by combining machine learning with our chemical insight derived from the small experimental dataset. The model then predicted the thermoresponsivity of newly synthesized layered PDAs. The modeling approach can be applied to predict various dynamic properties of functional polymer materials.
{"title":"A data-driven approach to control stimulus responsivity of functional polymer materials: predicting thermoresponsive color-changing properties of polydiacetylene","authors":"Risako Shibata, Nano Shioda, Hiroaki Imai, Yasuhiko Igarashi and Yuya Oaki","doi":"10.1039/D5DD00442J","DOIUrl":"https://doi.org/10.1039/D5DD00442J","url":null,"abstract":"<p >Sensing devices are fabricated using stimuli-responsive materials. In general, the responsivity is controlled by designing molecules and materials based on professional experience. If predictors are constructed for the responsivity control, the number of experiments can be reduced without consumption of time, cost, and effort. However, such dynamic properties of functional polymer materials are not easily predicted because of the small data and complex structure–function relationship. How to prepare a dataset and train small data remain significant challenges. The present work shows construction and application of a prediction model for controlling thermoresponsive color-changing properties of layered polydiacetylenes (PDAs). The responsivity was changed by the intercalated guest molecules. The training dataset was prepared from a series of the photographs representing the color at each temperature. The prediction model of the thermoresponsivity, namely color-changing temperature, was constructed by combining machine learning and our chemical insight based on the small experimental data. The thermoresponsivity of the newly synthesized layered PDAs was predicted by the model. 
The modeling methods can be applied to predict various dynamic properties of functional polymer materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 862-868"},"PeriodicalIF":6.2,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00442j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
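The small-data prediction task described above, mapping guest-molecule descriptors to a color-changing temperature, can be illustrated with a minimal least-squares sketch. This is not the authors' model; the descriptor columns, values, and linear form are hypothetical stand-ins for the chemical-insight-guided features the paper describes.

```python
import numpy as np

# Hypothetical small dataset: each row holds descriptors of an
# intercalated guest molecule (e.g. chain length, a polarity index);
# the target is the observed color-changing temperature in deg C.
X = np.array([
    [4, 0.2],
    [6, 0.3],
    [8, 0.1],
    [10, 0.4],
    [12, 0.2],
], dtype=float)
y = np.array([55.0, 60.0, 68.0, 74.0, 80.0])

# Append a bias column and fit ordinary least squares: a common
# baseline for small-data regression before feature selection.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_temperature(descriptors):
    """Predict the color-changing temperature for a new guest molecule."""
    d = np.asarray(descriptors, dtype=float)
    return float(np.append(d, 1.0) @ coef)

# Interpolate for a hypothetical new guest molecule
t_new = predict_temperature([9, 0.25])
```

With only a handful of samples, a simple linear model plus expert-chosen descriptors is often more robust than a flexible learner, which is the general trade-off the abstract's "machine learning plus chemical insight" framing points at.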
Edward C. Lee, Daniel Salley, Abhishek Sharma and Leroy Cronin
Crystallisation is central to purification and to determining structure and material properties, yet small changes in conditions can produce many different polymorphs with distinct behaviours. Because crystallisation depends on multiple variables, including solvent, temperature, pressure, and atmosphere, and often proceeds unpredictably, mapping these outcomes is slow and expensive. Here we introduce a robotic crystal search engine that explores crystallisation space efficiently and autonomously. The platform couples high-throughput liquid handling with a closed-loop, human-supervised computer-vision system that uses machine learning to detect crystals, distinguish polymorphs, and identify previously unseen forms. Using a benchmark polymorphic compound, we show that the robot can rapidly navigate a high-dimensional solvent space, quantify relative polymorph yields directly from images, and build a phase diagram without recourse to crystallography. This approach reveals the full set of polymorphs accessible under given conditions and identifies the optimal conditions for producing each one.
{"title":"AI-driven robotic crystal explorer for rapid polymorph identification","authors":"Edward C. Lee, Daniel Salley, Abhishek Sharma and Leroy Cronin","doi":"10.1039/D5DD00203F","DOIUrl":"https://doi.org/10.1039/D5DD00203F","url":null,"abstract":"<p >Crystallisation is central to purification and to determining structure and material properties, yet small changes in conditions can produce many different polymorphs with distinct behaviours. Because crystallisation depends on multiple variables including solvent, temperature, pressure, and atmosphere and often proceeds unpredictably, mapping these outcomes is slow and expensive. Here we introduce a robotic crystal search engine that explores crystallisation space efficiently and autonomously. The platform couples high-throughput liquid handling with a closed-loop computer-vision system combined with human supervision that uses machine learning to detect crystals, distinguish polymorphs, and identify previously unseen forms. Using a benchmark polymorphic compound, we show that the robot can rapidly navigate a high-dimensional solvent space, quantify relative polymorph yields directly from images, and build a phase diagram without recourse to crystallography. 
This approach reveals the full set of polymorphs accessible under given conditions and identifies the optimal conditions for producing each one.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 734-742"},"PeriodicalIF":6.2,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2026/dd/d5dd00203f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146211314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
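The image-based crystal detection step can be caricatured with a bright-pixel heuristic: crystals often appear as bright regions against a darker solution in a well image. The sketch below is a crude stand-in for the platform's ML detector; the threshold, image sizes, and synthetic "well" frames are all illustrative assumptions.

```python
import numpy as np

def crystal_score(gray_image, threshold=200):
    """Fraction of pixels at or above a brightness threshold.

    A high score suggests crystallisation has occurred in the frame;
    both the threshold and the heuristic itself are illustrative.
    """
    img = np.asarray(gray_image, dtype=float)
    return float((img >= threshold).mean())

# Synthetic 64x64 well images: one "empty" (dim noise only), one with a
# bright 16x16 patch mimicking a crystal.
rng = np.random.default_rng(0)
empty_well = rng.integers(0, 120, size=(64, 64))
crystal_well = empty_well.copy()
crystal_well[20:36, 20:36] = 255  # bright patch

score_empty = crystal_score(empty_well)
score_crystal = crystal_score(crystal_well)
```

Comparing such per-well scores across conditions is the simplest form of the image-derived yield quantification the abstract describes; the real system additionally classifies polymorphs, which a single scalar score cannot do.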