Ben Cree, Mateusz K. Bieniek, Siddique Amin, Akane Kawamura and Daniel J. Cole
FEgrow is an open-source software package for building congeneric series of compounds in protein binding pockets. For a given ligand core and receptor structure, it employs hybrid machine learning/molecular mechanics potential energy functions to optimise the bioactive conformers of supplied linkers and functional groups. Here, we introduce significant new functionality to automate, parallelise and accelerate the building and scoring of compound suggestions, such that it can be used for automated de novo design. We interface the workflow with active learning to improve the efficiency of searching the combinatorial space of possible linkers and functional groups, make use of interactions formed by crystallographic fragments in scoring compound designs, and introduce the option to seed the chemical space with molecules available from on-demand chemical libraries. As a test case, we target the main protease (Mpro) of SARS-CoV-2, identifying several small molecules with high similarity to molecules discovered by the COVID moonshot effort, using only structural information from a fragment screen in a fully automated fashion. Finally, we order and test 19 compound designs, of which three show weak activity in a fluorescence-based Mpro assay, but work is needed to further optimise the prioritisation of compounds for purchase. The FEgrow package and full tutorials demonstrating the active learning workflow are available at https://github.com/cole-group/FEgrow.
{"title":"Active learning driven prioritisation of compounds from on-demand libraries targeting the SARS-CoV-2 main protease†","authors":"Ben Cree, Mateusz K. Bieniek, Siddique Amin, Akane Kawamura and Daniel J. Cole","doi":"10.1039/D4DD00343H","DOIUrl":"10.1039/D4DD00343H","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"438-450"},"PeriodicalIF":6.2,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11726688/pdf/","platform":"Semanticscholar","paperid":"143017414","PMCID":"OA"}
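The active learning idea in the FEgrow abstract above (score a few candidates, fit a cheap model, use it to choose what to score next) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not FEgrow's workflow or API: `oracle` is a hypothetical stand-in for the expensive build-and-score step, the surrogate is a simple nearest-neighbour lookup, and the candidate pool is just the integers 0 to 99.

```python
import random


def oracle(x):
    """Hypothetical stand-in for an expensive scoring step (higher is better)."""
    return -(x - 70) ** 2


def surrogate_predict(x, evaluated):
    """Predict a score for x from the nearest already-scored candidate (1-NN)."""
    nearest = min(evaluated, key=lambda seen: abs(seen - x))
    return evaluated[nearest]


def active_learning(pool, n_initial=5, batch_size=5, n_cycles=4, seed=0):
    """Score a random seed set, then repeatedly score the surrogate's top picks."""
    rng = random.Random(seed)
    evaluated = {x: oracle(x) for x in rng.sample(pool, n_initial)}
    for _ in range(n_cycles):
        remaining = [x for x in pool if x not in evaluated]
        # Greedy acquisition: rank unscored candidates by predicted score.
        remaining.sort(key=lambda x: surrogate_predict(x, evaluated), reverse=True)
        for x in remaining[:batch_size]:
            evaluated[x] = oracle(x)
    return evaluated


pool = list(range(100))
scores = active_learning(pool)
best = max(scores, key=scores.get)
print(f"scored {len(scores)} of {len(pool)} candidates; best so far: {best}")
```

Only a quarter of the pool is scored here. A real campaign would swap in a learned regression model and an acquisition rule that also rewards uncertainty, rather than the purely greedy choice used in this sketch.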
Max Pinheiro, Matheus de Oliveira Bispo, Rafael S Mattos, Mariana Telles do Casal, Bidhan Chandra Garain, Josene M Toldo, Saikat Mukherjee, Mario Barbatti
The analysis of nonadiabatic molecular dynamics (NAMD) data presents significant challenges due to its high dimensionality and complexity. To address these issues, we introduce ULaMDyn, a Python-based, open-source package designed to automate the unsupervised analysis of large datasets generated by NAMD simulations. ULaMDyn integrates seamlessly with the Newton-X platform and employs advanced dimensionality reduction and clustering techniques to uncover hidden patterns in molecular trajectories, enabling a more intuitive understanding of excited-state processes. Using the photochemical dynamics of fulvene as a test case, we demonstrate how ULaMDyn efficiently identifies critical molecular geometries and critical nonadiabatic transitions. The package offers a streamlined, scalable solution for interpreting large NAMD datasets. It is poised to facilitate advances in the study of excited-state dynamics across a wide range of molecular systems.
{"title":"ULaMDyn: enhancing excited-state dynamics analysis through streamlined unsupervised learning.","authors":"Max Pinheiro, Matheus de Oliveira Bispo, Rafael S Mattos, Mariana Telles do Casal, Bidhan Chandra Garain, Josene M Toldo, Saikat Mukherjee, Mario Barbatti","doi":"10.1039/d4dd00374h","DOIUrl":"10.1039/d4dd00374h","PeriodicalId":72816,"journal":{"name":"Digital discovery"},"PeriodicalIF":6.2,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11774233/pdf/","platform":"Semanticscholar","paperid":"143070061","PMCID":"OA"}
Sara Masarone, Katie V. Beckwith, Matthew R. Wilkinson, Shreshth Tuli, Amy Lane, Sam Windsor, Jordan Lane and Layla Hosseini-Gerami
Modern drug discovery projects are plagued by high failure rates, many of which have safety as the underlying cause. The drug discovery process involves selecting the right compounds from a pool of possible candidates to satisfy some pre-set requirements. As this process is costly and time-consuming, finding toxicities at later stages can result in project failure. In this context, the use of existing data from previous projects can help develop computational models (e.g. QSARs) and algorithms to speed up the identification of compound toxicity. While clinical and in vivo data continue to be fundamental, data originating from organ-on-a-chip models, cell lines and previous studies can accelerate the drug discovery process, allowing for faster identification of toxicities and thus saving time and resources.
{"title":"Advancing predictive toxicology: overcoming hurdles and shaping the future","authors":"Sara Masarone, Katie V. Beckwith, Matthew R. Wilkinson, Shreshth Tuli, Amy Lane, Sam Windsor, Jordan Lane and Layla Hosseini-Gerami","doi":"10.1039/D4DD00257A","DOIUrl":"https://doi.org/10.1039/D4DD00257A","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"303-315"},"PeriodicalIF":6.2,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00257a?page=search","platform":"Semanticscholar","paperid":"143396407","PMCID":"OA"}
He Zhu, Lingyue Hu, Yu Yang and Zhong Chen
Chemical shifts are crucial parameters in protein Nuclear Magnetic Resonance (NMR) experiments. Specifically, the chemical shifts of backbone atoms are essential for determining the constraints in protein structure analysis. Despite their importance, protein NMR experiments are costly and spectral analysis presents challenges due to sample impurities, complex experimental environments, and spectral overlap. Here, we propose a chemical shift prediction method that requires only protein sequences as input. This low-cost chemical shift predictor provides a chemical shift corresponding to each backbone atom, offers valuable prior information for peak assignment, and can significantly aid protein NMR spectrum analysis. Our approach leverages recent advances in pre-trained protein language models (PLMs) and employs a deep learning model to obtain chemical shifts. Different from other chemical shift prediction programs, our method does not require protein structures as input, significantly reducing costs and enhancing robustness. Our method can achieve comparable accuracy to other existing programs that require protein structures as input. In summary, this work introduces a novel method for protein chemical shift prediction and demonstrates the potential of PLMs for diverse applications.
{"title":"A novel approach to protein chemical shift prediction from sequences using a protein language model†","authors":"He Zhu, Lingyue Hu, Yu Yang and Zhong Chen","doi":"10.1039/D4DD00367E","DOIUrl":"https://doi.org/10.1039/D4DD00367E","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"331-337"},"PeriodicalIF":6.2,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00367e?page=search","platform":"Semanticscholar","paperid":"143396409","PMCID":"OA"}
Jay I. Myung, James R. Deneault, Jorge Chang, Inhan Kang, Benji Maruyama and Mark A. Pitt
Autonomous experimentation is a rapidly growing approach to materials science research. Machine learning can assist in improving the efficiency and capability of experimentation with algorithms that adaptively identify optimal design parameters that achieve one or more objectives in iterative, closed-loop fashion. Optimization in additive manufacturing, which can be slow and costly because of its complexity, stands to benefit greatly from such technologies. The present study demonstrates the application of an algorithm (multi-objective Bayesian optimization; MOBO) that optimizes two objectives simultaneously given multiple parameter inputs. The generality and robustness of MOBO are demonstrated in repeated print campaigns of two different test specimens. The results push the boundaries of integrating machine learning with autonomous experimentation for accelerated materials development in additive manufacturing and related areas.
{"title":"Multi-objective Bayesian optimization: a case study in material extrusion","authors":"Jay I. Myung, James R. Deneault, Jorge Chang, Inhan Kang, Benji Maruyama and Mark A. Pitt","doi":"10.1039/D4DD00281D","DOIUrl":"https://doi.org/10.1039/D4DD00281D","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"464-476"},"PeriodicalIF":6.2,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00281d?page=search","platform":"Semanticscholar","paperid":"143396433","PMCID":"OA"}
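Optimizing two objectives simultaneously, as in the MOBO abstract above, generally yields a set of trade-off solutions (a Pareto front) rather than a single optimum. The sketch below shows only that dominance filter, not the Bayesian optimization itself and not the authors' implementation. The trial values are hypothetical, and both objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores for five print settings.
trials = [(0.2, 0.9), (0.5, 0.7), (0.4, 0.6), (0.8, 0.3), (0.7, 0.2)]
front = pareto_front(trials)
print(front)
```

A multi-objective optimizer would propose the next parameter setting so as to push this front outward, for example by maximizing an expected improvement in the area it covers.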
Beck R. Miller, Ryan C. Cammarota and Matthew S. Sigman
Steric molecular descriptors designed for machine learning (ML) applications are critical for connecting structure–function relationships to mechanistic insight. However, many of these descriptors are not suitable for application to complex systems, such as catalyst reactive site pockets. In this context, we recently disclosed a new set of 3D steric molecular descriptors that were originally designed for dirhodium(II) tetra-carboxylate catalysts. Herein, we expand the spatial molding for rigid targets (SMART) descriptor toolkit by releasing SMARTpy, an automated, open-source Python API package for computational workflow integration of SMART descriptors. The impact of the molecular probe's structure on the generated SMART descriptors was analyzed. The resulting SMART descriptors and pocket features were found to be highly dependent upon probe selection and did not scale linearly. Flexible probes with smaller substituents can explore narrow pocket regions, resulting in a higher-resolution pocket imprint. Macrocyclic probes with larger substituents are more applicable to larger cavities with smooth boundaries, such as dirhodium paddlewheel complexes. In these cases, SMARTpy provides comparable descriptors to the original calculation method using UCSF Chimera. Finally, we analyzed a series of case studies demonstrating how SMART descriptors can impact other areas of catalysis, such as organocatalysis, biocatalysis, and protein pocket analysis.
{"title":"SMARTpy: a Python package for the generation of cavity steric molecular descriptors and applications to diverse systems†","authors":"Beck R. Miller, Ryan C. Cammarota and Matthew S. Sigman","doi":"10.1039/D4DD00329B","DOIUrl":"https://doi.org/10.1039/D4DD00329B","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"451-463"},"PeriodicalIF":6.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00329b?page=search","platform":"Semanticscholar","paperid":"143396432","PMCID":"OA"}
Andrij Vasylenko, Dmytro Antypov, Sven Schewe, Luke M. Daniels, John B. Claridge, Matthew S. Dyer and Matthew J. Rosseinsky
Computational modelling of materials using machine learning (ML) and historical data has become integral to materials research across physical sciences. The accuracy of predictions for material properties using computational modelling is strongly affected by the choice of the numerical representation that describes a material's composition, crystal structure and constituent chemical elements. Structure, both extended and local, has a controlling effect on properties, but often only the composition of a candidate material is available. However, existing elemental and compositional descriptors lack direct access to structural insights such as the coordination geometry of an element. In this study, we introduce Local Environment-induced Atomic Features (LEAFs), which incorporate information about the statistically preferred local coordination geometry at an element in a crystal structure into descriptors for chemical elements, enabling the modelling of materials solely as compositions without requiring knowledge of their crystal structure. In the crystal structure of a material, each atomic site can be quantitatively described by similarity to common local structural motifs; by aggregating these unique features of similarity from the experimentally verified crystal structures of inorganic materials, LEAFs formulate a set of descriptors for chemical elements and compositions. The direct connection of LEAFs to the local coordination geometry enables the analysis of ML model property predictions, linking compositions to the underlying structure–property relationships. We demonstrate the versatility of LEAFs in structure-informed property predictions for compositions, mapping of chemical space in structural terms, and prioritisation of elemental substitutions. Based on the latter for predicting crystal structures of binary ionic compounds, LEAFs achieve the state-of-the-art accuracy of 86%. These results suggest that the structurally informed description of chemical elements and compositions developed in this work can effectively guide synthetic efforts in discovering new materials.
{"title":"Digital features of chemical elements extracted from local geometries in crystal structures†","authors":"Andrij Vasylenko, Dmytro Antypov, Sven Schewe, Luke M. Daniels, John B. Claridge, Matthew S. Dyer and Matthew J. Rosseinsky","doi":"10.1039/D4DD00346B","DOIUrl":"https://doi.org/10.1039/D4DD00346B","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"477-485"},"PeriodicalIF":6.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00346b?page=search","platform":"Semanticscholar","paperid":"143396434","PMCID":"OA"}
Madeline A. Murphy, Kyle Noordhoek, Sallye R. Gathmann, Paul J. Dauenhauer and Christopher J. Bartel
Chemical transformations on catalyst surfaces occur through series and parallel reaction pathways. These complex networks and their behavior can be most simply evaluated through a three-species surface reaction loop (A* to B* to C* to A*) that is internal to the overall chemical reaction. Application of an oscillating dynamic catalyst to this reactive loop has been shown to exhibit one of three types of behavior: (1) a positive net flux of molecules about the loop in the clockwise direction, (2) a negative net flux of molecules about the loop in the counterclockwise direction, or (3) negligible flux of molecules about the loop at the limit cycle of reaction. Three-species surface loops were simulated with microkinetic modeling to assess the reaction loop behavior resulting from a catalytic surface oscillating between two or more catalyst surface energy states. Selected input parameters for the simulations spanned an 11-dimensional parameter space using 127 688 different parameter combinations. Their converged limit cycle solutions were analyzed for their loop turnover frequencies, the majority of which were found to be approximately zero. Classification and regression machine learning models were trained to predict the sign and magnitude of the loop turnover frequency and successfully performed above accessible baselines. Notably, the classification models exhibited a baseline weighted F1 score of 0.49, whereas trained models achieved weighted F1 scores of 0.94 and 0.96 when trained on the parameters used to define the simulations and derived rate constants, respectively. The trained models successfully predicted catalytic loop behavior, and interpretation of these models revealed all input parameters to be important for the prediction and performance of each model.
{"title":"Catalytic resonance theory: forecasting the flow of programmable catalytic loops†","authors":"Madeline A. Murphy, Kyle Noordhoek, Sallye R. Gathmann, Paul J. Dauenhauer and Christopher J. Bartel","doi":"10.1039/D4DD00216D","DOIUrl":"https://doi.org/10.1039/D4DD00216D","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"411-423"},"PeriodicalIF":6.2,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00216d?page=search","platform":"Semanticscholar","paperid":"143396430","PMCID":"OA"}
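The weighted F1 score quoted in the catalytic-loop abstract above averages the per-class F1 scores, weighting each class by how often it occurs. This matters when one class (here, near-zero loop turnover) dominates the data. The sketch below computes it by hand for three classes and compares against a majority-class predictor. The labels are hypothetical, and the majority-class baseline is an assumption made for illustration, since the abstract does not say how the authors' 0.49 baseline was defined.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical labels for the sign of the loop turnover frequency.
y_true = ["zero"] * 6 + ["positive"] * 2 + ["negative"] * 2

# Assumed baseline: always predict the most common class.
majority = Counter(y_true).most_common(1)[0][0]
baseline = weighted_f1(y_true, [majority] * len(y_true))
print(f"majority-class weighted F1: {baseline:.2f}")
```

On these toy labels the majority-class predictor scores well on the dominant class and zero on the other two, which is why a trained classifier needs to be judged against such a baseline rather than against raw accuracy.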
Christopher D. Stubbs, Yeonjoon Kim, Ethan C. Quinn, Raúl Pérez-Soto, Eugene Y.-X. Chen and Seonah Kim
Polymer solubility has applications in many important and diverse fields, including microprocessor fabrication, environmental conservation, paint formulation, and drug delivery, but it remains under-explored compared to its relative importance. This can be seen in the relative scarcity of solvent-based systems for recycling plastics, despite a need for efficient and selective methods amid the looming plastics and climate crises. Towards this need for better predictive tools, this work examines the use of classical and deep machine learning (ML) models for predicting categorical solubility in homopolymers and copolymers, with model architectures including random forest (RF), decision tree (DT), naive Bayes, AdaBoost, and graph neural networks (GNNs). We achieve high accuracy for both our homopolymer (82%, RF) and copolymer models (92%, RF) on unseen polymer–solvent systems in our 5-fold cross-validation studies. The relevance and applicability of our homopolymer models are then verified through in-house experiments examining the solubility of common commercial plastics, followed by an explainable AI (XAI) analysis using Shapley Additive Explanations (SHAP), which explores the relative contribution of each feature toward model predictions. We then apply our homopolymer solubility prediction model to remove unwanted or hazardous additives in polyethylene (PE) and polystyrene (PS) waste. This work demonstrates the validity/feasibility of using ML to predict homopolymer solubility, provides novel ML models for the prediction of copolymer solubility, and explains homopolymer model predictions before applying the explained model to a globally relevant waste challenge.
{"title":"Predicting homopolymer and copolymer solubility through machine learning†","authors":"Christopher D. Stubbs, Yeonjoon Kim, Ethan C. Quinn, Raúl Pérez-Soto, Eugene Y.-X. Chen and Seonah Kim","doi":"10.1039/D4DD00290C","DOIUrl":"https://doi.org/10.1039/D4DD00290C","url":null,"abstract":"<p >Polymer solubility has applications in many important and diverse fields, including microprocessor fabrication, environmental conservation, paint formulation, and drug delivery, but it remains under-explored compared to its relative importance. This can be seen in the relative scarcity of solvent-based systems for recycling plastics, despite a need for efficient and selective methods amid the looming plastics and climate crises. Towards this need for better predictive tools, this work examines the use of classical and deep machine learning (ML) models for predicting categorical solubility in homopolymers and copolymers, with model architectures including random forest (RF), decision tree (DT), naive Bayes, AdaBoost, and graph neural networks (GNNs). We achieve high accuracy for both our homopolymer (82%, RF) and copolymer models (92%, RF) on unseen polymer–solvent systems in our 5-fold cross-validation studies. The relevance and applicability of our homopolymer models are then verified through in-house experiments examining the solubility of common commercial plastics, followed by an explainable AI (XAI) analysis using Shapley Additive Explanations (SHAP), which explores the relative contribution of each feature toward model predictions. We then apply our homopolymer solubility prediction model to remove unwanted or hazardous additives in polyethylene (PE) and polystyrene (PS) waste. 
This work demonstrates the validity/feasibility of using ML to predict homopolymer solubility, provides novel ML models for the prediction of copolymer solubility, and explains homopolymer model predictions before applying the explained model to a globally relevant waste challenge.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"424-437"},"PeriodicalIF":6.2,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00290c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yaoyi Su, Siyuan Yang, Yuanhan Liu, Aiting Kai, Linjiang Chen and Ming Liu
Porous organic cages (POCs) are an emerging subclass of porous materials, drawing increasing attention due to their structural tunability, modularity and processability, and research in this area is expanding rapidly. Nevertheless, obtaining sufficient information from the extensive literature on organic molecular cages is time-consuming and labour-intensive. This article presents a GPT-4-based literature reading method that combines multi-label text classification with follow-up information extraction, exploiting GPT-4 to rapidly extract valid information from the literature. In multi-label text classification, the prompt-engineered GPT-4 demonstrated the ability to label text with proper recall rates according to the type of information contained in the text, including authors, affiliations, synthetic procedures, surface area, and the Cambridge Crystallographic Data Centre (CCDC) number of corresponding cages. Additionally, GPT-4 demonstrated proficiency in information extraction, effectively transforming labeled text into concise tabulated data. Furthermore, we built a chatbot based on this database, allowing for quick and comprehensive searching across the entire database and responding to cage-related questions.
{"title":"Knowledge discovery from porous organic cage literature using a large language model†","authors":"Yaoyi Su, Siyuan Yang, Yuanhan Liu, Aiting Kai, Linjiang Chen and Ming Liu","doi":"10.1039/D4DD00337C","DOIUrl":"https://doi.org/10.1039/D4DD00337C","url":null,"abstract":"<p >Porous organic cages (POCs) are an emerging subclass of porous materials, drawing increasing attention due to their structural tunability, modularity and processibility, with the research in this area rapidly expanding. Nevertheless, it is a time-consuming and labour-intensive process to obtain sufficient information from the extensive literature on organic molecular cages. This article presents a GPT-4-based literature reading method that incorporates multi-label text classification and a follow-up information extraction, in which the potential of GPT-4 can be fully exploited to rapidly extract valid information from the literature. In the process of multi-label text classification, the prompt-engineered GPT-4 demonstrated the ability to label text with proper recall rates according to the type of information contained in the text, including authors, affiliations, synthetic procedures, surface area, and the Cambridge Crystallographic Data Centre (CCDC) number of corresponding cages. Additionally, GPT-4 demonstrated proficiency in information extraction, effectively transforming labeled text into concise tabulated data. 
Furthermore, we built a chatbot based on this database, allowing for quick and comprehensive searching across the entire database and responding to cage-related questions.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":"2","pages":"403-410"},"PeriodicalIF":6.2,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00337c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}