Digital discovery: latest articles

Active learning driven prioritisation of compounds from on-demand libraries targeting the SARS-CoV-2 main protease†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-08 DOI: 10.1039/D4DD00343H

Ben Cree, Mateusz K. Bieniek, Siddique Amin, Akane Kawamura and Daniel J. Cole

FEgrow is an open-source software package for building congeneric series of compounds in protein binding pockets. For a given ligand core and receptor structure, it employs hybrid machine learning/molecular mechanics potential energy functions to optimise the bioactive conformers of supplied linkers and functional groups. Here, we introduce significant new functionality to automate, parallelise and accelerate the building and scoring of compound suggestions, such that it can be used for automated de novo design. We interface the workflow with active learning to improve the efficiency of searching the combinatorial space of possible linkers and functional groups, make use of interactions formed by crystallographic fragments in scoring compound designs, and introduce the option to seed the chemical space with molecules available from on-demand chemical libraries. As a test case, we target the main protease (Mpro) of SARS-CoV-2, identifying several small molecules with high similarity to molecules discovered by the COVID moonshot effort, using only structural information from a fragment screen in a fully automated fashion. Finally, we order and test 19 compound designs, of which three show weak activity in a fluorescence-based Mpro assay, but work is needed to further optimise the prioritisation of compounds for purchase. The FEgrow package and full tutorials demonstrating the active learning workflow are available at https://github.com/cole-group/FEgrow.
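
The active learning idea in this abstract can be illustrated with a minimal, generic sketch. It is not the FEgrow API: `expensive_score`, the one-dimensional candidate feature, the k-nearest-neighbour surrogate and the greedy selection rule are all assumptions made for illustration. The loop scores a small batch, refits a cheap surrogate on everything scored so far, and uses the surrogate to choose the next batch.

```python
import random


def expensive_score(x):
    # Hypothetical stand-in for a costly build-and-score step; lower is better.
    return (x - 0.7) ** 2


def predict(x, labelled, k=3):
    # k-nearest-neighbour surrogate: mean score of the k closest scored candidates.
    nearest = sorted(labelled, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning(pool, n_rounds=5, batch_size=5, seed=0):
    rng = random.Random(seed)
    unlabelled = list(pool)
    rng.shuffle(unlabelled)
    # Seed the surrogate with one randomly chosen batch.
    batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
    labelled = [(x, expensive_score(x)) for x in batch]
    for _ in range(n_rounds):
        if not unlabelled:
            break
        # Greedy acquisition: score the candidates the surrogate ranks best.
        unlabelled.sort(key=lambda x: predict(x, labelled))
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        labelled.extend((x, expensive_score(x)) for x in batch)
    return labelled


pool = [i / 200 for i in range(201)]  # hypothetical candidate features
labelled = active_learning(pool)
best_x, best_score = min(labelled, key=lambda item: item[1])
```

Only 30 of the 201 candidates are scored here. A purely greedy rule can get stuck near the first good region it finds, so practical workflows often mix in some exploration.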

Volume: 2 Pages: 438-450 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11726688/pdf/

Citations: 0

ULaMDyn: enhancing excited-state dynamics analysis through streamlined unsupervised learning.

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-08 DOI: 10.1039/d4dd00374h

Max Pinheiro, Matheus de Oliveira Bispo, Rafael S Mattos, Mariana Telles do Casal, Bidhan Chandra Garain, Josene M Toldo, Saikat Mukherjee, Mario Barbatti

The analysis of nonadiabatic molecular dynamics (NAMD) data presents significant challenges due to its high dimensionality and complexity. To address these issues, we introduce ULaMDyn, a Python-based, open-source package designed to automate the unsupervised analysis of large datasets generated by NAMD simulations. ULaMDyn integrates seamlessly with the Newton-X platform and employs advanced dimensionality reduction and clustering techniques to uncover hidden patterns in molecular trajectories, enabling a more intuitive understanding of excited-state processes. Using the photochemical dynamics of fulvene as a test case, we demonstrate how ULaMDyn efficiently identifies critical molecular geometries and critical nonadiabatic transitions. The package offers a streamlined, scalable solution for interpreting large NAMD datasets. It is poised to facilitate advances in the study of excited-state dynamics across a wide range of molecular systems.
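
As a rough illustration of the clustering step described above, the following sketch groups trajectory frames with a plain k-means routine. It does not use ULaMDyn or Newton-X; the two descriptors per frame (a bond length and a dihedral angle) and their values are hypothetical, and the farthest-point initialisation is simply one deterministic choice.

```python
import math


def kmeans(points, k, n_iter=50):
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Hypothetical frames: (bond length in angstrom, dihedral angle in degrees).
frames = [(1.35, 0.0), (1.36, 5.0), (1.34, -4.0), (1.48, 88.0), (1.47, 92.0), (1.49, 90.0)]
labels, centres = kmeans(frames, 2)
```

The raw descriptors differ in scale, so the dihedral dominates the distance here. A real analysis would normally standardise the features or reduce their dimensionality before clustering.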

Citations: 0

Advancing predictive toxicology: overcoming hurdles and shaping the future

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-06 DOI: 10.1039/D4DD00257A

Sara Masarone, Katie V. Beckwith, Matthew R. Wilkinson, Shreshth Tuli, Amy Lane, Sam Windsor, Jordan Lane and Layla Hosseini-Gerami

Modern drug discovery projects are plagued with high failure rates, many of which have safety as the underlying cause. The drug discovery process involves selecting the right compounds from a pool of possible candidates to satisfy some pre-set requirements. As this process is costly and time-consuming, finding toxicities at later stages can result in project failure. In this context, the use of existing data from previous projects can help develop computational models (e.g. QSARs) and algorithms to speed up the identification of compound toxicity. While clinical and in vivo data continue to be fundamental, data originating from organ-on-a-chip models, cell lines and previous studies can accelerate the drug discovery process, allowing for faster identification of toxicities and thus saving time and resources.

Citations: 0

A novel approach to protein chemical shift prediction from sequences using a protein language model†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-06 DOI: 10.1039/D4DD00367E

He Zhu, Lingyue Hu, Yu Yang and Zhong Chen

Chemical shifts are crucial parameters in protein Nuclear Magnetic Resonance (NMR) experiments. Specifically, the chemical shifts of backbone atoms are essential for determining the constraints in protein structure analysis. Despite their importance, protein NMR experiments are costly and spectral analysis presents challenges due to sample impurities, complex experimental environments, and spectral overlap. Here, we propose a chemical shift prediction method that requires only protein sequences as input. This low-cost chemical shift predictor provides a chemical shift corresponding to each backbone atom, offers valuable prior information for peak assignment, and can significantly aid protein NMR spectrum analysis. Our approach leverages recent advances in pre-trained protein language models (PLMs) and employs a deep learning model to obtain chemical shifts. Different from other chemical shift prediction programs, our method does not require protein structures as input, significantly reducing costs and enhancing robustness. Our method can achieve comparable accuracy to other existing programs that require protein structures as input. In summary, this work introduces a novel method for protein chemical shift prediction and demonstrates the potential of PLMs for diverse applications.

Volume: 2 Pages: 331-337 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00367e?page=search

Citations: 0

Multi-objective Bayesian optimization: a case study in material extrusion

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-06 DOI: 10.1039/D4DD00281D

Jay I. Myung, James R. Deneault, Jorge Chang, Inhan Kang, Benji Maruyama and Mark A. Pitt

Autonomous experimentation is a rapidly growing approach to materials science research. Machine learning can assist in improving the efficiency and capability of experimentation with algorithms that adaptively identify optimal design parameters that achieve one or more objectives in iterative, closed-loop fashion. Optimization in additive manufacturing, which can be slow and costly because of its complexity, stands to benefit greatly from such technologies. The present study demonstrates the application of an algorithm (multi-objective Bayesian optimization; MOBO) that optimizes two objectives simultaneously given multiple parameter inputs. The generality and robustness of MOBO are demonstrated in repeated print campaigns of two different test specimens. The results push the boundaries of integrating machine learning with autonomous experimentation for accelerated materials development in additive manufacturing and related areas.
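
Optimising two objectives at once generally yields a set of trade-off solutions rather than a single optimum. The sketch below shows only that underlying notion, Pareto dominance, and is not the MOBO algorithm from the study. The objective values are hypothetical and both objectives are assumed to be maximised.

```python
def dominates(a, b):
    # a dominates b if it is at least as good in every objective
    # and strictly better in at least one (maximisation assumed).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective 1, objective 2) scores for five print settings.
prints = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8), (0.6, 0.5), (0.3, 0.3)]
front = pareto_front(prints)
```

Here `front` keeps the first three settings, because each of the last two is beaten on both objectives by `(0.7, 0.6)`.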

Citations: 0

SMARTpy: a Python package for the generation of cavity steric molecular descriptors and applications to diverse systems†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-03 DOI: 10.1039/D4DD00329B

Beck R. Miller, Ryan C. Cammarota and Matthew S. Sigman

Steric molecular descriptors designed for machine learning (ML) applications are critical for connecting structure–function relationships to mechanistic insight. However, many of these descriptors are not suitable for application to complex systems, such as catalyst reactive site pockets. In this context, we recently disclosed a new set of 3D steric molecular descriptors that were originally designed for dirhodium(II) tetra-carboxylate catalysts. Herein, we expand the spatial molding for rigid targets (SMART) descriptor toolkit by releasing SMARTpy, an automated, open-source Python API package for computational workflow integration of SMART descriptors. The impact of the structure of the molecular probe on the generation of SMART descriptors was analyzed. Resultant SMART descriptors and pocket features were found to be highly dependent upon probe selection, and do not scale linearly. Flexible probes with smaller substituents can explore narrow pocket regions, resulting in a higher resolution pocket imprint. Macrocyclic probes with larger substituents are more applicable to larger cavities with smooth boundaries, such as dirhodium paddlewheel complexes. In these cases, SMARTpy provides comparable descriptors to the original calculation method using UCSF Chimera. Finally, we analyzed a series of case studies demonstrating how SMART descriptors can impact other areas of catalysis, such as organocatalysis, biocatalysis, and protein pocket analysis.

Volume: 2 Pages: 451-463 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00329b?page=search

Citations: 0

Digital features of chemical elements extracted from local geometries in crystal structures†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2025-01-03 DOI: 10.1039/D4DD00346B

Andrij Vasylenko, Dmytro Antypov, Sven Schewe, Luke M. Daniels, John B. Claridge, Matthew S. Dyer and Matthew J. Rosseinsky

Computational modelling of materials using machine learning (ML) and historical data has become integral to materials research across physical sciences. The accuracy of predictions for material properties using computational modelling is strongly affected by the choice of the numerical representation that describes a material's composition, crystal structure and constituent chemical elements. Structure, both extended and local, has a controlling effect on properties, but often only the composition of a candidate material is available. However, existing elemental and compositional descriptors lack direct access to structural insights such as the coordination geometry of an element. In this study, we introduce Local Environment-induced Atomic Features (LEAFs), which incorporate information about the statistically preferred local coordination geometry at an element in a crystal structure into descriptors for chemical elements, enabling the modelling of materials solely as compositions without requiring knowledge of their crystal structure. In the crystal structure of a material, each atomic site can be quantitatively described by similarity to common local structural motifs; by aggregating these unique features of similarity from the experimentally verified crystal structures of inorganic materials, LEAFs formulate a set of descriptors for chemical elements and compositions. The direct connection of LEAFs to the local coordination geometry enables the analysis of ML model property predictions, linking compositions to the underlying structure–property relationships. We demonstrate the versatility of LEAFs in structure-informed property predictions for compositions, mapping of chemical space in structural terms, and prioritisation of elemental substitutions. Based on the latter for predicting crystal structures of binary ionic compounds, LEAFs achieve the state-of-the-art accuracy of 86%. 
These results suggest that the structurally informed description of chemical elements and compositions developed in this work can effectively guide synthetic efforts in discovering new materials.
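
The aggregation idea described in this abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' procedure: the motif set, the per-site similarity values, the use of a plain mean per element and the fraction-weighted average for a composition are all hypothetical choices.

```python
from collections import defaultdict

MOTIFS = ("tetrahedral", "octahedral")  # hypothetical motif set

# Hypothetical per-site data: (element, similarity to each motif in MOTIFS).
sites = [
    ("Ti", (0.1, 0.9)),
    ("Ti", (0.2, 0.8)),
    ("Zn", (0.9, 0.1)),
    ("Zn", (0.7, 0.3)),
    ("O", (0.5, 0.5)),
]


def element_features(sites):
    # Average the site-level similarity vectors separately for each element.
    grouped = defaultdict(list)
    for element, sims in sites:
        grouped[element].append(sims)
    return {
        el: tuple(sum(col) / len(rows) for col in zip(*rows))
        for el, rows in grouped.items()
    }


def composition_features(composition, features):
    # Describe a composition as the atomic-fraction-weighted mean of its element vectors.
    total = sum(composition.values())
    n = len(MOTIFS)
    return tuple(
        sum(amount / total * features[el][i] for el, amount in composition.items())
        for i in range(n)
    )


features = element_features(sites)
tio2 = composition_features({"Ti": 1, "O": 2}, features)
```

With these made-up numbers, Ti averages to roughly (0.15, 0.85), and the TiO2 vector is pulled toward the oxygen values because oxygen makes up two thirds of the atoms.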

Volume: 2 Pages: 477-485 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00346b?page=search

Citations: 0

Catalytic resonance theory: forecasting the flow of programmable catalytic loops†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2024-12-30 DOI: 10.1039/D4DD00216D

Madeline A. Murphy, Kyle Noordhoek, Sallye R. Gathmann, Paul J. Dauenhauer and Christopher J. Bartel

Chemical transformations on catalyst surfaces occur through series and parallel reaction pathways. These complex networks and their behavior can be most simply evaluated through a three-species surface reaction loop (A* to B* to C* to A*) that is internal to the overall chemical reaction. Application of an oscillating dynamic catalyst to this reactive loop has been shown to exhibit one of three types of behavior: (1) a positive net flux of molecules about the loop in the clockwise direction, (2) a negative net flux of molecules about the loop in the counterclockwise direction, or (3) negligible flux of molecules about the loop at the limit cycle of reaction. Three-species surface loops were simulated with microkinetic modeling to assess the reaction loop behavior resulting from a catalytic surface oscillating between two or more catalyst surface energy states. Selected input parameters for the simulations spanned an 11-dimensional parameter space using 127 688 different parameter combinations. Their converged limit cycle solutions were analyzed for their loop turnover frequencies, the majority of which were found to be approximately zero. Classification and regression machine learning models were trained to predict the sign and magnitude of the loop turnover frequency and successfully performed above accessible baselines. Notably, the classification models exhibited a baseline weighted F₁ score of 0.49, whereas trained models achieved weighted F₁ scores of 0.94 and 0.96 when trained on the parameters used to define the simulations and derived rate constants, respectively. The trained models successfully predicted catalytic loop behavior, and interpretation of these models revealed all input parameters to be important for the prediction and performance of each model.
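
Because most simulated loops have a turnover frequency near zero, the classes are imbalanced, which is why a support-weighted F₁ score is a natural metric. The sketch below computes that metric from scratch. The labels and the majority-class baseline are hypothetical and are not claimed to be the baseline used in the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights equal to each class's share of y_true.
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical labels for the sign of the loop turnover frequency.
y_true = ["zero"] * 6 + ["pos"] * 2 + ["neg"] * 2
majority_baseline = ["zero"] * len(y_true)

baseline_score = weighted_f1(y_true, majority_baseline)
perfect_score = weighted_f1(y_true, y_true)
```

In this toy case the baseline reaches 0.45 even though it never predicts a non-zero flux, which shows why a trained model has to be judged against such a reference rather than against zero.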

Volume: 2 Pages: 411-423 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00216d?page=search

Citations: 0

Predicting homopolymer and copolymer solubility through machine learning†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2024-12-24 DOI: 10.1039/D4DD00290C

Christopher D. Stubbs, Yeonjoon Kim, Ethan C. Quinn, Raúl Pérez-Soto, Eugene Y.-X. Chen and Seonah Kim

Polymer solubility has applications in many important and diverse fields, including microprocessor fabrication, environmental conservation, paint formulation, and drug delivery, but it remains under-explored compared to its relative importance. This can be seen in the relative scarcity of solvent-based systems for recycling plastics, despite a need for efficient and selective methods amid the looming plastics and climate crises. Towards this need for better predictive tools, this work examines the use of classical and deep machine learning (ML) models for predicting categorical solubility in homopolymers and copolymers, with model architectures including random forest (RF), decision tree (DT), naive Bayes, AdaBoost, and graph neural networks (GNNs). We achieve high accuracy for both our homopolymer (82%, RF) and copolymer models (92%, RF) on unseen polymer–solvent systems in our 5-fold cross-validation studies. The relevance and applicability of our homopolymer models are then verified through in-house experiments examining the solubility of common commercial plastics, followed by an explainable AI (XAI) analysis using Shapley Additive Explanations (SHAP), which explores the relative contribution of each feature toward model predictions. We then apply our homopolymer solubility prediction model to remove unwanted or hazardous additives in polyethylene (PE) and polystyrene (PS) waste. This work demonstrates the validity/feasibility of using ML to predict homopolymer solubility, provides novel ML models for the prediction of copolymer solubility, and explains homopolymer model predictions before applying the explained model to a globally relevant waste challenge.
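
The 5-fold cross-validation mentioned above can be sketched as a simple index split. This is a generic illustration: how the authors grouped polymer–solvent pairs so that test systems were truly unseen is not stated in the abstract, and the sample count and seed below are arbitrary.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    # Shuffle once, deal indices into k folds, then hold out each fold in turn.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(23))
```

Each sample lands in exactly one test fold, so averaging the per-fold accuracy uses every data point for testing once. If several rows describe the same polymer–solvent system, the split should be made over systems rather than rows to avoid leakage.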

Volume: 2 Pages: 424-437 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00290c?page=search

Citations: 0

Knowledge discovery from porous organic cage literature using a large language model†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY

Digital discovery

Pub Date : 2024-12-19 DOI: 10.1039/D4DD00337C

Yaoyi Su, Siyuan Yang, Yuanhan Liu, Aiting Kai, Linjiang Chen and Ming Liu

Porous organic cages (POCs) are an emerging subclass of porous materials, drawing increasing attention due to their structural tunability, modularity and processability, with the research in this area rapidly expanding. Nevertheless, it is a time-consuming and labour-intensive process to obtain sufficient information from the extensive literature on organic molecular cages. This article presents a GPT-4-based literature reading method that incorporates multi-label text classification and follow-up information extraction, in which the potential of GPT-4 can be fully exploited to rapidly extract valid information from the literature. In the process of multi-label text classification, the prompt-engineered GPT-4 demonstrated the ability to label text with proper recall rates according to the type of information contained in the text, including authors, affiliations, synthetic procedures, surface area, and the Cambridge Crystallographic Data Centre (CCDC) number of corresponding cages. Additionally, GPT-4 demonstrated proficiency in information extraction, effectively transforming labeled text into concise tabulated data. Furthermore, we built a chatbot based on this database, allowing for quick and comprehensive searching across the entire database and responding to cage-related questions.
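
Recall for a multi-label classifier is measured separately for each label: of the text passages that truly carry a label, what fraction did the model tag with it? The sketch below makes that concrete. The label names and the four example passages are hypothetical, and no GPT-4 call is involved.

```python
def per_label_recall(true_sets, predicted_sets):
    # For each label, count passages that truly have it and how many were tagged with it.
    labels = set().union(*true_sets)
    recall = {}
    for label in sorted(labels):
        relevant = sum(label in t for t in true_sets)
        hits = sum(label in t and label in p for t, p in zip(true_sets, predicted_sets))
        recall[label] = hits / relevant
    return recall


# Hypothetical ground-truth and predicted label sets for four passages.
true_sets = [
    {"synthesis", "surface_area"},
    {"authors"},
    {"synthesis"},
    {"ccdc_number", "surface_area"},
]
predicted_sets = [
    {"synthesis"},
    {"authors", "affiliations"},
    {"synthesis", "surface_area"},
    {"ccdc_number", "surface_area"},
]

recall = per_label_recall(true_sets, predicted_sets)
```

Extra predicted labels, such as `affiliations` on the second passage, do not lower recall. They would lower precision, which is why the two are usually reported together.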

Digital discovery, volume 2, pages 403-410. PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00337c?page=search

Citations: 0
