
Digital Discovery: latest articles

Visualizing high entropy alloy spaces: methods and best practices†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-12-04 DOI: 10.1039/D4DD00262H
Brent Vela, Trevor Hastings, Marshall Allen and Raymundo Arróyave

Multi-Principal Element Alloys (MPEAs) have emerged as an exciting area of research in materials science in the 2020s, owing to the vast potential for discovering alloys with unique and tailored properties enabled by the combinations of elements. However, the chemical complexity of MPEAs poses a significant challenge in visualizing composition–property relationships in high-dimensional design spaces. Without effective visualization techniques, designing chemically complex alloys is practically impossible. In this methods article, we present a suite of visualization techniques that allow for meaningful and insightful visualizations of MPEA composition spaces and property spaces. Our contribution to this suite is a set of projections of entire alloy spaces for the purposes of design. We deploy this suite of visualization techniques on the following MPEA case studies: (1) a constraint-satisfaction alloy design scheme, (2) Bayesian optimization alloy design campaigns, and (3) various other scenarios in the ESI. Furthermore, we show how this method can be applied to any barycentric design space. While there is no one-size-fits-all visualization technique, our toolbox offers a range of methods and best practices that can be tailored to specific MPEA research needs. This article is intended for materials scientists interested in performing research on multi-principal element alloys, chemically complex alloys, or high entropy alloys and is expected to facilitate the discovery of novel and tailored properties in MPEAs.
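To make the idea of projecting a barycentric design space concrete, here is a minimal sketch of one generic approach: each element is placed at a vertex of a regular polygon and a composition is mapped to the fraction-weighted average of those vertices. This is an illustration under stated assumptions, not the authors' implementation; the function names `polygon_vertices` and `project_composition` are hypothetical.

```python
import math


def polygon_vertices(n):
    """Vertices of a regular n-gon on the unit circle, first vertex at the top."""
    return [
        (math.cos(math.pi / 2 + 2 * math.pi * k / n),
         math.sin(math.pi / 2 + 2 * math.pi * k / n))
        for k in range(n)
    ]


def project_composition(fractions):
    """Map a composition (atomic fractions summing to 1) to a 2D point.

    The point is the barycentric combination of the polygon vertices,
    so a pure element lands on its own vertex.
    """
    if any(f < 0 for f in fractions) or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be non-negative and sum to 1")
    verts = polygon_vertices(len(fractions))
    x = sum(f * vx for f, (vx, _) in zip(fractions, verts))
    y = sum(f * vy for f, (_, vy) in zip(fractions, verts))
    return x, y


# Hypothetical five-element system: the equimolar alloy maps to the centre.
centre = project_composition([0.2, 0.2, 0.2, 0.2, 0.2])
corner = project_composition([1.0, 0.0, 0.0, 0.0, 0.0])
```

For more than three elements this linear projection is many-to-one: distinct compositions can land on the same 2D point, which is one reason no single view suits every design task.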

Citations: 0
Scientific exploration with expert knowledge (SEEK) in autonomous scanning probe microscopy with active learning†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-12-04 DOI: 10.1039/D4DD00277F
Utkarsh Pratiush, Hiroshi Funakubo, Rama Vasudevan, Sergei V. Kalinin and Yongtao Liu

Microscopy plays a foundational role in materials science, biology, and nanotechnology, offering high-resolution imaging and detailed insights into properties at the nanoscale and atomic level. Microscopy automation via active machine learning approaches is a transformative advancement, offering increased efficiency, reproducibility, and the capability to perform complex experiments. Our previous work on autonomous experimentation with scanning probe microscopy (SPM) demonstrated an active learning framework using deep kernel learning (DKL) for structure–property relationship discovery. Here we extend this approach to a multi-stage decision process that incorporates prior knowledge and human interest into DKL-based workflows, and we operationalize these workflows in SPM. By integrating expected rewards from structure libraries or spectroscopic features, we enhanced the exploration efficiency of autonomous microscopy, making exploration more targeted. These methods can be seamlessly applied to other microscopy and imaging techniques. Furthermore, the concept can be adapted for general Bayesian optimization in material discovery across a broad range of autonomous experimental fields.

Citations: 0
A materials discovery framework based on conditional generative models applied to the design of polymer electrolytes†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-12-04 DOI: 10.1039/D4DD00293H
Arash Khajeh, Xiangyun Lei, Weike Ye, Zhenze Yang, Linda Hung, Daniel Schweigert and Ha-Kyung Kwon

In this work, we introduce a computational polymer discovery framework that efficiently designs polymers with tailored properties. The framework comprises three core components—a conditioned generative model, a computational evaluation module, and a feedback mechanism—all integrated into an iterative framework for material innovation. To demonstrate the efficacy of this framework, we used it to design polymer electrolyte materials with high ionic conductivity. A conditional generative model based on the minGPT architecture can generate candidate polymers that exhibit a mean ionic conductivity that is greater than that of the original training set. This approach, coupled with molecular dynamics (MD) simulations for testing and a specifically planned acquisition mechanism, allows the framework to refine its output iteratively. Notably, we observe an increase in both the mean and the lower bound of the ionic conductivity of the new polymer candidates. The framework's effectiveness is underscored by its identification of 14 distinct polymer repeating units that display a computed ionic conductivity surpassing that of polyethylene oxide (PEO).

Citations: 0
Data efficiency of classification strategies for chemical and materials design†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-12-03 DOI: 10.1039/D4DD00298A
Quinn M. Gallagher and Michael A. Webb

Active learning and design–build–test–learn strategies are increasingly employed to accelerate materials discovery and characterization. Many data-driven materials design campaigns require that materials are synthesizable, stable, soluble, recyclable, or non-toxic. Resources are wasted when materials are recommended that do not satisfy these constraints. Acquiring this knowledge during the design campaign is inefficient, and many materials constraints transcend specific design objectives. However, there is no consensus on the most data-efficient algorithm for classifying whether a material satisfies a constraint. To address this gap, we comprehensively compare the performance of 100 strategies for classifying chemical and materials behavior. Performance is assessed across 31 classification tasks sourced from the literature in chemical and materials science. From these results, we recommend best practices for building data-efficient classifiers, showing that neural network- and random forest-based active learning algorithms are most efficient across tasks. We also show that classification task complexity can be quantified by task metafeatures, most notably the noise-to-signal ratio. These metafeatures are then used to rationalize the data efficiency of different molecular representations and the impact of domain size on task complexity. Overall, this work provides a comprehensive survey of data-efficient classification strategies, identifies attributes of top-performing strategies, and suggests avenues for further study.
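As a rough illustration of what an active learning classification strategy looks like, the sketch below runs uncertainty sampling on a one-dimensional toy constraint. Everything here is an assumption made for illustration: the k-nearest-neighbour vote stands in for the neural network and random forest models named in the abstract, and the pool, oracle, and function names are hypothetical rather than taken from the paper.

```python
def knn_probability(x, labelled, k=3):
    """Fraction of the k nearest labelled points that satisfy the constraint."""
    nearest = sorted(labelled, key=lambda item: abs(item[0] - x))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def uncertainty_sampling(pool, oracle, seed_points, n_queries, k=3):
    """Label the candidates whose predicted probability is closest to 0.5."""
    labelled = [(x, oracle(x)) for x in seed_points]
    unlabelled = [x for x in pool if x not in seed_points]
    for _ in range(n_queries):
        if not unlabelled:
            break
        # Acquisition step: pick the most ambiguous remaining candidate.
        query = min(
            unlabelled,
            key=lambda x: abs(knn_probability(x, labelled, k) - 0.5),
        )
        unlabelled.remove(query)
        labelled.append((query, oracle(query)))  # the "experiment"
    return labelled


# Hypothetical constraint: a candidate is acceptable when x > 0.6.
pool = [i / 20 for i in range(21)]
labelled = uncertainty_sampling(pool, lambda x: int(x > 0.6), [0.0, 1.0], n_queries=6)
```

The point of the design is that each query is spent where the current model is least sure, rather than on candidates whose class is already clear from nearby labels.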

Citations: 0
Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning.
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-12-02 DOI: 10.1039/d4dd00170b
Joseph D Clark, Xuenan Mi, Douglas A Mitchell, Diwakar Shukla

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
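For readers unfamiliar with masked language modeling, the following sketch shows only the data-preparation step: hiding a random subset of residues so that a model can be trained to recover them. The 15% masking rate follows common BERT-style practice and is an assumption, as are the `<mask>` token, the function name `mask_peptide`, and the placeholder sequence, which is not a real LazBF or LazDEF substrate.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_peptide(sequence, mask_fraction=0.15, seed=0):
    """Return (masked tokens, {position: original residue}) for one peptide."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    tokens = list(sequence)
    # Always mask at least one residue so every example carries a target.
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


# Placeholder 20-residue sequence, used only to exercise the function.
tokens, targets = mask_peptide("ACDEFGHIKLMNPQRSTVWY")
```

A model trained on such pairs predicts the residues in `targets` from the unmasked context; its internal embeddings are what a downstream substrate classifier would reuse.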

Citations: 0
Rapid prediction of conformationally-dependent DFT-level descriptors using graph neural networks for carboxylic acids and alkyl amines†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-28 DOI: 10.1039/D4DD00284A
Brittany C. Haas, Melissa A. Hardy, Shree Sowndarya S. V., Keir Adams, Connor W. Coley, Robert S. Paton and Matthew S. Sigman

Data-driven reaction discovery and development is a growing field that relies on the use of molecular descriptors to capture key information about substrates, ligands, and targets. Broad adaptation of this strategy is hindered by the associated computational cost of descriptor calculation, especially when considering conformational flexibility. Descriptor libraries can be precomputed agnostic of application to reduce the computational burden of data-driven reaction development. However, as one often applies these models to evaluate novel hypothetical structures, it would be ideal to predict the descriptors of compounds on-the-fly. Herein, we report DFT-level descriptor libraries for conformational ensembles of 8528 carboxylic acids and 8172 alkyl amines towards this goal. Employing 2D and 3D graph neural network architectures trained on these libraries culminated in the development of predictive models for molecule-level descriptors, as well as the bond- and atom-level descriptors for the conserved reactive site (carboxylic acid or amine). The predictions were confirmed to be robust for an external validation set of medicinally-relevant carboxylic acids and alkyl amines. Additionally, a retrospective study correlating the rate of amide coupling reactions demonstrated the suitability of the predicted DFT-level descriptors for downstream applications. Ultimately, these models enable high-fidelity predictions for a vast number of potential substrates, greatly increasing accessibility to the field of data-driven reaction development.

Citations: 0
PolyCL: contrastive learning for polymer representation learning via explicit and implicit augmentations†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-28 DOI: 10.1039/D4DD00236A
Jiajun Zhou, Yijie Yang, Austin M. Mroz and Kim E. Jelfs

Polymers play a crucial role in a wide array of applications due to their diverse and tunable properties. Establishing the relationship between polymer representations and their properties is crucial to the computational design and screening of potential polymers via machine learning. The quality of the representation significantly influences the effectiveness of these computational methods. Here, we present a self-supervised contrastive learning paradigm, PolyCL, for learning robust and high-quality polymer representation without the need for labels. Our model combines explicit and implicit augmentation strategies for improved learning performance. The results demonstrate that our model achieves either better, or highly competitive, performances on transfer learning tasks as a feature extractor without an overcomplicated training strategy or hyperparameter optimisation. Further enhancing the efficacy of our model, we conducted extensive analyses on various augmentation combinations used in contrastive learning. This led to identifying the most effective combination to maximise PolyCL's performance.
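The abstract does not state which contrastive objective PolyCL uses. As a hedged illustration of the general idea, the sketch below implements the widely used NT-Xent loss, in which two augmented views of the same item are pulled together and all other items in the batch are pushed apart. The function names, the temperature value, and the toy embeddings are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(view_a, view_b, temperature=0.1):
    """Mean NT-Xent loss; view_a[i] and view_b[i] are two views of item i."""
    embeddings = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same item
        logits = [
            cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n) if j != i
        ]
        pos_logit = cosine(embeddings[i], embeddings[positive]) / temperature
        # Negative log-softmax of the positive pair over all other samples.
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)


# Toy 2D embeddings: matched views should give a lower loss than swapped ones.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = nt_xent(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = nt_xent(anchors, [[0.1, 0.9], [0.9, 0.1]])
```

In a real pipeline the embeddings would come from an encoder applied to augmented polymer representations, and the loss would be minimized with respect to the encoder's weights.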

Citations: 0
A graph neural network-state predictive information bottleneck (GNN-SPIB) approach for learning molecular thermodynamics and kinetics†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-28 DOI: 10.1039/D4DD00315B
Ziyue Zou, Dedi Wang and Pratyush Tiwary

Molecular dynamics simulations offer detailed insights into atomic motions but face timescale limitations. Enhanced sampling methods have addressed these challenges but even with machine learning, they often rely on pre-selected expert-based features. In this work, we present a Graph Neural Network-State Predictive Information Bottleneck (GNN-SPIB) framework, which combines graph neural networks and the state predictive information bottleneck to automatically learn low-dimensional representations directly from atomic coordinates. Tested on three benchmark systems, our approach predicts essential structural, thermodynamic and kinetic information for slow processes, demonstrating robustness across diverse systems. The method shows promise for complex systems, enabling effective enhanced sampling without requiring pre-defined reaction coordinates or input features.

分子动力学模拟提供了对原子运动的详细见解,但面临时间尺度的限制。增强的采样方法已经解决了这些挑战,但即使使用机器学习,它们通常也依赖于预先选择的基于专家的特征。在这项工作中,我们提出了一个图神经网络-状态预测信息瓶颈(GNN-SPIB)框架,该框架将图神经网络和状态预测信息瓶颈相结合,直接从原子坐标中自动学习低维表示。在三个基准系统上进行了测试,我们的方法预测了缓慢过程的基本结构、热力学和动力学信息,证明了不同系统的鲁棒性。该方法显示出对复杂系统的承诺,在不需要预定义的反应坐标或输入特征的情况下实现有效的增强采样。
Pages: 211-221 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00315b?page=search
Citations: 0
CopDDB: a descriptor database for copolymers and its applications to machine learning†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-28 DOI: 10.1039/D4DD00266K
Takayoshi Yoshimura, Hiromoto Kato, Shunto Oikawa, Taichi Inagaki, Shigehito Asano, Tetsunori Sugawara, Tomoyuki Miyao, Takamitsu Matsubara, Hiroharu Ajiro, Mikiya Fujii, Yu-ya Ohnishi and Miho Hatanaka

Polymer informatics, which involves applying data-driven science to polymers, has attracted considerable research interest. However, developing adequate descriptors for polymers, particularly copolymers, to facilitate machine learning (ML) models with limited datasets remains a challenge. To address this issue, we computed sets of parameters, including reaction energies and activation barriers of elementary reactions in the early stage of radical polymerization, for 2500 radical–monomer pairs derived from 50 commercially available monomers and constructed an open database named “Copolymer Descriptor Database”. Furthermore, we built ML models using our descriptors as explanatory variables and physical properties such as the reactivity ratio, monomer conversion, monomer composition ratio, and molecular weight as objective variables. These models achieved high predictive accuracy, demonstrating the potential of our descriptors to advance the field of polymer informatics.
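
As a rough illustration of the pair count, the sketch below assumes that each of the 50 monomers supplies one radical and that every radical is paired with every monomer, self-pairs included, which gives 50 × 50 = 2500 ordered pairs. That reading is an assumption consistent with the numbers in the abstract, not a description of the database's actual layout, and the monomer labels are placeholders.

```python
from itertools import product

# Placeholder labels; the real database covers 50 commercially available monomers.
monomers = [f"M{i:02d}" for i in range(1, 51)]

# Assumption: one radical per monomer, paired with every monomer (self-pairs included).
radical_monomer_pairs = [
    (f"{radical_source}_radical", monomer)
    for radical_source, monomer in product(monomers, repeat=2)
]

print(len(radical_monomer_pairs))  # 2500
```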

Pages: 195-203 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00266k?page=search
Citations: 0
Machine learning for accelerated prediction of lattice thermal conductivity at arbitrary temperature
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-11-27 DOI: 10.1039/D4DD00286E
Zihe Li, Mengke Li, Yufeng Luo, Haibin Cao, Huijun Liu and Ying Fang

Efficient evaluation of lattice thermal conductivity (κL) is critical for applications ranging from thermal management to energy conversion. In this work, we propose a neural network (NN) model that allows ready and accurate prediction of the κL of crystalline materials at arbitrary temperature. It is found that the data-driven model exhibits a high coefficient of determination between the real and predicted κL. Beyond the initial dataset, the strong predictive power of the NN model is further demonstrated by checking several systems randomly selected from previous first-principles studies. Most importantly, our model can realize high-throughput screening on countless systems either inside or beyond the existing databases, which is very beneficial for accelerated discovery or design of new materials with desired κL.
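
For readers unfamiliar with the metric, the coefficient of determination mentioned above can be computed as in the minimal sketch below. The function name and the sample values are hypothetical and are not taken from the paper. The sketch uses the common definition R² = 1 − SS_res / SS_tot.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination between reference and predicted values."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the reference values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only.
reference = [1.0, 2.0, 3.0]
perfect = r_squared(reference, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared(reference, [2.0, 2.0, 2.0])  # always predicts the mean
print(perfect, mean_only)  # 1.0 0.0
```

A value close to 1 means the predictions track the reference values closely, while predicting the mean for every sample gives 0.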

Pages: 204-210 PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00286e?page=search
Citations: 0