Digital Discovery: latest articles

Every atom counts: predicting sites of reaction based on chemistry within two bonds†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-16 DOI: 10.1039/D4DD00092G
Ching Ching Lam and Jonathan M. Goodman

How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.
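The "two-bond" idea can be illustrated with a short sketch. The abstract does not give the authors' implementation, so everything below is an assumption for illustration: the molecule is a hand-written heavy-atom graph, the function name `atoms_within_bonds` is hypothetical, and the descriptor is just the element labels found one and two bonds away.

```python
from collections import deque

def atoms_within_bonds(adjacency, start, max_bonds=2):
    """Return {atom_index: bond_distance} for atoms within max_bonds of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_bonds:
            continue  # do not expand beyond the cutoff
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance

# Heavy-atom graph of 1-butanol, C0-C1-C2-C3-O4 (hydrogens omitted, illustrative only).
butanol = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
elements = ["C", "C", "C", "C", "O"]

for atom in butanol:
    env = atoms_within_bonds(butanol, atom)
    # Local environment as (bond distance, element) pairs, excluding the atom itself.
    shell = sorted((d, elements[i]) for i, d in env.items() if i != atom)
    print(atom, elements[atom], shell)
```

A per-atom classifier would then be trained on features derived from such environments; which features the paper actually uses (beyond bond strength and connectivity) is not stated in the abstract.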

Journal: Digital discovery Volume: 9 Pages: 1878-1888 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00092g?page=search
Citations: 0
Introduction to “Accelerate Conference 2022”
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-15 DOI: 10.1039/D4DD90036G
Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon

A graphical abstract is available for this content

Journal: Digital discovery Volume: 9 Pages: 1659-1661 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90036g?page=search
Citations: 0
Insights into pharmacokinetic properties for exposure chemicals: predictive modelling of human plasma fraction unbound (fu) and hepatocyte intrinsic clearance (Clint) data using machine learning†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-15 DOI: 10.1039/D4DD00082J
Souvik Pore and Kunal Roy

An external chemical substance (which may be a medicinal drug or an exposome), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamic events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a chemical compound bound to plasma proteins, affecting the distribution and duration of action of the compound. Compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. Hepatocyte intrinsic clearance, on the other hand, represents the liver's capacity to eliminate a chemical compound through metabolism. It is a critical determinant of the elimination half-life of the chemical substance. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. Recently, the huge expansion of computational resources has led to the development of various in silico predictive models as an alternative to animal experimentation. In this research work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for the prediction of a compound's plasma fraction unbound and hepatocyte intrinsic clearance. We developed regression-based models with the human fraction unbound (fu) data set (n = 1812) and a classification-based model with the human hepatocyte intrinsic clearance (Clint) data set (n = 1241) collected from the recently published ICE (Integrated Chemical Environment) database. We further analyzed the influence of plasma protein binding on hepatocyte intrinsic clearance by considering the compounds for which both target variables are available. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, whereas for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We also predicted these pharmacokinetic parameters with the similarity-based read-across (RA) method. A Python-based tool for predicting the endpoints has been developed and made available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.
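Similarity-based read-across can be sketched as a similarity-weighted average over the most similar training compounds. This is a generic illustration, not the authors' tool: the feature sets, the fu values, and the function names `tanimoto` and `read_across` are all made up for the example, and the paper's actual similarity measure and weighting are not given in the abstract.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature identifiers."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def read_across(query, training, k=3):
    """Similarity-weighted mean endpoint of the k most similar training compounds."""
    scored = sorted(((tanimoto(query, fp), y) for fp, y in training), reverse=True)[:k]
    total = sum(s for s, _ in scored)
    if total == 0:
        raise ValueError("no training compound shares any feature with the query")
    return sum(s * y for s, y in scored) / total

# Hypothetical training set: (structural feature set, measured fraction unbound).
training = [
    ({1, 2, 3, 4}, 0.10),
    ({1, 2, 3, 5}, 0.20),
    ({7, 8, 9}, 0.90),
    ({1, 2, 6}, 0.30),
]
query = {1, 2, 3}

print(read_across(query, training, k=2))
print(read_across(query, training, k=3))
```

With k = 2 only the two closest analogues contribute; raising k pulls in the less similar third compound with a proportionally smaller weight.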

Journal: Digital discovery Volume: 9 Pages: 1852-1877 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00082j?page=search
Citations: 0
Dismai-Bench: benchmarking and designing generative models using disordered materials and interfaces†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-15 DOI: 10.1039/D4DD00100A
Adrian Xiao Bin Yong, Tianyu Su and Elif Ertekin

Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.

Journal: Digital discovery Volume: 9 Pages: 1889-1909 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00100a?page=search
Citations: 0
Active learning for regression of structure–property mapping: the importance of sampling and representation†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-12 DOI: 10.1039/D4DD00073K
Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo

Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and iteratively increasing the data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) Is this minimal subset highly dependent on the sampling strategy managing the datapool? And (c) what is the cost associated with the model calibration? Using case studies with different types of microstructure (composite vs. spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions using two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations (graph-based descriptors and two-point correlation functions) can be effective with only a small quantity of property evaluations when combined with different active learning strategies. However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.
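The question of how few property evaluations are needed can be made concrete with a toy pool-based active learning loop. This is only a sketch under loud assumptions: the one-dimensional descriptor, the `property_oracle` function, the 1-nearest-neighbour regressor, and the max-min sampling rule are all stand-ins chosen for brevity, not the representations or strategies used in the paper.

```python
import math

def property_oracle(x):
    """Stand-in for an expensive property evaluation (hypothetical function)."""
    return math.sin(3.0 * x) + 0.5 * x

def predict(x, labelled):
    """1-nearest-neighbour regression on the labelled (descriptor, property) pairs."""
    nearest = min(labelled, key=lambda item: abs(item[0] - x))
    return nearest[1]

def select_next(pool, labelled):
    """Max-min sampling: pick the pool point farthest from every labelled point."""
    return max(pool, key=lambda x: min(abs(x - xl) for xl, _ in labelled))

def active_learning(candidates, budget):
    """Label up to `budget` candidates, one per iteration, starting from the first one."""
    pool = list(candidates)
    first = pool.pop(0)
    labelled = [(first, property_oracle(first))]
    while len(labelled) < budget and pool:
        x = select_next(pool, labelled)
        pool.remove(x)
        labelled.append((x, property_oracle(x)))
    return labelled

def mean_abs_error(candidates, labelled):
    """Average prediction error over the whole candidate library."""
    errors = [abs(predict(x, labelled) - property_oracle(x)) for x in candidates]
    return sum(errors) / len(errors)

# 100 candidate "microstructures", each reduced to a single descriptor value in [0, 2].
candidates = [i / 99 * 2.0 for i in range(100)]
for budget in (5, 10, 20):
    labelled = active_learning(candidates, budget)
    print(budget, round(mean_abs_error(candidates, labelled), 4))
```

Swapping `select_next` for another rule (for example random selection) is the kind of sampling-strategy comparison the abstract refers to.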

Journal: Digital discovery Volume: 10 Pages: 1997-2009 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00073k?page=search
Citations: 0
Self-optimizing Bayesian for continuous flow synthesis process†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-12 DOI: 10.1039/D4DD00223G
Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao

The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in this automated process, and improving their accuracy and predictive capability will further reduce the cost of optimization. A self-optimizing Bayesian algorithm (SOBayesian), which uses Gaussian process regression as a proxy (surrogate) model, has been devised. Adaptive strategies are applied during model training, rather than to the acquisition function, to improve modeling efficacy. The algorithm was used to optimize the continuous flow synthesis of pyridinylbenzamide, an important pharmaceutical intermediate, via the Buchwald–Hartwig reaction. It achieved a yield of 79.1% in under 30 rounds of iterative optimization; a subsequent optimization with reduced prior data cut the number of experiments by 27.6%, significantly lowering experimental costs. Based on the experimental results, it can be concluded that the reaction is kinetically controlled. The work provides ideas for optimizing similar reactions and new research directions in continuous flow automated optimization.
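For readers unfamiliar with Bayesian optimization, the loop of "fit a Gaussian process surrogate, pick the next experiment with an acquisition function, repeat" can be sketched in plain Python. This is a textbook GP with an upper-confidence-bound rule, not SOBayesian: the adaptive training strategy described in the abstract is not reproduced, and `expensive_yield`, the kernel length scale, the noise level, and `kappa` are all assumed values for illustration.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def forward_solve(lower, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i, row in enumerate(lower):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def backward_solve(lower, z):
    """Solve L^T x = z by back substitution."""
    n = len(lower)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (z[i] - s) / lower[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = backward_solve(L, forward_solve(L, ys))
    k_star = [rbf(x_new, x) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_solve(L, k_star)
    variance = max(1.0 - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(variance)

def expensive_yield(x):
    """Hypothetical stand-in for one flow experiment (x = a normalised condition)."""
    return 0.8 * math.exp(-((x - 0.63) / 0.15) ** 2)

def optimise(rounds=10, kappa=2.0):
    """Run a GP-UCB loop on a grid of candidate conditions."""
    grid = [i / 100 for i in range(101)]
    xs = [0.0, 0.5, 1.0]                      # initial experiments
    ys = [expensive_yield(x) for x in xs]
    for _ in range(rounds):
        candidates = [x for x in grid if x not in xs]

        def ucb(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std         # favour high mean or high uncertainty

        x_next = max(candidates, key=ucb)
        xs.append(x_next)
        ys.append(expensive_yield(x_next))
    best = max(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best], xs

best_x, best_y, history = optimise()
print(best_x, round(best_y, 3), len(history))
```

In a real campaign the call to `expensive_yield` would be replaced by running the flow reactor and measuring the yield, and the condition space would have several dimensions rather than one.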

Journal: Digital discovery Volume: 10 Pages: 1958-1966 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00223g?page=search
Citations: 0
Connectivity stepwise derivation (CSD) method: a generic chemical structure information extraction method for the full step matrix†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-08 DOI: 10.1039/D4DD00125G
Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan

Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design are boosting the fields of chemistry, drugs, and materials. The foremost question in performing these tasks is how to describe/encode the molecular structure for the computer, i.e., how to get from what the human eye sees to what is machine-readable. In this work, a chemical structure information extraction method termed connectivity stepwise derivation (CSD), which generates the full step matrix (MSF), is described in detail. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MSF generation. To test the speed of MSF generation, over 54 000 molecules were collected, covering organic molecules, polymers, and MOF structures. The tests show that as the number of atoms in a molecule increases from 100 to 1000, the advantage of the CSD method over the classical Floyd–Warshall algorithm grows, with the speed-up rising from 28.34 to 289.95 times in the Python environment and from 2.86 to 25.49 times in the C++ environment. The proposed CSD method, a detailed treatment of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials, and to facilitate the development of property modeling and molecular generation methods.
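To show what a full step matrix is, the sketch below computes the number of bonds on the shortest path between every pair of atoms in two ways: the classical Floyd–Warshall algorithm named in the abstract, and a breadth-first search from each atom. The BFS version is a standard alternative used here for comparison; it is not claimed to be the CSD algorithm, whose details are not given in the abstract. The toluene graph and all function names are illustrative assumptions.

```python
from collections import deque

def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 adjacency matrix from a list of (atom, atom) bonds."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a][b] = adj[b][a] = 1
    return adj

def step_matrix_bfs(adj):
    """All-pairs bond counts via one breadth-first search per atom."""
    n = len(adj)
    neighbours = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    steps = [[-1] * n for _ in range(n)]      # -1 marks "not connected"
    for source in range(n):
        steps[source][source] = 0
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for nb in neighbours[atom]:
                if steps[source][nb] < 0:
                    steps[source][nb] = steps[source][atom] + 1
                    queue.append(nb)
    return steps

def step_matrix_floyd_warshall(adj):
    """All-pairs bond counts via the classical O(n^3) Floyd-Warshall update."""
    n = len(adj)
    inf = float("inf")
    dist = [[0 if i == j else (1 if adj[i][j] else inf) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return [[-1 if d == inf else d for d in row] for row in dist]

# Heavy atoms of toluene: ring carbons C0..C5, methyl carbon C6 attached to C0.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
adj = adjacency_matrix(7, bonds)
steps = step_matrix_bfs(adj)

for row in steps:
    print(row)
```

On sparse molecular graphs the per-atom search touches each bond a bounded number of times, whereas Floyd–Warshall always performs n³ updates, which is one plausible reason a connectivity-driven method can pull ahead as molecules grow.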

Journal: Digital discovery Volume: 9 Pages: 1842-1851 PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00125g?page=search
Citations: 0
PerQueue: managing complex and dynamic workflows†
IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-08 DOI: 10.1039/D4DD00134F
Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli

Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful already exist in computational materials discovery, but their dynamic capabilities are somewhat limited. The PerQueue workflow manager addresses this need. By using modular and dynamic building blocks to define a workflow explicitly before it starts, PerQueue gives a better overview of the workflow while retaining full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery: high-throughput screening with Density Functional Theory; active learning to train a Machine-Learning Interatomic Potential with Molecular Dynamics; reuse of this potential for kinetic Monte Carlo simulations of extended systems; and an active-learning-accelerated image segmentation procedure with a human in the loop.
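
To make the idea of a workflow that is declared up front yet still contains dynamic parts more concrete, here is a minimal generic sketch. It is not PerQueue's API: the `Step`, `Loop`, and `execute` names are hypothetical and exist only for illustration. The whole workflow is written down before anything runs, while the number of loop iterations is decided at run time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Step:
    """A named building block; `run` maps the shared state to a new state."""
    name: str
    run: Callable[[Dict], Dict]


@dataclass
class Loop:
    """A dynamic block: repeat `body` until `done(state)` holds or `max_iter` is reached."""
    body: List[Step]
    done: Callable[[Dict], bool]
    max_iter: int = 10


def execute(workflow: List[Union[Step, Loop]], state: Dict) -> Dict:
    """Run the blocks in order, threading the state dictionary through them."""
    for block in workflow:
        if isinstance(block, Step):
            state = block.run(state)
        else:
            for _ in range(block.max_iter):
                for step in block.body:
                    state = step.run(state)
                if block.done(state):
                    break
    return state


# Toy active-learning loop: each round halves a stand-in "error" value.
workflow = [
    Step("initialise", lambda s: {**s, "error": 1.0, "rounds": 0}),
    Loop(
        body=[
            Step("label", lambda s: {**s, "rounds": s["rounds"] + 1}),
            Step("retrain", lambda s: {**s, "error": s["error"] / 2}),
        ],
        done=lambda s: s["error"] < 0.1,
    ),
    Step("report", lambda s: {**s, "finished": True}),
]

result = execute(workflow, {})
print(result)
```

The structure is fully visible in the `workflow` list before execution, which is the kind of overview the abstract describes. A real workflow manager would also handle scheduling, persistence, and failure recovery, none of which this sketch attempts.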

Digital Discovery, 2024, issue 9, pp. 1832–1841. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00134f?page=search
Citations: 0
An automated electrochemistry platform for studying pH-dependent molecular electrocatalysis†
IF 6.2 | Q1 | Chemistry, Multidisciplinary | Pub Date: 2024-08-05 | DOI: 10.1039/D4DD00186A
Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López

Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab, an automated electrochemical platform designed for molecular electrochemistry that uses open-source software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution-handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platform's capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms under 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. The high versatility, transferability, and ease of implementation of eLab promise the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.
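
The throughput figures quoted in the abstract can be turned into average rates with a few lines of arithmetic. The sketch below uses only the three numbers stated above (684 voltammograms, 171 solution conditions, 16 hours). The results are averages over the whole run, including solution handling and titration, so they should not be read as the duration of an individual scan.

```python
# Figures taken from the abstract.
voltammograms = 684
conditions = 171
hours = 16

total_minutes = hours * 60

# Average number of voltammograms recorded per solution condition.
per_condition = voltammograms / conditions

# Average wall-clock time spent per condition and per voltammogram.
minutes_per_condition = total_minutes / conditions
minutes_per_voltammogram = total_minutes / voltammograms

print(f"{per_condition:.0f} voltammograms per condition")
print(f"{minutes_per_condition:.1f} min per condition")
print(f"{minutes_per_voltammogram:.1f} min per voltammogram")
```

On average this works out to four voltammograms per solution condition and a little under six minutes per condition.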

Digital Discovery, 2024, issue 9, pp. 1812–1821. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00186a?page=search
Citations: 0
Extracting structured data from organic synthesis procedures using a fine-tuned large language model†
IF 6.2 | Q1 | Chemistry, Multidisciplinary | Pub Date: 2024-07-31 | DOI: 10.1039/D4DD00091A
Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley

The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
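
As an illustration of what a field-level accuracy could measure, the sketch below compares an extracted record against a reference record leaf by leaf. This is an assumption-laden toy, not the authors' evaluation procedure: the record layout and field names (`inputs`, `role`, `mass_g`, and so on) are hypothetical and do not follow the actual ORD schema, and the helper names `flatten` and `field_accuracy` are invented here.

```python
from typing import Any, Iterator, Tuple


def flatten(record: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a nested dict/list record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, record


def field_accuracy(reference: Any, extracted: Any) -> float:
    """Fraction of reference leaf fields reproduced exactly in the extraction."""
    ref = dict(flatten(reference))
    ext = dict(flatten(extracted))
    if not ref:
        return 1.0
    correct = sum(1 for path, value in ref.items()
                  if path in ext and ext[path] == value)
    return correct / len(ref)


# Hypothetical records: the extraction mislabels the solvent's role.
reference = {
    "inputs": [
        {"name": "benzaldehyde", "role": "REACTANT", "mass_g": 1.06},
        {"name": "ethanol", "role": "SOLVENT", "volume_mL": 20},
    ],
    "temperature_C": 25,
}
extracted = {
    "inputs": [
        {"name": "benzaldehyde", "role": "REACTANT", "mass_g": 1.06},
        {"name": "ethanol", "role": "REACTANT", "volume_mL": 20},
    ],
    "temperature_C": 25,
}

accuracy = field_accuracy(reference, extracted)
print(f"field accuracy: {accuracy:.3f}")
```

Exact matching by position is deliberately strict. A practical evaluation would likely need to align list items that appear in a different order and to tolerate equivalent names or units.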

Digital Discovery, 2024, issue 9, pp. 1822–1831. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00091a?page=search
Citations: 0