How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.
{"title":"Every atom counts: predicting sites of reaction based on chemistry within two bonds†","authors":"Ching Ching Lam and Jonathan M. Goodman","doi":"10.1039/D4DD00092G","DOIUrl":"https://doi.org/10.1039/D4DD00092G","url":null,"abstract":"<p >How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1878-1888"},"PeriodicalIF":6.2,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00092g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon
A graphical abstract is available for this content
{"title":"Introduction to “Accelerate Conference 2022”","authors":"Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon","doi":"10.1039/D4DD90036G","DOIUrl":"https://doi.org/10.1039/D4DD90036G","url":null,"abstract":"<p >A graphical abstract is available for this content</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1659-1661"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90036g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An external chemical substance (which may be a medicinal drug or an environmental exposure chemical), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamic events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a compound bound to plasma proteins, affecting its distribution and duration of action. Compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. Hepatocyte intrinsic clearance, on the other hand, represents the liver's capacity to eliminate a compound through metabolism and is a critical determinant of its elimination half-life. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. The recent expansion of computational resources has enabled the development of various in silico predictive models as alternatives to animal experimentation. In this work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for predicting compounds' plasma protein fraction unbound values and hepatocyte intrinsic clearance. We built regression-based models with the human protein fraction unbound (fu) data set (n = 1812) and a classification-based model with the human hepatocyte intrinsic clearance (Clint) data set (n = 1241), both collected from the recently published ICE (Integrated Chemical Environment) database. We further analyzed the influence of plasma protein binding on hepatocyte intrinsic clearance by considering the compounds that have both target variable values. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, whereas for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We also predicted these pharmacokinetic parameters with the similarity-based read-across (RA) method. A Python-based tool for predicting the endpoints has been developed and is available at https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.
{"title":"Insights into pharmacokinetic properties for exposure chemicals: predictive modelling of human plasma fraction unbound (fu) and hepatocyte intrinsic clearance (Clint) data using machine learning†","authors":"Souvik Pore and Kunal Roy","doi":"10.1039/D4DD00082J","DOIUrl":"https://doi.org/10.1039/D4DD00082J","url":null,"abstract":"<p >An external chemical substance (which may be a medicinal drug or an exposome), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamics events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a chemical compound bound to plasma proteins, affecting the distribution and duration of action of the compound. The compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. On the other hand, hepatocyte intrinsic clearance represents the liver's capacity to eliminate a chemical compound through metabolism. It is a critical determinant of the elimination half-life of the chemical substance. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. Recently, the huge expansion of computational resources has led to the development of various <em>in silico</em> models to generate predictive models as an alternative to animal experimentation. In this research work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for the prediction of the compound's plasma protein fraction unbound values and hepatocyte intrinsic clearance. Here, we have developed regression-based models with the protein fraction unbound (<em>f</em><small><sub>u</sub></small>) human data set (<em>n</em> = 1812) and a classification-based model with the hepatocyte intrinsic clearance (Cl<small><sub>int</sub></small>) human data set (<em>n</em> = 1241) collected from the recently published ICE (Integrated Chemical Environment) database. We have further analyzed the influence of the plasma protein binding on the hepatocyte intrinsic clearance, by considering the compounds having both types of target variable values. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, but for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We have further made predictions of these important pharmacokinetic parameters through the similarity-based read-across (RA) method. 
A Python-based tool for predicting the endpoints has been developed and made available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1852-1877"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00082j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
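The following is a minimal, generic sketch of a descriptor-based QSAR workflow of the sort described above: an SVM regressor for a continuous endpoint (such as fu) and a random-forest classifier for a binarized endpoint (such as Clint), using RDKit descriptors and scikit-learn. The molecules, endpoint values, descriptors, and hyperparameters are all invented for illustration and are not taken from the paper's data sets.

# Illustrative QSAR-style sketch: SVM regression for a continuous endpoint (e.g. fu)
# and random-forest classification for a binary endpoint (e.g. high/low Clint).
# Molecules, endpoint values and hyperparameters are placeholders, not the paper's data.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC", "CC(C)Cc1ccc(C)cc1"]
X = np.array([descriptors(s) for s in smiles])
fu = np.array([0.95, 0.60, 0.80, 0.70, 0.10])   # made-up fraction-unbound values
clint_class = np.array([0, 0, 1, 1, 1])         # made-up high/low clearance labels

X_tr, X_te, y_tr, y_te = train_test_split(X, fu, test_size=0.4, random_state=1)
svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("fu predictions:", svr.predict(X_te))

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, clint_class)
print("Clint class probabilities:", rf.predict_proba(X)[:, 1])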
Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.
{"title":"Dismai-Bench: benchmarking and designing generative models using disordered materials and interfaces†","authors":"Adrian Xiao Bin Yong, Tianyu Su and Elif Ertekin","doi":"10.1039/D4DD00100A","DOIUrl":"https://doi.org/10.1039/D4DD00100A","url":null,"abstract":"<p >Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1889-1909"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00100a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo
Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and an iteratively growing data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) how strongly does this minimal subset depend on the sampling strategy managing the data pool? and (c) what is the cost associated with model calibration? Using case studies with different types of microstructure (composite vs. spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions with two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure, and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations can be effective with only a small number of property evaluations when combined with different active learning strategies. However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.
{"title":"Active learning for regression of structure–property mapping: the importance of sampling and representation†","authors":"Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo","doi":"10.1039/D4DD00073K","DOIUrl":"10.1039/D4DD00073K","url":null,"abstract":"<p >Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and iteratively increasing the data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) Is this minimal subset highly dependent on the sampling strategy managing the datapool? And (c) what is the cost associated with the model calibration? Using case studies with different types of microstructure (composite <em>vs.</em> spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions using two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations (graph-based descriptors and two-point correlation functions) can be effective with only a small quantity of property evaluations when combined with different active learning strategies. However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1997-2009"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00073k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao
The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in this automation, and improving their accuracy and predictive capability further reduces the cost of the optimization process. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a proxy model, has been devised. Adaptive strategies are implemented during model training, rather than in the acquisition function, to improve the model's predictive performance. The algorithm was used to optimize the continuous flow synthesis of pyridinylbenzamide, an important pharmaceutical intermediate, via the Buchwald–Hartwig reaction. It achieved a yield of 79.1% in under 30 rounds of iterative optimization, and a subsequent optimization with reduced prior data cut the number of experiments by 27.6%, significantly lowering experimental costs. The experimental results indicate that the reaction is kinetically controlled. This work offers ideas for optimizing similar reactions and new directions for automated optimization in continuous flow synthesis.
{"title":"Self-optimizing Bayesian for continuous flow synthesis process†","authors":"Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao","doi":"10.1039/D4DD00223G","DOIUrl":"10.1039/D4DD00223G","url":null,"abstract":"<p >The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in the automated optimization process. The increased accuracy and predictive capability of the algorithms will further mitigate the costs associated with optimization processes. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a proxy model, has been devised. Adaptive strategies are implemented during the model training process, rather than on the acquisition function, to elevate the modeling efficacy of the model. This algorithm facilitated optimizing the continuous flow synthesis process of pyridinylbenzamide, an important pharmaceutical intermediate, <em>via</em> the Buchwald–Hartwig reaction. Achieving a yield of 79.1% in under 30 rounds of iterative optimization, subsequent optimization with reduced prior data resulted in a successful 27.6% reduction in the number of experiments, significantly lowering experimental costs. Based on the experimental results, it can be concluded that the reaction is kinetically controlled. It provides ideas for optimizing similar reactions and new research ideas in continuous flow automated optimization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1958-1966"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00223g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan
Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design boost the fields of chemistry, drugs, and materials. Foremost in performing these tasks is how to describe and encode the molecular structure for the computer, i.e., to translate what the human eye sees into something machine-readable. In this work, a chemical structure information extraction method termed connectivity stepwise derivation (CSD) for generating the full step matrix (MSF) is described in detail. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MSF generation. To test the speed of MSF generation, over 54 000 molecules were collected, covering organic molecules, polymers, and MOF structures. The tests show that as the number of atoms in a molecule increases from 100 to 1000, the CSD method has a growing advantage over the classical Floyd–Warshall algorithm, with the speed-up rising from 28.34× to 289.95× in the Python environment and from 2.86× to 25.49× in the C++ environment. The proposed CSD method, that is, a systematic treatment of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials, and to facilitate the development of property modeling and molecular generation methods.
{"title":"Connectivity stepwise derivation (CSD) method: a generic chemical structure information extraction method for the full step matrix†","authors":"Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan","doi":"10.1039/D4DD00125G","DOIUrl":"10.1039/D4DD00125G","url":null,"abstract":"<p >Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design boost the fields of chemistry, drugs, and materials. Foremost in performing these advanced exploration tasks is how to describe/encode the molecular structure to the computer, <em>i.e.</em>, from what the human eye sees to what is machine-readable. In this effort, a chemical structure information extraction method termed connectivity step derivation (CSD) for generating the full step matrix (MS<small><sub>F</sub></small>) is exhaustively depicted. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MS<small><sub>F</sub></small> generation. For testing the run speed of the MS<small><sub>F</sub></small> generation, over 54 000 molecules have been collected covering organic molecules, polymers, and MOF structures. Test outcomes show that as the number of atoms in a molecule increases from 100 to 1000, the CSD method has an increasing advantage over the classical Floyd–Warshall algorithm, with the running speed rising from 28.34 to 289.95 times in the Python environment and from 2.86 to 25.49 times in the C++ environment. The proposed CSD method, that is, the elaboration of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials as well as facilitating the development of property modeling and molecular generation methods.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1842-1851"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00125g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli
Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful of these already exist within the world of computational materials discovery, but their dynamic capabilities are somewhat lacking. The PerQueue workflow manager addresses this need. By utilizing modular and dynamic building blocks to define a workflow explicitly before starting, PerQueue gives a better overview of the workflow while allowing full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery: high-throughput screening with density functional theory; active learning to train a machine-learning interatomic potential with molecular dynamics; reuse of this potential for kinetic Monte Carlo simulations of extended systems; and an active-learning-accelerated image segmentation procedure with a human in the loop.
{"title":"PerQueue: managing complex and dynamic workflows†","authors":"Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli","doi":"10.1039/D4DD00134F","DOIUrl":"10.1039/D4DD00134F","url":null,"abstract":"<p >Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful of these already exist within the world of computational materials discovery, but their dynamic capabilities are somewhat lacking. The PerQueue workflow manager is the answer to this need. By utilizing modular and dynamic building blocks to define a workflow explicitly before starting, PerQueue can give a better overview of the workflow while allowing full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery. These encapsulate high-throughput screening with Density Functional Theory, using active learning to train a Machine-Learning Interatomic Potential with Molecular Dynamics and reusing this potential for kinetic Monte Carlo simulations of extended systems. Lastly, it is used for an active-learning-accelerated image segmentation procedure with a human-in-the-loop.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1832-1841"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00134f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López
Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab, an automated electrochemical platform designed for molecular electrochemistry that uses open-source software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution-handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platform's capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms across 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. The high versatility, transferability, and ease of implementation of eLab promise the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.
{"title":"An automated electrochemistry platform for studying pH-dependent molecular electrocatalysis†","authors":"Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López","doi":"10.1039/D4DD00186A","DOIUrl":"10.1039/D4DD00186A","url":null,"abstract":"<p >Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab as an automated electrochemical platform designed for molecular electrochemistry that uses opensource software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platforms capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms with 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. The high versatility, transferability, and ease of implementation of eLab promises the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1812-1821"},"PeriodicalIF":6.2,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00186a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
{"title":"Extracting structured data from organic synthesis procedures using a fine-tuned large language model†","authors":"Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley","doi":"10.1039/D4DD00091A","DOIUrl":"10.1039/D4DD00091A","url":null,"abstract":"<p >The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), manual conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (<em>e.g.</em>, full compound, workups, or condition definitions) and 92.25% for individual data fields (<em>e.g.</em>, compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1822-1831"},"PeriodicalIF":6.2,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00091a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}