Digital Discovery: latest articles

Knowledge Graph Representation of Zeolitic Crystalline Materials
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-13 DOI: 10.1039/d4dd00166d
Aleksandar Kondinski, Pavlo Rutkevych, Laura Pascazio, Dan Tran, Feroz Farazi, Srishti Ganguly, Markus Kraft
Zeolites are complex and porous crystalline inorganic materials that serve as hosts for a variety of molecular, ionic and cluster species. Formal, machine-actionable representation of this chemistry presents a challenge as a variety of concepts need to be semantically interlinked. This work demonstrates the potential of knowledge engineering in overcoming this challenge. We develop ontologies OntoCrystal and OntoZeolite, enabling the representation and instantiation of crystalline zeolite information into a dynamic, interoperable knowledge graph called The World Avatar (TWA). In TWA, crystalline zeolite instances are semantically interconnected with chemical species that act as guests in these materials. Information can be obtained via custom or templated SPARQL queries administered through a user-friendly web interface. Unstructured exploration is facilitated through natural language processing using the Marie System, showcasing promise for the blended large language model – knowledge graph approach in providing accurate responses on zeolite chemistry in natural language.
Citations: 0
Physics-based reward driven image analysis in microscopy
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-12 DOI: 10.1039/d4dd00132j
Kamyar Barakati, Hui Yuan, Amit Goyal, Sergei V. Kalinin
The rise of electron microscopy has expanded our ability to acquire nanometer and atomically resolved images of complex materials. The resulting vast datasets are typically analyzed by human operators, an intrinsically challenging process due to the multiple possible analysis steps and the corresponding need to build and optimize complex analysis workflows. We present a methodology based on the concept of a Reward Function coupled with Bayesian Optimization to optimize image analysis workflows dynamically. The Reward Function is engineered to closely align with the experimental objectives and broader context and is quantifiable upon completion of the analysis. Here, cross-section, high-angle annular dark field (HAADF) images of ion-irradiated (Y, Dy)Ba2Cu3O7-δ thin films were used as a model system. The reward functions were formed based on the expected materials density and atomic spacings and used to drive multi-objective optimization of the classical Laplacian-of-Gaussian (LoG) method. These results can be benchmarked against deep convolutional neural network (DCNN) segmentation. The optimized LoG* compares favorably against DCNN in the presence of additional noise. We further extend the reward function approach towards the identification of partially disordered regions, creating a physics-driven reward function and action space of high-dimensional clustering. We posit that, with a correct definition, the reward function approach allows real-time optimization of complex analysis workflows at much higher speeds and lower computational costs than classical DCNN-based inference, ensuring the attainment of results that are both precise and aligned with the human-defined objectives.
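The idea of letting a physics-based reward choose analysis parameters can be illustrated with a deliberately small toy. The sketch below is not the authors' workflow: it swaps the LoG detector for a threshold-based peak counter on a 1D intensity profile and swaps Bayesian Optimization for an exhaustive search over a few candidate thresholds. The profile values, the expected atom count, and the function names are all hypothetical.

```python
def count_peaks(profile, threshold):
    """Count strict local maxima whose intensity exceeds the threshold."""
    return sum(
        1
        for i in range(1, len(profile) - 1)
        if profile[i] > threshold
        and profile[i] > profile[i - 1]
        and profile[i] > profile[i + 1]
    )


def reward(profile, threshold, expected_count):
    """Higher is better: penalize deviation from the physically expected count."""
    return -abs(count_peaks(profile, threshold) - expected_count)


# Made-up intensity profile: three bright atomic columns plus two noise bumps.
profile = [0.1, 0.9, 0.2, 0.3, 0.1, 0.8, 0.1, 0.35, 0.2, 1.0, 0.1]
expected_atoms = 3  # assumed to follow from the known material density

# Exhaustive search stands in for Bayesian Optimization in this toy.
candidates = [0.0, 0.25, 0.5, 0.75, 0.95]
best = max(candidates, key=lambda t: reward(profile, t, expected_atoms))
```

The point of the design is that the reward is computed from the analysis output alone (the detected count) and a physical expectation, so no labeled training images are needed to rank parameter settings.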
Citations: 0
A Machine Learning Approach for the Prediction of Aqueous Solubility of Pharmaceuticals: A Comparative Model and Dataset Analysis
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-09 DOI: 10.1039/d4dd00065j
Mohammad Amin Ghanavati, Soroush Ahmadi, Sohrab Rohani
The effectiveness of drug treatments depends significantly on the water solubility of compounds, which influences bioavailability and therapeutic outcomes. A reliable predictive solubility tool enables drug developers to swiftly identify drugs with low solubility and implement proactive solubility enhancement techniques. The current research proposes three predictive models based on four solubility datasets (ESOL, AQUA, PHYS, OCHEM), encompassing 3942 unique molecules. Three different molecular representations were obtained: electrostatic potential (ESP) maps, molecular graphs, and tabular features (extracted from ESP maps and tabular Mordred descriptors). We conducted 3942 DFT calculations to acquire ESP maps and extract features from them. Subsequently, we applied two deep learning models, EdgeConv and Graph Convolutional Network (GCN), to the point cloud (ESP) and graph modalities of molecules. In addition, we utilized a random forest-based feature selection on tabular features, followed by mapping with XGBoost. A t-SNE analysis visualized chemical space across datasets and unique molecules, providing valuable insights for model evaluation. The proposed machine learning (ML)-based models, trained on 80% of each dataset and evaluated on the remaining 20%, showcased superior performance, particularly with XGBoost utilizing the extracted and selected tabular features. This yielded average test data Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²) values of 0.458, 0.613, and 0.918, respectively.
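For readers less familiar with the three reported metrics, the following minimal sketch computes MAE, RMSE, and R² from paired observed and predicted values using their standard definitions. The numbers are made up for illustration and are not taken from the paper; the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up log-solubility values, for illustration only.
observed = [-2.0, -1.0, 0.0, 1.0]
predicted = [-1.5, -1.0, 0.5, 1.0]
mae, rmse, r2 = regression_metrics(observed, predicted)
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which is consistent with the reported 0.613 versus 0.458.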
Citations: 0
Outstanding Reviewers for Digital Discovery in 2023
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-06 DOI: 10.1039/d4dd90037e
We would like to take this opportunity to thank all of Digital Discovery’s reviewers for helping to preserve quality and integrity in chemical science literature. We would also like to highlight the Outstanding Reviewers for Digital Discovery in 2023.
Citations: 0
Automated processing of chromatograms: a comprehensive python package with a GUI for intelligent peak identification and deconvolution in chemical reaction analysis
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-05 DOI: 10.1039/d4dd00214h
Jan Obořil, Christian P. Haas, Maximilian Lübbesmeyer, Rachel Nicholls, Thorsten Gressling, Klavs F. Jensen, Giulio Volpin, Julius Hillenbrand
Reaction screening and high-throughput experimentation (HTE) coupled with liquid chromatography (HPLC and UHPLC) are becoming more important than ever in synthetic chemistry. With a growing number of experiments, it is increasingly difficult to ensure correct peak identification and integration, especially due to unknown side components which often overlap with the peaks of interest. We developed an improved version of the MOCCA Python package with a web-based graphical user interface (GUI) for automated processing of chromatograms, including baseline correction, intelligent peak picking, peak purity checks, deconvolution of overlapping peaks, and compound tracking. The individual automatic processing steps have been improved compared to the previous version of MOCCA to make the software more dependable and versatile. The algorithm accuracy was benchmarked using three datasets and compared to the previous MOCCA implementation and published results. The processing is fully automated with the possibility to include calibration and internal standards. The software supports chromatograms with photo-diode array detector (DAD) data from most commercial HPLC systems, and the Python package and GUI implementation are open-source to allow addition of new features and further development.
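Two of the processing steps named above, baseline correction and peak integration, can be sketched in a few lines. This is a simplified illustration and does not use or reproduce the MOCCA API: the baseline is assumed constant and estimated as the signal minimum, integration uses the trapezoidal rule, and the function names and data are hypothetical.

```python
def subtract_baseline(signal):
    """Subtract a constant baseline, estimated here as the minimum of the signal."""
    base = min(signal)
    return [s - base for s in signal]


def trapezoid_area(times, signal):
    """Integrate the signal over time with the trapezoidal rule."""
    return sum(
        0.5 * (signal[i] + signal[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(signal) - 1)
    )


# Made-up single-wavelength trace with one triangular peak on an offset of 2.0.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
absorbance = [2.0, 2.0, 6.0, 2.0, 2.0]

corrected = subtract_baseline(absorbance)
area = trapezoid_area(times, corrected)
```

Real chromatograms typically have drifting baselines and overlapping peaks, which is why a package of this kind needs the more elaborate correction and deconvolution steps the abstract lists.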
Citations: 0
Physics-Informed Neural Networks and Beyond: Enforcing Physical Constraints in Quantum Dissipative Dynamics
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-05 DOI: 10.1039/d4dd00153b
Arif Ullah, Yu Huang, Ming Yang, Pavlo O. Dral
Neural networks (NNs) accelerate simulations of quantum dissipative dynamics. Ensuring that these simulations adhere to fundamental physical laws is crucial, but has been largely ignored in state-of-the-art NN approaches. We show that this may lead to implausible results, as measured by violation of trace conservation. To recover the correct physical behavior, we develop physics-informed NNs (PINNs) that mitigate the violations to a good extent. Beyond that, we propose a novel uncertainty-aware approach that enforces perfect trace conservation by design, surpassing PINNs.
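Trace conservation means that the diagonal elements of the reduced density matrix (the state populations) sum to 1 at every time step. The sketch below shows how a violation can be measured and how a simple post-hoc rescaling restores the trace. Rescaling is only an illustrative stand-in: it is neither the PINN loss nor the uncertainty-aware approach proposed in the paper, and the predicted values and function names are hypothetical.

```python
def trace_violation(populations):
    """Absolute deviation of the trace (sum of diagonal elements) from 1."""
    return abs(sum(populations) - 1.0)


def renormalize(populations):
    """Rescale the diagonal elements so that they sum to 1."""
    total = sum(populations)
    return [p / total for p in populations]


# Hypothetical NN output for a two-state system at one time step.
predicted = [0.62, 0.41]

violation = trace_violation(predicted)   # about 0.03
fixed = renormalize(predicted)
```

A rescaling of this kind fixes the trace but says nothing about other constraints such as positivity, which is one reason to prefer approaches that build the constraint into the model itself.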
Citations: 0
Regio-MPNN: predicting regioselectivity for general metal-catalyzed cross-coupling reactions using a chemical knowledge informed message passing neural network
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-03 DOI: 10.1039/d4dd00244j
Baochen Li, Yuru Liu, Haibin Sun, Rentao Zhang, Yongli Xie, Klement Foo, Frankie S. Mak, Ruimao Zhang, Tianshu Yu, Sen Lin, Peng Wang, Xiaoxue Wang
As a fundamental problem in organic chemistry, synthesis planning aims at designing energy- and cost-efficient reaction pathways for target compounds. In synthesis planning, it is crucial to understand regioselectivity, or the preference of a reaction for one site over competing reaction sites. Precisely predicting regioselectivity enables early exclusion of unproductive reactions and paves the way to designing high-yielding synthetic routes with minimal separation and material costs. However, combining chemical knowledge and data-driven methods to make practical predictions of regioselectivity is still at an emerging stage. At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry, and have thus become one of the most frequently encountered types of reactions in synthesis planning. In this work, we introduce, for the first time, a chemical knowledge informed message passing neural network (MPNN) framework that directly identifies the intrinsic major products for metal-catalyzed cross-coupling reactions with regioselective ambiguity. Integrating both first principles methods and data-driven methods, our model achieves an overall accuracy of 96.51% on the test set of eight typical metal-catalyzed cross-coupling reaction types, including Suzuki–Miyaura, Stille, Sonogashira, Buchwald–Hartwig, Hiyama, Kumada, Negishi, and Heck reactions, outperforming other commonly used model types. To integrate electronic effects with steric effects in regioselectivity prediction, we propose a quantitative method to measure the steric hindrance effect. Our steric hindrance checker can successfully identify regioselectivity induced solely by steric hindrance. Notably, under practical scenarios, our model outperforms 6 experimental organic chemists with an average working experience of over 10 years in the organic synthesis industry in terms of predicting major products in regioselective cases. 
We have also exemplified the practical usage of our model by fixing routes designed by open-access synthesis planning software and improving reactions by identifying low-cost starting materials. To assist general chemists in making prompt decisions about regioselectivity, we have developed a free web-based AI-empowered tool. Our code and web tool have been made available at https://github.com/Chemlex-AI/regioselectivity and https://ai.tools.chemlex.com/region-choose, respectively.
Citations: 0
Extracting Recalcitrant Redox Data on Fluorophores to Pair with Optical Data for Predicting Small-Molecule, Ionic Isolation Lattices
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-03 DOI: 10.1039/d4dd00137k
Michaela K Loveless, Minwei Che, Alec Sanchez, Vikrant Tripathy, Bo W. Laursen, Sudhakar Pamidighantam, Krishnan Raghavachari, Amar H Flood
Redox and optical data of organic fluorophores are essential for using design rules and properties screening to identify new candidate dyes capable of forming optical materials. One such optical material is small-molecule, ionic isolation lattices (SMILES), which have properties defined by the optical and electrochemical properties of the fluorophores used. While optical data are available and readily extracted, the promise of digital discovery to mine the data and identify new dye candidates for making new fluorescent compounds is limited by experimental electrochemical data, which are reported with varying quality. We report methods to extract data from 20,000+ literature-reported dyes for generating a library of both redox and optical data comprising 206 dye-solvent entries. Wide heterogeneity in data collection and reporting practices necessitated a workflow involving manual data extraction, expert annotations of data quality, and validation. Chemometric analysis shows distributions of solvents, electrolytes, and reference electrodes used in electrochemistry and the distributions of dye families and molecular weights. Data were extracted and screened to identify fluorophores predicted to form fluorescent solids based on SMILES. Screening used three design rules requiring dyes to be cationic, have a redox window within –1.9 and +1.5 V (vs. ferrocene), and a size less than 2 nm. A set of 47 dyes complies with all design rules, showcasing the potential for using paired electrochemical-optical data in a workflow for designing optical materials.
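The three design rules lend themselves to a simple filter over a table of dye records. The sketch below is one possible reading, not the authors' code: it assumes that "a redox window within –1.9 and +1.5 V" means the reduction potential is no lower than –1.9 V and the oxidation potential no higher than +1.5 V (both vs. ferrocene), and it uses a hypothetical `Dye` record with made-up entries.

```python
from dataclasses import dataclass


@dataclass
class Dye:
    name: str
    charge: int      # net charge of the dye
    e_red: float     # reduction potential, V vs. ferrocene
    e_ox: float      # oxidation potential, V vs. ferrocene
    size_nm: float   # largest molecular dimension, nm


def passes_design_rules(dye, window=(-1.9, 1.5), max_size_nm=2.0):
    """Apply the three screening rules: cationic, redox window, size."""
    is_cationic = dye.charge > 0
    # Assumed interpretation: both potentials must lie inside the window.
    in_window = window[0] <= dye.e_red and dye.e_ox <= window[1]
    is_small = dye.size_nm < max_size_nm
    return is_cationic and in_window and is_small


# Made-up entries for illustration only.
library = [
    Dye("dye-A", +1, -1.2, 1.1, 1.4),   # satisfies all three rules
    Dye("dye-B", 0, -1.0, 0.9, 1.2),    # neutral, fails rule 1
    Dye("dye-C", +1, -2.1, 1.0, 1.5),   # reduction outside window, fails rule 2
    Dye("dye-D", +1, -1.5, 1.3, 2.6),   # too large, fails rule 3
]

hits = [d.name for d in library if passes_design_rules(d)]
```

Keeping each rule as a separate boolean makes it easy to report which rule excluded a given dye, which matters when the underlying electrochemical data vary in quality.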
Citations: 0
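The three design rules quoted in the abstract above (cationic dye, redox window between -1.9 and +1.5 V vs ferrocene, size below 2 nm) can be read as a simple filter over dye entries. The sketch below is an illustration only, not the authors' code. The `DyeEntry` fields, the inclusive treatment of the window limits, and the reading that both the reduction and the oxidation potential must fall inside the window are all assumptions; the abstract also does not say how molecular size is measured.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; potentials are in volts vs ferrocene.
REDOX_MIN_V = -1.9
REDOX_MAX_V = 1.5
MAX_SIZE_NM = 2.0


@dataclass
class DyeEntry:
    """Hypothetical record for one dye-solvent entry (field names are assumptions)."""
    name: str
    charge: int      # net formal charge of the dye
    e_red_v: float   # reduction potential, V vs ferrocene
    e_ox_v: float    # oxidation potential, V vs ferrocene
    size_nm: float   # molecular size in nm (measurement convention assumed)


def passes_design_rules(dye: DyeEntry) -> bool:
    """Apply the three screening rules: cationic, redox window, size."""
    is_cationic = dye.charge > 0
    # Assumed reading: both potentials must lie inside the quoted window.
    in_window = REDOX_MIN_V <= dye.e_red_v and dye.e_ox_v <= REDOX_MAX_V
    small_enough = dye.size_nm < MAX_SIZE_NM
    return is_cationic and in_window and small_enough


def screen(dyes: list[DyeEntry]) -> list[DyeEntry]:
    """Return only the entries that satisfy all three rules."""
    return [dye for dye in dyes if passes_design_rules(dye)]
```

Calling `screen` on a list of entries returns the compliant subset, which mirrors how the abstract arrives at a set of dyes that satisfy every rule. The example values used to exercise the filter would be made up and are not taken from the paper's database.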
Application of machine learning for predicting G9a inhibitors
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-09-02 DOI: 10.1039/d4dd00101j
Mariya L. Ivanova, Nicola Russo, Nadia Djaid, Konstantin Nikolic
Object and significance: the G9a enzyme is an epigenomic regulator, making gene expression directly dependent on how various substances in the cell affect this enzyme. Therefore, it is crucial to consider this impact in any biochemical research involving the development of new compounds introduced into the body. While this can be examined experimentally, it would be highly advantageous to predict these effects using computer simulations. Purpose: the purpose of the model was to help answer the question of what effect a compound under development could have on G9a activity, and thus reduce the need for laboratory experiments and facilitate faster and more productive research and development. Solution: the paper proposes a cost-effective machine learning model that determines whether a compound is an active G9a inhibitor. The proposed approach utilises the existing, very extensive PubChem database. The starting point was the quantitative high-throughput screening assay for inhibitors of histone lysine methyltransferase G9a (also available on PubChem), which screened around 350 000 compounds. For these compounds, datasets of 60 features were created. Then different ML algorithms were deployed to find the best-performing one, which can then be used to predict whether an untested compound would actively inhibit G9a. Results: six different ML classifiers were implemented on five dataset variations. The variants of the dataset were created by using two different data balancing approaches and by either including or excluding the influence of water solubility at a pH of 7.4. The most successful combination was a dataset with five features and a random forest classifier that reached 90% accuracy. The classifier was trained on 60 244 compounds and tested on 15 062 compounds. Feature reduction was obtained by analysing three different feature importance algorithms, which resulted in not only feature reduction but also some insights for further biochemical research.
Citations: 0
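The training and test counts quoted in the abstract above (60 244 and 15 062 compounds) can be checked with a little arithmetic. The sketch below shows that they add up to 75 306 compounds and correspond to an 80/20 split when the test-set size is rounded up. The rounding convention and the `split_sizes` helper are assumptions made for illustration; the abstract does not state how the split was produced.

```python
import math


def split_sizes(n_total: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test-set size up (assumed convention)."""
    n_test = math.ceil(test_fraction * n_total)
    return n_total - n_test, n_test


# Counts quoted in the abstract.
N_TRAIN_REPORTED = 60_244
N_TEST_REPORTED = 15_062

n_total = N_TRAIN_REPORTED + N_TEST_REPORTED
observed_test_fraction = N_TEST_REPORTED / n_total

print(n_total)                           # 75306
print(round(observed_test_fraction, 4))  # 0.2
print(split_sizes(n_total, 0.2))         # (60244, 15062)
```

The reported counts are therefore consistent with a conventional 80/20 train/test split of the modelling dataset.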
Pellet dispensomixer and pellet distributor: open hardware for nanocomposite space exploration via automated material compounding
Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2024-08-30 DOI: 10.1039/d4dd00198b
Miguel Hernández-del-Valle, Jorge Ilarraza-Zuazo, Enrique Dios-Lázaro, Javier Rubio, Joris Audoux, Maciej Haranczyk
The development of novel polymer-based nanocomposites necessitates the experimental preparation and characterization of numerous compositions to identify optimal formulations. For thermoplastic-based materials, the compounding process typically involves the labor-intensive tasks of dispensing, weighing, mixing, and extruding solid components such as polymers and additives. Herein, we present an open hardware solution that aims to automate this process. Our setup system is designed to streamline material surveying tasks associated with experimental design or closed-loop, self-driving laboratories. Our hardware setup consists of two main components: a multi-material pellet dispenser, which simplifies the preparation of targeted compositions from a range of master batches, and a pellet collector-distributor, which efficiently gathers and distributes processed materials into various containers throughout the experiment.
Citations: 0
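Preparing a targeted composition from master batches, as the abstract above describes, comes down to a mass balance. The sketch below is a hypothetical illustration and not the control software of the hardware: the function name, the dictionary layout, and the assumption that each additive comes from its own master batch made with the same base polymer as the neat pellets are all choices made here for the example.

```python
def masterbatch_masses(
    total_g: float,
    target_fraction: dict[str, float],
    masterbatch_loading: dict[str, float],
) -> dict[str, float]:
    """Return the mass in grams of each master batch and of neat polymer.

    target_fraction: desired weight fraction of each additive in the blend.
    masterbatch_loading: weight fraction of that additive in its master batch.
    """
    masses: dict[str, float] = {}
    for additive, w_target in target_fraction.items():
        w_mb = masterbatch_loading[additive]
        if not 0.0 < w_mb <= 1.0:
            raise ValueError(f"invalid master batch loading for {additive}")
        # Additive mass needed, divided by its concentration in the master batch.
        masses[additive] = total_g * w_target / w_mb

    neat_g = total_g - sum(masses.values())
    if neat_g < -1e-9:
        # The master batches are too dilute to reach the target in this batch size.
        raise ValueError("target composition is not reachable")
    masses["neat_polymer"] = max(neat_g, 0.0)
    return masses
```

For a 50 g batch with targets of 2 wt% and 1 wt% drawn from master batches loaded at 10 wt% and 5 wt%, the function asks for about 10 g of each master batch and about 30 g of neat polymer. These numbers are invented for the example.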
Journal
Digital discovery