Forecasting experimental chemical shifts of organic compounds is a long-standing challenge in organic chemistry. Recent advances in machine learning (ML) have led to routines that surpass the accuracy of ab initio density functional theory (DFT) in estimating experimental 13C shifts. The extraction of knowledge from other models, known as transfer learning, has demonstrated remarkable improvements, particularly in scenarios with limited data availability. However, the extent to which transfer learning improves predictive accuracy in low-data regimes for experimental chemical shift prediction remains unexplored. This study shows that atomic features derived from a message passing neural network (MPNN) forcefield are robust descriptors of atomic properties. A dense network using these descriptors to predict 13C shifts achieves a mean absolute error (MAE) of 1.68 ppm. When the same features are used as node labels in a simple graph neural network (GNN), the model attains a lower MAE of 1.34 ppm. In contrast, embeddings from a self-supervised, pre-trained 3D-aware transformer are not sufficiently descriptive for a feedforward model but show reasonable accuracy within the GNN framework, achieving an MAE of 1.51 ppm. Under low-data conditions, all transfer-learned models show a significant improvement in predictive accuracy over existing literature models, regardless of the sampling strategy used to select from the pool of unlabeled examples. We demonstrate that extracting atomic features from models trained on large and diverse datasets is an effective transfer learning strategy for predicting NMR chemical shifts, achieving results on par with existing literature models. This method offers several benefits, such as reduced training times, simpler models with fewer trainable parameters, and strong performance in low-data scenarios, without the need for costly ab initio data for the target property. The technique can be applied to other chemical tasks, opening many new potential applications where the amount of data is a limiting factor.
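The core recipe, freezing per-atom descriptors from a large pretrained model and fitting only a small readout on the scarce labeled shifts, can be sketched as follows. Everything below (the feature dimensions, the linear ground truth, the ridge readout) is an illustrative stand-in, not the authors' MPNN pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in the paper the per-atom descriptors come from a
# pretrained MPNN forcefield; here we fabricate 64-dimensional "frozen
# features" for 200 carbon atoms and a linear ground truth with noise.
n_atoms, n_feat = 200, 64
X = rng.normal(size=(n_atoms, n_feat))                 # frozen atomic features
w_true = rng.normal(size=n_feat)
y = X @ w_true + rng.normal(scale=0.1, size=n_atoms)   # simulated 13C shifts

# Lightweight readout head: closed-form ridge regression in place of a
# dense network. Only the head is fitted; the features stay frozen.
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)
mae = np.abs(X @ w - y).mean()
print(f"MAE of the readout head: {mae:.3f} ppm")
```

Because only the small head is trained, this is cheap and works with few labeled examples, which mirrors the low-data advantage reported above.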
Žarko Ivković, Jesús Jover and Jeremy Harvey, "Transfer learning based on atomic feature extraction for the prediction of experimental 13C chemical shifts", Digital Discovery, 2024, 2242–2251. DOI: 10.1039/D4DD00168K
Computational materials science produces large quantities of data, both from high-throughput calculations and from individual studies. Extracting knowledge from this large and heterogeneous pool is challenging due to the wide variety of computational methods and approximations, which introduce significant variability across the sheer volume of available data. One way of dealing with this problem is to use similarity measures, both to group data and to understand where possible differences may come from. Here, we present MADAS, a Python framework for computing similarity relations between material properties. It can be used to automate the download of data from various sources, compute descriptors and similarities between materials, and analyze the relationships between materials through their properties, and it can incorporate a variety of existing machine learning methods. We explain the architecture of the package and demonstrate its power with representative examples.
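As a minimal illustration of the similarity-based grouping idea (a toy sketch, not the MADAS API), materials can be represented by fingerprint vectors and compared through a cosine-similarity matrix:

```python
import numpy as np

# Hypothetical fingerprints: in practice these might be discretized
# densities of states or other computed property vectors.
fingerprints = {
    "mat_A": np.array([1.0, 0.0, 2.0, 0.5]),
    "mat_B": np.array([0.9, 0.1, 2.1, 0.4]),   # nearly identical to mat_A
    "mat_C": np.array([0.0, 3.0, 0.0, 1.0]),   # very different
}

names = list(fingerprints)
F = np.stack([fingerprints[n] for n in names])
# Normalize rows, then the Gram matrix is the cosine-similarity matrix.
norm = F / np.linalg.norm(F, axis=1, keepdims=True)
S = norm @ norm.T

for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"sim({a}, {b}) = {S[i, j]:.3f}")
```

Grouping then amounts to thresholding or clustering this matrix; a framework like MADAS additionally manages where the fingerprints come from and how they are kept comparable across data sources.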
Martin Kuban, Santiago Rigamonti and Claudia Draxl, "MADAS: a Python framework for assessing similarity in materials-science data", Digital Discovery, 2024. DOI: 10.1039/d4dd00258j
Ralf Wanzenböck, Esther Heid, Michele Riva, Giada Franceschi, Alexander M. Imre, Jesús Carrete, Ulrike Diebold and Georg K. H. Madsen
The investigation of inhomogeneous surfaces, where various local structures coexist, is crucial for understanding interfaces of technological interest, yet it presents significant challenges. Here, we study the atomic configurations of the (2 × m) Ti-rich surfaces of (110)-oriented SrTiO3 by bringing together scanning tunneling microscopy and transferable neural-network force fields combined with evolutionary exploration. We leverage an active learning methodology to iteratively extend the training data as needed for different configurations. Training on only small, well-known reconstructions, we are able to extrapolate to the complicated and diverse overlayers encountered in different regions of the inhomogeneous SrTiO3(110)-(2 × m) surface. Our machine-learning-backed approach generates several new candidate structures, in good agreement with experiment and verified using density functional theory. The approach could be extended to other complex metal oxides featuring large coexisting surface reconstructions.
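The active-learning loop, iteratively extending the training data where the model is least certain, can be sketched with an ensemble-disagreement acquisition rule. The target function, noise level, and cubic surrogate below are hypothetical stand-ins for the DFT labels and the neural-network force field:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_energy(x):
    return x ** 3 - x                 # hypothetical target energy surface

def measure(x):
    # Noisy labeling step, standing in for a DFT calculation.
    return true_energy(x) + rng.normal(scale=0.05)

pool = np.linspace(-2, 2, 81)         # unlabeled candidate "structures"
train_x = list(rng.choice(pool, size=6, replace=False))   # small seed set
train_y = [measure(x) for x in train_x]

def ensemble_predict(xs, k=5):
    # Ensemble disagreement as the uncertainty proxy: fit k cubic models
    # on bootstrap resamples, return mean and spread of their predictions.
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(train_x), size=len(train_x))
        coef = np.polyfit(np.array(train_x)[idx], np.array(train_y)[idx], deg=3)
        preds.append(np.polyval(coef, xs))
    P = np.stack(preds)
    return P.mean(axis=0), P.std(axis=0)

# Active-learning loop: label whichever candidate the ensemble disagrees on most.
for _ in range(8):
    _, std = ensemble_predict(pool)
    x_new = pool[int(np.argmax(std))]
    train_x.append(x_new)
    train_y.append(measure(x_new))

mean, _ = ensemble_predict(pool)
mae = np.abs(mean - true_energy(pool)).mean()
print(f"MAE over the candidate pool after active learning: {mae:.3f}")
```

The same pattern scales to force fields: the expensive oracle is the ab initio calculation, and acquisition targets configurations where the current model extrapolates poorly.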
"Exploring inhomogeneous surfaces: Ti-rich SrTiO3(110) reconstructions via active learning", Digital Discovery, 2024, 2137–2145. DOI: 10.1039/D4DD00231H
Yusaku Nakajima, Kai Kawasaki, Yasuo Takeichi, Masashi Hamaya, Yoshitaka Ushiku and Kanta Ono
We demonstrate a novel mechanochemical synthesis method using a robotic powder-grinding system that applies a precisely controlled, constant mechanical force. Despite its significance, applying a controllable constant force in macroscale mechanochemical synthesis has remained challenging. To address this gap, we compared the reproducibility of mechanochemical syntheses of perovskite materials using conventional manual grinding, ball milling, and our novel robotic approach. The robotic approach provides significantly higher reproducibility than the conventional methods, which greatly facilitates the analysis of reaction pathways. By manipulating the grinding force and speed, we further show that robotic force control can alter both the reaction rate and the reaction pathway. Robotic mechanochemical synthesis therefore has significant potential for elucidating chemical reaction mechanisms and fostering the discovery of new chemical reactions.
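The distinguishing ingredient, closed-loop control toward a constant force setpoint, can be illustrated with a minimal proportional-integral loop on a toy actuator model. The gains, setpoint, and first-order plant below are invented for illustration and are not the authors' hardware or controller:

```python
# PI control toward a constant grinding force on a simulated actuator.
setpoint = 5.0          # target normal force in N (hypothetical value)
force = 0.0             # measured force
kp, ki, dt = 0.8, 0.5, 0.05
integral = 0.0

history = []
for _ in range(500):
    error = setpoint - force
    integral += error * dt
    command = kp * error + ki * integral   # actuator command
    # Toy first-order plant: the measured force relaxes toward the command.
    force += (command - force) * 0.3
    history.append(force)

print(f"steady-state force: {history[-1]:.2f} N")
```

The integral term is what removes the steady-state offset, i.e. what keeps the applied force constant over a long grinding run, something a human operator cannot guarantee.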
"Force-controlled robotic mechanochemical synthesis", Digital Discovery, 2024, 2130–2136. DOI: 10.1039/D4DD00189C
Aleksandar Kondinski, Pavlo Rutkevych, Laura Pascazio, Dan N. Tran, Feroz Farazi, Srishti Ganguly and Markus Kraft
Zeolites are complex and porous crystalline inorganic materials that serve as hosts for a variety of molecular, ionic and cluster species. Formal, machine-actionable representation of this chemistry presents a challenge as a variety of concepts need to be semantically interlinked. This work demonstrates the potential of knowledge engineering in overcoming this challenge. We develop ontologies OntoCrystal and OntoZeolite, enabling the representation and instantiation of crystalline zeolite information into a dynamic, interoperable knowledge graph called The World Avatar (TWA). In TWA, crystalline zeolite instances are semantically interconnected with chemical species that act as guests in these materials. Information can be obtained via custom or templated SPARQL queries administered through a user-friendly web interface. Unstructured exploration is facilitated through natural language processing using the Marie System, showcasing promise for the blended large language model – knowledge graph approach in providing accurate responses on zeolite chemistry in natural language.
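The triple-based representation and its SPARQL-style querying can be illustrated with a toy in-memory store. The prefixes and predicate names below are invented for the example and do not reflect the actual OntoZeolite or OntoCrystal schema:

```python
# Toy triple store: zeolite frameworks linked to guest species as
# subject-predicate-object triples, queried by pattern matching.
triples = [
    ("zeolite:FAU", "rdf:type",      "onto:ZeoliteFramework"),
    ("zeolite:LTA", "rdf:type",      "onto:ZeoliteFramework"),
    ("zeolite:FAU", "onto:hasGuest", "species:Na+"),
    ("zeolite:FAU", "onto:hasGuest", "species:H2O"),
    ("zeolite:LTA", "onto:hasGuest", "species:Ca2+"),
]

def query(s=None, p=None, o=None):
    # None acts as a wildcard, like an unbound SPARQL variable.
    matches = []
    for subj, pred, obj in triples:
        if (s is None or subj == s) and (p is None or pred == p) \
                and (o is None or obj == o):
            matches.append((subj, pred, obj))
    return matches

# Analogue of: SELECT ?guest WHERE { zeolite:FAU onto:hasGuest ?guest }
guests = [obj for _, _, obj in query(s="zeolite:FAU", p="onto:hasGuest")]
print(guests)
```

A real knowledge graph such as TWA adds the ontology layer (class hierarchies, property constraints) on top of this pattern, which is what makes templated SPARQL queries and LLM-generated queries reliable.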
"Knowledge graph representation of zeolitic crystalline materials", Digital Discovery, 2024, 2070–2084. DOI: 10.1039/D4DD00166D
Daniel Dalland, Linden Schrecker and King Kuok (Mimi) Hii
The ability and desire to collect kinetic data have greatly increased in recent years, requiring more automated and quantitative methods of analysis. In this work, an automated program (Auto-VTNA) is developed to simplify the kinetic analysis workflow. Auto-VTNA determines all reaction orders concurrently, expediting the process of kinetic analysis. It performs well on noisy or sparse data sets and can handle complex reactions involving multiple reaction orders. Quantitative error analysis and facile visualisation allow users to numerically justify and robustly present their findings. Auto-VTNA can be used through a free graphical user interface (GUI), requiring no coding or expert kinetic-model input from the user, and can be customised and built upon if required.
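The underlying variable time normalisation analysis (VTNA) idea, finding the exponent that makes concentration profiles from different catalyst loadings overlay when plotted against normalized time, can be sketched as follows. This is a from-scratch illustration on simulated data, not the Auto-VTNA code:

```python
import numpy as np

# Simulate A -> P with rate = k*[cat]*[A] (first order in catalyst)
# at two catalyst loadings, then recover that order by finding the
# exponent n for which the profiles overlay against t_norm = [cat]^n * t.
k, A0 = 0.5, 1.0
t = np.linspace(0, 10, 200)
loadings = (0.05, 0.10)                      # two catalyst concentrations
profiles = {c: A0 * np.exp(-k * c * t) for c in loadings}

def overlay_error(n):
    c1, c2 = loadings
    tn1, tn2 = c1 ** n * t, c2 ** n * t
    # Compare both runs on a common normalized-time grid.
    grid = np.linspace(0, min(tn1[-1], tn2[-1]), 100)
    y1 = np.interp(grid, tn1, profiles[c1])
    y2 = np.interp(grid, tn2, profiles[c2])
    return np.abs(y1 - y2).mean()

orders = np.linspace(0, 2, 81)
best = orders[int(np.argmin([overlay_error(n) for n in orders]))]
print(f"recovered order in catalyst: {best:.2f}")
```

Determining several orders concurrently, as Auto-VTNA does, generalizes this one-dimensional search to a joint optimization over all normalizing exponents, with an error metric playing the role of `overlay_error`.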
"Auto-VTNA: an automatic VTNA platform for determination of global rate laws", Digital Discovery, 2024, 2118–2129. DOI: 10.1039/D4DD00111G
K. Barakati, Hui Yuan, Amit Goyal and S. V. Kalinin
The rise of electron microscopy has expanded our ability to acquire nanometer- and atomically-resolved images of complex materials. The resulting vast datasets are typically analyzed by human operators, an intrinsically challenging process due to the multiple possible analysis steps and the corresponding need to build and optimize complex analysis workflows. We present a methodology based on the concept of a reward function coupled with Bayesian optimization to optimize image analysis workflows dynamically. The reward function is engineered to closely align with the experimental objectives and broader context and is quantifiable upon completion of the analysis. Here, cross-sectional high-angle annular dark field (HAADF) images of ion-irradiated (Y, Dy)Ba2Cu3O7−δ thin films were used as a model system. The reward functions were formed from the expected material density and atomic spacings and used to drive multi-objective optimization of the classical Laplacian-of-Gaussian (LoG) method. The results can be benchmarked against DCNN segmentation: the optimized LoG* compares favorably against the DCNN in the presence of additional noise. We further extend the reward-function approach to the identification of partially disordered regions, creating a physics-driven reward function and an action space of high-dimensional clustering. We posit that, with a correct definition, the reward-function approach allows real-time optimization of complex analysis workflows at much higher speeds and lower computational costs than classical DCNN-based inference, ensuring results that are both precise and aligned with human-defined objectives.
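The reward-function idea can be sketched on a toy problem: a synthetic signal with a known number of peaks, a one-parameter analysis, and a reward that scores agreement with the expected count. Grid search stands in for the paper's Bayesian optimization, and the signal, threshold detector, and reward definition are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "image line profile" with three known atomic columns.
x = np.linspace(0, 1, 500)
true_centers = [0.2, 0.5, 0.8]
signal = sum(np.exp(-((x - c) / 0.01) ** 2) for c in true_centers)
signal = signal + rng.normal(scale=0.05, size=x.size)

def count_peaks(threshold):
    above = signal > threshold
    # Count rising edges of the thresholded mask.
    return int(np.sum(above[1:] & ~above[:-1]))

def reward(threshold, expected=3):
    # Physics-driven reward: penalize deviation from the expected
    # number of atomic columns (the known "materials density").
    return -abs(count_peaks(threshold) - expected)

thresholds = np.linspace(0.1, 0.9, 33)
best = max(thresholds, key=reward)
print(f"best threshold {best:.2f} finds {count_peaks(best)} peaks")
```

The point is that the analysis parameter is chosen by a quantifiable, objective-aligned reward rather than by eye; replacing the grid search with Bayesian optimization matters once the workflow has many coupled parameters.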
"Physics-based reward driven image analysis in microscopy", Digital Discovery, 2024, 2061–2069. DOI: 10.1039/D4DD00132J
Mohammad Amin Ghanavati, Soroush Ahmadi and Sohrab Rohani
The effectiveness of drug treatments depends significantly on the water solubility of compounds, which influences bioavailability and therapeutic outcomes. A reliable predictive solubility tool enables drug developers to swiftly identify drugs with low solubility and implement proactive solubility enhancement techniques. The current research proposes three predictive models based on four solubility datasets (ESOL, AQUA, PHYS, OCHEM) encompassing 3942 unique molecules. Three different molecular representations were obtained: electrostatic potential (ESP) maps, molecular graphs, and tabular features (extracted from ESP maps and tabular Mordred descriptors). We conducted 3942 DFT calculations to acquire the ESP maps and extract features from them. Subsequently, we applied two deep learning models, EdgeConv and a Graph Convolutional Network (GCN), to the point cloud (ESP) and graph modalities of the molecules. In addition, we applied random forest-based feature selection to the tabular features, followed by mapping with XGBoost. A t-SNE analysis visualized the chemical space across datasets and unique molecules, providing valuable insights for model evaluation. The proposed machine learning (ML) models, trained on 80% of each dataset and evaluated on the remaining 20%, showed superior performance, particularly XGBoost on the extracted and selected tabular features, which yielded average test-set mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2) values of 0.458, 0.613, and 0.918, respectively. Furthermore, an ensemble of the three models improved the error metrics across all datasets, consistently outperforming each individual model. This ensemble model was also tested on the Solubility Challenge 2019, achieving an RMSE of 0.865 and outperforming 37 models whose average RMSE was 1.62. A transferability analysis further indicated robust performance across different datasets. Additionally, SHAP explainability for the feature-based XGBoost model provided transparency in the solubility predictions, enhancing the interpretability of the results.
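The benefit of ensembling reported above can be illustrated with stand-in predictors whose errors are independent; these are random surrogates, not the paper's EdgeConv, GCN, or XGBoost models:

```python
import numpy as np

rng = np.random.default_rng(3)

# "True" log-solubility values and three predictors with independent,
# equally sized errors (a deliberately idealized assumption).
y = rng.normal(size=1000)
models = [y + rng.normal(scale=0.6, size=y.size) for _ in range(3)]

maes = [np.abs(m - y).mean() for m in models]
ensemble = np.mean(models, axis=0)            # simple prediction averaging
mae_ens = np.abs(ensemble - y).mean()

print(f"single-model MAEs: {[round(m, 3) for m in maes]}")
print(f"ensemble MAE:      {mae_ens:.3f}")
```

With independent errors, averaging k predictors shrinks the error standard deviation by roughly 1/sqrt(k); in practice the gain is smaller because real model errors are correlated, but the ensemble still rarely loses to its best member.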
"A machine learning approach for the prediction of aqueous solubility of pharmaceuticals: a comparative model and dataset analysis", Digital Discovery, 2024, 2085–2104. DOI: 10.1039/D4DD00065J
We would like to take this opportunity to thank all of Digital Discovery’s reviewers for helping to preserve quality and integrity in chemical science literature. We would also like to highlight the Outstanding Reviewers for Digital Discovery in 2023.
"Outstanding Reviewers for Digital Discovery in 2023", Digital Discovery, 2024, 1922. DOI: 10.1039/D4DD90037E
Jan Obořil, Christian P. Haas, Maximilian Lübbesmeyer, Rachel Nicholls, Thorsten Gressling, Klavs F. Jensen, Giulio Volpin and Julius Hillenbrand
Reaction screening and high-throughput experimentation (HTE) coupled with liquid chromatography (HPLC and UHPLC) are becoming more important than ever in synthetic chemistry. With a growing number of experiments, it is increasingly difficult to ensure correct peak identification and integration, especially due to unknown side components which often overlap with the peaks of interest. We developed an improved version of the MOCCA Python package with a web-based graphical user interface (GUI) for automated processing of chromatograms, including baseline correction, intelligent peak picking, peak purity checks, deconvolution of overlapping peaks, and compound tracking. The individual automatic processing steps have been improved compared to the previous version of MOCCA to make the software more dependable and versatile. The algorithm accuracy was benchmarked using three datasets and compared to the previous MOCCA implementation and published results. The processing is fully automated with the possibility to include calibration and internal standards. The software supports chromatograms with photo-diode array detector (DAD) data from most commercial HPLC systems, and the Python package and GUI implementation are open-source to allow addition of new features and further development.
"Automated processing of chromatograms: a comprehensive Python package with a GUI for intelligent peak identification and deconvolution in chemical reaction analysis", Digital Discovery, 2024, 10, 2041–2051. DOI: 10.1039/D4DD00214H.
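The MOCCA abstract above lists baseline correction followed by peak picking as the first processing steps. A toy sketch of those two steps on a synthetic chromatogram, using NumPy and SciPy; this is not MOCCA's actual API, and the signal, thresholds, and baseline method are all illustrative assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic single-wavelength chromatogram: two Gaussian peaks
# (retention times 3.0 and 6.5 min) on a drifting linear baseline.
t = np.linspace(0, 10, 1000)
signal = (np.exp(-((t - 3.0) ** 2) / 0.02)
          + 0.6 * np.exp(-((t - 6.5) ** 2) / 0.05)
          + 0.05 * t)  # linear baseline drift

# Crude baseline correction: interpolate a line between the signal
# values at the chromatogram edges and subtract it.
baseline = np.interp(t, [t[0], t[-1]], [signal[0], signal[-1]])
corrected = signal - baseline

# Peak picking with height and prominence thresholds to reject noise.
peaks, props = find_peaks(corrected, height=0.1, prominence=0.1)
print(t[peaks])  # retention times of the detected peaks
```

Real chromatograms need more careful baseline models (e.g. asymmetric least squares) and deconvolution of overlapping peaks, which is what packages like MOCCA automate.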